
Scaling Laws for Neural Language Models

Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei.

Published: Jan 23, 2020 (arXiv)


Abstract

The paper studies empirical scaling laws relating language model performance (cross-entropy loss) to:

  • Model size (N)
  • Dataset size (D)
  • Total training compute (C)

Across more than 7 orders of magnitude in scale, the loss follows smooth power-law trends. All models examined (mostly Transformers) exhibit predictable scaling behavior.


Key Concepts

Performance Measures

  • Loss (L): average cross-entropy (in nats) over held-out text.
  • Dataset size (D): number of tokens used for training.
  • Model size (N): number of parameters excluding embeddings.
  • Compute budget (C): estimated total FLOPs used in training.
  • Critical batch size \(B_{\text{crit}}\): the batch size at which the tradeoff between training time and compute efficiency balances; it scales as a power law in the loss.
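The paper's rule of thumb ties these quantities together: each training token costs roughly \(6N\) FLOPs (forward plus backward pass), so \(C \approx 6ND\). A minimal sketch (the model and dataset sizes below are illustrative, not from the paper):

```python
# Rough training-compute estimate using the paper's C ≈ 6 * N * D rule of thumb
# (about 6 FLOPs per non-embedding parameter per training token).

def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training FLOPs for a model with n_params
    non-embedding parameters trained on n_tokens tokens."""
    return 6.0 * n_params * n_tokens

# Example: a hypothetical 1.5B-parameter model on 100B tokens.
c = training_flops(1.5e9, 100e9)
print(f"{c:.2e} FLOPs")  # 9.00e+20 FLOPs
```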

Empirical Scaling Laws

1) Dependence on Model Size

When the dataset is sufficiently large and training is run to convergence:

\[ L(N) \propto \frac{1}{N^{\alpha_N}} \]
  • \(\alpha_N \approx 0.076\)
  • Loss decreases smoothly as the parameter count grows.
  • Shape (depth vs. width) matters very little at fixed \(N\).
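To make the power law concrete, a short sketch of \(L(N) = (N_c/N)^{\alpha_N}\); \(\alpha_N\) is the paper's exponent, while `N_C` below is an illustrative placeholder rather than the paper's fitted constant:

```python
# Model-size power law: L(N) = (N_c / N)^alpha_N.
# alpha_N ≈ 0.076 comes from the paper; N_C is an illustrative placeholder.

ALPHA_N = 0.076
N_C = 1e14  # placeholder constant, NOT the paper's fitted value

def loss_from_params(n: float) -> float:
    return (N_C / n) ** ALPHA_N

# Doubling model size always shrinks loss by the same factor, 2**-alpha_N:
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio per doubling: {ratio:.3f}")  # ≈ 0.949, i.e. ~5% lower loss
```

Because the law is a pure power law, the fractional improvement per doubling is independent of the placeholder constant.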

2) Dependence on Dataset Size

With sufficiently large models trained with early stopping:

\[ L(D) \propto \frac{1}{D^{\alpha_D}} \]
  • \(\alpha_D \approx 0.095\)
  • Doubling the dataset yields a predictable, roughly constant fractional improvement in loss.
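Inverting the power law shows how much extra data a given loss reduction costs. A small sketch using the paper's \(\alpha_D\) (the helper function is hypothetical):

```python
# Dataset-size power law: L(D) ∝ D**-alpha_D, with alpha_D ≈ 0.095 (from the paper).
# Inverting it gives the data multiplier needed for a target fractional loss reduction.

ALPHA_D = 0.095

def data_multiplier(loss_ratio: float) -> float:
    """Factor by which D must grow so that loss shrinks by `loss_ratio`
    (e.g. 0.9 means a 10% lower loss)."""
    return loss_ratio ** (-1.0 / ALPHA_D)

print(f"{data_multiplier(0.9):.2f}x data for 10% lower loss")  # ≈ 3.03x
```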

3) Dependence on Compute

When compute is the limiting resource and model size, batch size, and training duration are tuned optimally:

\[ L(C_{\text{min}}) \propto \frac{1}{C_{\text{min}}^{\alpha_C^{\text{min}}}} \]
  • \(\alpha_C^{\text{min}} \approx 0.050\)
  • Under a fixed compute budget, the best loss comes from training a very large model stopped well short of convergence.

Sample Efficiency Insights

  • Larger models tend to be more sample-efficient—achieving better loss with fewer examples processed.
  • Compute-efficient training stops well short of convergence; training a smaller model all the way to convergence wastes compute.

Universal Overfitting Behavior

The paper fits a single equation for the joint dependence of loss on model and data size, which quantifies overfitting when \(N\) and \(D\) are not scaled in tandem:

\[ L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} \]
  • \(N_c\) and \(D_c\) are constants fitted from data.
  • The overfitting penalty can thus be predicted analytically from \(N\) and \(D\).
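The paper's fitted joint form, \(L(N,D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}\), can be evaluated directly. A sketch with the paper's exponents but placeholder constants (the fitted \(N_c, D_c\) values are not reproduced here):

```python
# Joint scaling law from the paper:
#   L(N, D) = [ (N_c / N)**(alpha_N / alpha_D) + D_c / D ] ** alpha_D
# The alpha exponents are the paper's; N_C and D_C below are
# illustrative placeholders, not the paper's fitted constants.

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 1e14, 5e13  # placeholder constants

def joint_loss(n: float, d: float) -> float:
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# With D fixed, growing N eventually stops helping: the D_c/D term
# (the overfitting penalty) dominates the bracket.
print(joint_loss(1e8, 1e10), joint_loss(1e10, 1e10))
```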

Optimal Compute Allocation

The study derives approximate relationships demonstrating how to best spend a compute budget (C):

  • Increase model size rather than training a small model for longer.
  • Grow the dataset sublinearly with model size to avoid wasted compute.
  • Early stopping is typically optimal.
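The paper's fits imply that as the compute budget grows, parameters should grow roughly as \(N \propto C^{0.73}\) while the data processed grows only as \(D \propto C^{0.27}\). A sketch of that allocation (the proportionality constants are omitted, so only multipliers are computed):

```python
# Approximate compute-optimal allocation from the paper's fits:
# for a compute budget C, parameters scale as N ∝ C**0.73 while the
# data processed scales only as D ∝ C**0.27 — model size grows much faster.

def optimal_allocation(c_ratio: float) -> tuple[float, float]:
    """Given a multiplier on the compute budget, return the implied
    multipliers on model size and on data processed."""
    return c_ratio ** 0.73, c_ratio ** 0.27

n_mult, d_mult = optimal_allocation(10.0)  # 10x more compute
print(f"model x{n_mult:.1f}, data x{d_mult:.1f}")  # ≈ model x5.4, data x1.9
```

Note the exponents sum to 1, consistent with \(C \approx 6ND\): the model-size and data multipliers together account for the full compute multiplier.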

Architecture Effects

  • Changing width, depth, or attention heads yields small effects compared to scaling total parameter count.
  • Within studied ranges, scaling dominates architectural tweaks.

Summary of Main Scaling Exponents

Factor             Exponent   Effect
Model size (N)     ~0.076     Larger models → lower loss
Dataset size (D)   ~0.095     More data → lower loss
Compute (C)        ~0.050     More compute → lower loss

(Exponents vary slightly with dataset and training configuration.)


Conclusions

  • Language modeling loss scales smoothly and predictably with model size, data size, and compute.
  • Optimal use of resources favors very large models that are stopped well before full convergence, rather than smaller models trained exhaustively on enormous data.
  • These scaling laws have major implications for efficiently training large language models.

Why It Matters

These findings informed later work such as Chinchilla ("Training Compute-Optimal Large Language Models"), reinforcing that scaling model size, data, and compute matters more than training to convergence or architectural fine-tuning alone.