Scaling Laws for Neural Language Models¶
Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei.
Published: Jan 23, 2020 (arXiv)
Abstract¶
The paper studies empirical scaling laws relating language model performance (cross-entropy loss) to:
- Model size (N)
- Dataset size (D)
- Total training compute (C)
Across more than 7 orders of magnitude in scale, the loss follows smooth power-law trends. All models examined (mostly Transformers) exhibit predictable scaling behavior.
Key Concepts¶
Performance Measures¶
- Loss (L): average cross-entropy (in nats) over held-out text.
- Dataset size (D): number of tokens used for training.
- Model size (N): number of parameters excluding embeddings.
- Compute budget (C): estimated total FLOPs used in training.
- Critical batch size \((B_{\text{crit}})\): batch scale where inefficiencies balance.
Empirical Scaling Laws¶
1) Dependence on Model Size¶
When the dataset is large enough not to be a bottleneck and training runs to convergence:
- \( L(N) = (N_c / N)^{\alpha_N} \), with \( \alpha_N \approx 0.076 \)
- Loss decreases smoothly as the parameter count grows.
- Shape (depth vs. width) matters very little at fixed \(N\).
2) Dependence on Dataset Size¶
With sufficiently large models, trained on limited data with early stopping:
- \( L(D) = (D_c / D)^{\alpha_D} \), with \( \alpha_D \approx 0.095 \)
- Doubling the dataset yields a predictable improvement in loss.
3) Dependence on Compute¶
When compute is the binding constraint and model size, batch size, and training duration are tuned optimally:
- \( L(C_{\text{min}}) = (C_c^{\text{min}} / C_{\text{min}})^{\alpha_C^{\text{min}}} \), with \( \alpha_C^{\text{min}} \approx 0.050 \)
- Under a fixed compute budget, the best loss comes from training very large models and stopping well short of full convergence.
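Each of the three limits above follows the same functional form, \( L(x) = (x_c / x)^{\alpha} \). A sketch evaluating it with the paper's fitted constants (values approximate; the example sizes are illustrative):

```python
def power_law_loss(x: float, x_c: float, alpha: float) -> float:
    """Single-variable scaling law: L(x) = (x_c / x) ** alpha (loss in nats)."""
    return (x_c / x) ** alpha

# Fitted constants reported in the paper (approximate):
N_C, ALPHA_N = 8.8e13, 0.076   # model size, non-embedding parameters
D_C, ALPHA_D = 5.4e13, 0.095   # dataset size, tokens

# Doubling the dataset multiplies the loss by 2 ** -ALPHA_D ≈ 0.94,
# i.e. a predictable ~6% reduction, independent of the starting point.
print(power_law_loss(1e9, N_C, ALPHA_N))    # predicted loss, 1B-param model
print(power_law_loss(2e10, D_C, ALPHA_D))   # predicted loss, 20B tokens
```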
Sample Efficiency Insights¶
- Larger models tend to be more sample-efficient—achieving better loss with fewer examples processed.
- Optimal training does not mean convergence: for the best compute efficiency, stop training well before the loss plateaus.
Universal Overfitting Behavior¶
The paper shows that overfitting is governed jointly by model size and dataset size through a single combined law:
- \( L(N, D) = \left[ \left( \frac{N_c}{N} \right)^{\alpha_N / \alpha_D} + \frac{D_c}{D} \right]^{\alpha_D} \)
- \((N_c, D_c)\) are constants fitted from the data.
- This gives an analytic approximation of the overfitting penalty; avoiding it requires growing the dataset roughly as \( D \propto N^{\alpha_N / \alpha_D} \approx N^{0.74} \).
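A minimal sketch evaluating the paper's combined loss law \(L(N, D)\), using the fitted constants reported there (values approximate):

```python
# Combined law: L(N, D) = [(N_c / N)**(a_N / a_D) + D_c / D]**a_D.
# It reduces to the pure-N power law as D grows, and quantifies the
# overfitting penalty when D is too small for a given N.
ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 8.8e13, 5.4e13  # fitted constants (approximate)

def loss_n_d(n_params: float, n_tokens: float) -> float:
    return ((N_C / n_params) ** (ALPHA_N / ALPHA_D) + D_C / n_tokens) ** ALPHA_D

# Shrinking the dataset at fixed model size raises the predicted loss:
print(loss_n_d(1e9, 1e12))  # ample data
print(loss_n_d(1e9, 1e9))   # data-starved: higher loss (overfitting penalty)
```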
Optimal Compute Allocation¶
The study derives approximate relationships demonstrating how to best spend a compute budget (C):
- Spend most of any compute increase on model size rather than on training a smaller model for longer.
- Grow the dataset only sublinearly with model size, so compute is not wasted on data the model cannot exploit.
- Stop training well before convergence; under a fixed budget this is compute-optimal.
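The paper's fitted allocation exponents, approximately \(N \propto C^{0.73}\), batch size \(B \propto C^{0.24}\), and serial steps \(S \propto C^{0.03}\), make this concrete. A sketch of the relative scaling (ratios only, so no proportionality constants are needed):

```python
def optimal_allocation(c_ratio: float) -> dict:
    """How to scale N, B, and S when the compute budget grows by c_ratio.

    Exponents are the paper's fitted values (approximate); the returned
    values are multiplicative factors relative to the original setup.
    """
    return {
        "model_size":   c_ratio ** 0.73,  # N ∝ C^0.73
        "batch_size":   c_ratio ** 0.24,  # B ∝ C^0.24
        "serial_steps": c_ratio ** 0.03,  # S ∝ C^0.03
    }

# A 10x compute budget: grow the model ~5.4x, the batch ~1.7x,
# and the number of serial steps only ~1.07x.
print(optimal_allocation(10.0))
```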
Architecture Effects¶
- Changing width, depth, or attention heads yields small effects compared to scaling total parameter count.
- Within studied ranges, scaling dominates architectural tweaks.
Summary of Main Scaling Exponents¶
| Factor | Exponent | Effect |
|---|---|---|
| Model size (N) | ~0.076 | Larger models → lower loss |
| Dataset size (D) | ~0.095 | More data → lower loss |
| Compute (C) | ~0.050 | More compute → lower loss |
(Quantities vary slightly depending on dataset and training configurations.)
Conclusions¶
- Language modeling loss scales smoothly and predictably with model size, data size, and compute.
- Optimal use of resources implies using very large models and not training them to full convergence with enormous data.
- These scaling laws have major implications for efficiently training large language models.
Why It Matters¶
These findings informed later work such as Chinchilla (*Training Compute-Optimal Large Language Models*), reinforcing that how scale is allocated matters more than training to convergence or hyperparameter optimization alone.