
Scaling Laws for Neural Language Models

Authors: Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei.

Published: Jan 23, 2020 (arXiv)


Abstract

The paper studies empirical scaling laws relating language model performance (cross-entropy loss) to:

  • Model size (N)
  • Dataset size (D)
  • Total training compute (C)

Across more than 7 orders of magnitude in scale, the loss follows smooth power-law trends. All models examined (mostly Transformers) exhibit predictable scaling behavior.


Key Concepts

Performance Measures

  • Loss (L): average cross-entropy (in nats) over held-out text.
  • Dataset size (D): number of tokens used for training.
  • Model size (N): number of parameters excluding embeddings.
  • Compute budget (C): estimated total FLOPs used in training.
  • Critical batch size \(B_{\text{crit}}\): the batch size at which the tradeoff between training time and compute efficiency balances; it scales as a power law in the loss.
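The paper's rule of thumb ties these quantities together: each training token costs roughly \(6N\) FLOPs (forward plus backward pass), so \(C \approx 6ND\). A minimal sketch (the model and dataset sizes below are illustrative, not from the paper):

```python
# Rough training-compute estimate using the paper's C ≈ 6 * N * D rule of thumb
# (about 6 FLOPs per non-embedding parameter per training token).

def training_flops(n_params: float, n_tokens: float) -> float:
    """Estimate total training FLOPs for a model with n_params
    non-embedding parameters trained on n_tokens tokens."""
    return 6.0 * n_params * n_tokens

# Example: a hypothetical 1.5B-parameter model on 100B tokens.
c = training_flops(1.5e9, 100e9)
print(f"{c:.2e} FLOPs")  # 9.00e+20 FLOPs
```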

Empirical Scaling Laws

1) Dependence on Model Size

When the dataset is sufficiently large and training is run to convergence:

\[ L(N) \propto \frac{1}{N^{\alpha_N}} \]
  • \(\alpha_N \approx 0.076\)
  • Loss decreases smoothly as the parameter count grows.
  • Shape (depth vs. width) matters very little at fixed \(N\).
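To make the power law concrete, a short sketch of \(L(N) = (N_c/N)^{\alpha_N}\); \(\alpha_N\) is the paper's exponent, while `N_C` below is an illustrative placeholder rather than the paper's fitted constant:

```python
# Model-size power law: L(N) = (N_c / N)^alpha_N.
# alpha_N ≈ 0.076 comes from the paper; N_C is an illustrative placeholder.

ALPHA_N = 0.076
N_C = 1e14  # placeholder constant, NOT the paper's fitted value

def loss_from_params(n: float) -> float:
    return (N_C / n) ** ALPHA_N

# Doubling model size always shrinks loss by the same factor, 2**-alpha_N:
ratio = loss_from_params(2e9) / loss_from_params(1e9)
print(f"loss ratio per doubling: {ratio:.3f}")  # ≈ 0.949, i.e. ~5% lower loss
```

Because the law is a pure power law, the fractional improvement per doubling is independent of the placeholder constant.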

2) Dependence on Dataset Size

With sufficiently large models trained with early stopping:

\[ L(D) \propto \frac{1}{D^{\alpha_D}} \]
  • \(\alpha_D \approx 0.095\)
  • Doubling the dataset yields a predictable, roughly constant fractional improvement in loss.
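Inverting the power law shows how much extra data a given loss reduction costs. A small sketch using the paper's \(\alpha_D\) (the helper function is hypothetical):

```python
# Dataset-size power law: L(D) ∝ D**-alpha_D, with alpha_D ≈ 0.095 (from the paper).
# Inverting it gives the data multiplier needed for a target fractional loss reduction.

ALPHA_D = 0.095

def data_multiplier(loss_ratio: float) -> float:
    """Factor by which D must grow so that loss shrinks by `loss_ratio`
    (e.g. 0.9 means a 10% lower loss)."""
    return loss_ratio ** (-1.0 / ALPHA_D)

print(f"{data_multiplier(0.9):.2f}x data for 10% lower loss")  # ≈ 3.03x
```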

3) Dependence on Compute

When compute is the limiting resource and model size, batch size, and training duration are tuned optimally:

\[ L(C_{\text{min}}) \propto \frac{1}{C_{\text{min}}^{\alpha_C^{\text{min}}}} \]
  • \(\alpha_C^{\text{min}} \approx 0.050\)
  • Under a fixed compute budget, the best loss comes from training a very large model stopped well short of convergence.

Sample Efficiency Insights

  • Larger models tend to be more sample-efficient—achieving better loss with fewer examples processed.
  • Compute-efficient training stops well short of convergence; training a smaller model all the way to convergence wastes compute.

Universal Overfitting Behavior

The paper fits a single equation for the joint dependence of loss on model and data size, which quantifies overfitting when \(N\) and \(D\) are not scaled in tandem:

\[ L(N, D) = \left[\left(\frac{N_c}{N}\right)^{\alpha_N / \alpha_D} + \frac{D_c}{D}\right]^{\alpha_D} \]
  • \(N_c\) and \(D_c\) are constants fitted from data.
  • The overfitting penalty can thus be predicted analytically from \(N\) and \(D\).
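The paper's fitted joint form, \(L(N,D) = [(N_c/N)^{\alpha_N/\alpha_D} + D_c/D]^{\alpha_D}\), can be evaluated directly. A sketch with the paper's exponents but placeholder constants (the fitted \(N_c, D_c\) values are not reproduced here):

```python
# Joint scaling law from the paper:
#   L(N, D) = [ (N_c / N)**(alpha_N / alpha_D) + D_c / D ] ** alpha_D
# The alpha exponents are the paper's; N_C and D_C below are
# illustrative placeholders, not the paper's fitted constants.

ALPHA_N, ALPHA_D = 0.076, 0.095
N_C, D_C = 1e14, 5e13  # placeholder constants

def joint_loss(n: float, d: float) -> float:
    return ((N_C / n) ** (ALPHA_N / ALPHA_D) + D_C / d) ** ALPHA_D

# With D fixed, growing N eventually stops helping: the D_c/D term
# (the overfitting penalty) dominates the bracket.
print(joint_loss(1e8, 1e10), joint_loss(1e10, 1e10))
```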

Optimal Compute Allocation

The study derives approximate relationships demonstrating how to best spend a compute budget (C):

  • Increase model size rather than training a small model for longer.
  • Grow the dataset sublinearly with model size to avoid wasted compute.
  • Early stopping is typically optimal.
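The paper's fits imply that as the compute budget grows, parameters should grow roughly as \(N \propto C^{0.73}\) while the data processed grows only as \(D \propto C^{0.27}\). A sketch of that allocation (the proportionality constants are omitted, so only multipliers are computed):

```python
# Approximate compute-optimal allocation from the paper's fits:
# for a compute budget C, parameters scale as N ∝ C**0.73 while the
# data processed scales only as D ∝ C**0.27 — model size grows much faster.

def optimal_allocation(c_ratio: float) -> tuple[float, float]:
    """Given a multiplier on the compute budget, return the implied
    multipliers on model size and on data processed."""
    return c_ratio ** 0.73, c_ratio ** 0.27

n_mult, d_mult = optimal_allocation(10.0)  # 10x more compute
print(f"model x{n_mult:.1f}, data x{d_mult:.1f}")  # ≈ model x5.4, data x1.9
```

Note the exponents sum to 1, consistent with \(C \approx 6ND\): the model-size and data multipliers together account for the full compute multiplier.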

Architecture Effects

  • Changing width, depth, or attention heads yields small effects compared to scaling total parameter count.
  • Within studied ranges, scaling dominates architectural tweaks.

Summary of Main Scaling Exponents

Factor             Exponent   Effect
Model size (N)     ~0.076     Larger models → lower loss
Dataset size (D)   ~0.095     More data → lower loss
Compute (C)        ~0.050     More compute → lower loss

(Exponents vary slightly with dataset and training configuration.)


Conclusions

  • Language modeling loss scales smoothly and predictably with model size, data size, and compute.
  • Optimal use of resources favors very large models that are stopped well before full convergence, rather than smaller models trained exhaustively on enormous data.
  • These scaling laws have major implications for efficiently training large language models.

Why It Matters

These findings informed later work such as Chinchilla ("Training Compute-Optimal Large Language Models"), reinforcing that scaling model size, data, and compute matters more than training to convergence or architectural fine-tuning alone.