For years the default instinct was simple: make the model bigger. This paper argues that for a fixed compute budget, parameters and training tokens should scale together in roughly equal measure, and that many headline models were trained on far too little data for their size.
The core finding
By fitting loss curves across hundreds of training runs, the authors estimate the compute-optimal ratio of training tokens to model parameters. The takeaway: a smaller model trained on more data can outperform a larger, undertrained one at the same training compute.
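To make the trade-off above concrete, here is a minimal sketch of how a fixed budget could be split between parameters and tokens. It relies on two assumptions that this summary does not state: the common rule of thumb that training costs about 6 FLOPs per parameter per token, and a tokens-per-parameter ratio supplied by the caller (the value 20.0 below is an illustrative placeholder, not a figure taken from the text). The function names are hypothetical.

```python
import math

# Assumption: training compute C is approximated as 6 * N * D FLOPs,
# where N is the parameter count and D is the number of training tokens.
FLOPS_PER_PARAM_TOKEN = 6.0


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute for a model of `params` on `tokens`."""
    return FLOPS_PER_PARAM_TOKEN * params * tokens


def compute_optimal_split(compute_flops: float, tokens_per_param: float):
    """Split a compute budget into (params, tokens) at a fixed ratio.

    With D = r * N and C = 6 * N * D, we get N = sqrt(C / (6 * r)).
    Both N and D therefore grow with the square root of C, which is
    one way to read "scale together in roughly equal measure".
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    ratio = 20.0  # hypothetical placeholder ratio, for illustration only
    for budget in (1e21, 4e21):
        n, d = compute_optimal_split(budget, ratio)
        print(f"C={budget:.1e} FLOPs -> params={n:.3e}, tokens={d:.3e}")
```

Under these assumptions, quadrupling the budget doubles both the parameter count and the token count, rather than putting all of the extra compute into a larger model.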
Why it matters
The result reframes efficiency. Inference cost scales with parameter count, so a compute-optimal smaller model is also cheaper to serve: a rare win-win for labs and product teams alike.