# Thoughts on Neural Scaling Laws
When you take a zoomed-out view of what makes neural networks succeed, you see it all revolves around the scaling laws: empirical observations that performance improves predictably with increased model size, dataset size, and compute.
The specifics of how these laws apply vary across modalities and architectures, which shows up in the empirical equations used to fit them.
Yet they all rest on the same three factors - data, model size, and compute. Each factor has its own sub-dependencies: dataset size and quality, parameter count and architecture, and the number of GPUs together with the efficiency of the compute kernels.
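For concreteness, the single-variable laws fitted in Kaplan et al. (the first reference below) take roughly this power-law form, where $N$ is parameter count, $D$ is dataset size in tokens, $C$ is compute, and $N_c$, $D_c$, $C_c$, $\alpha_N$, $\alpha_D$, $\alpha_C$ are empirically fitted constants:

$$
L(N) = \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad
L(D) = \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad
L(C) = \left(\frac{C_c}{C}\right)^{\alpha_C}
$$

Different modalities and architectures change the fitted constants and exponents, but the power-law shape keeps reappearing.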
As research on these laws progresses, new scaling laws are emerging that apply in quite different ways than the usual regime. Recent small, local LLMs (Phi-3, Gemma 2B, LLMs in a flash) are a good example: compact models trained on smaller but richer, higher-quality data can beat much larger models.
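And because these are empirical laws, fitting one to a new regime is straightforward. Here is a minimal sketch, assuming NumPy and entirely made-up placeholder loss measurements, of how one might estimate the exponent of a single-variable law from (model size, loss) pairs:

```python
import numpy as np

# Hypothetical (parameter count, validation loss) points -- placeholder
# values for illustration only; substitute measurements from real runs.
n_params = np.array([1e7, 1e8, 1e9, 1e10])
losses = np.array([4.2, 3.5, 2.9, 2.4])

# A single-variable power law L(N) = (N_c / N) ** alpha is linear in
# log space: log L = alpha * log N_c - alpha * log N.
slope, intercept = np.polyfit(np.log(n_params), np.log(losses), deg=1)
alpha = -slope
n_c = np.exp(intercept / alpha)
print(f"fitted alpha_N ~ {alpha:.3f}, N_c ~ {n_c:.3g}")

# Extrapolate the fitted law to a larger (hypothetical) model size.
pred = (n_c / 1e11) ** alpha
print(f"predicted loss at 1e11 params: {pred:.2f}")
```

Fitting in log space keeps the regression well conditioned despite parameter counts spanning several orders of magnitude.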
I look forward to the singularity moment, when these laws come full circle and meet back where it all began :)
References:
- Scaling Laws for Neural Language Models: https://arxiv.org/pdf/2001.08361
- Scaling Laws for Autoregressive Generative Modeling: https://arxiv.org/abs/2010.14701
- LLMs in a flash: https://arxiv.org/abs/2312.11514
- Phi-3 Technical Report: https://arxiv.org/abs/2404.14219
- Gemma 2B: https://arxiv.org/pdf/2403.08295