Variance Control via Weight Rescaling in LLM Pre-training
Abstract
The outcome of Large Language Model (LLM) pre-training strongly depends on weight initialization and variance control strategies. Although the importance of initial variance control has been well documented in neural networks in general, the literature on initialization and on managing variance growth during LLM pre-training specifically remains somewhat sparse. In this paper, we introduce the Layer Index Rescaling (LIR) weight initialization scheme and the Target Variance Rescaling (TVR) variance control strategy. Experiments on a 1B-parameter LLaMA model demonstrate that better variance management using these techniques yields substantial improvements in downstream task performance (up to 4.6% on common pre-training benchmarks) and reduces extreme activation values, thus mitigating challenges associated with quantization and low-precision training. Our code is available at: https://github.com/bluorion-com/weight_rescaling.
Community
Controlling Weight Variance for Better LLM Performance
We trained over 40 one-billion-parameter LLaMA models for 100 Billion Tokens and discovered that controlling weight variance at initialization and during pre-training is crucial for improving downstream task performance, leading to gains of up to 4.6% on common benchmarks!
To achieve this, we introduce two techniques (illustrative sketches below):
✅ Layer Index Rescaling (LIR): a weight initialization scheme
✅ Target Variance Rescaling (TVR): a variance control strategy
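This page does not spell out the exact LIR rule, so the snippet below is only a minimal PyTorch sketch under the assumption that LIR shrinks each layer's initialization standard deviation by a factor tied to its 1-based layer index (here 1/sqrt(layer_index)); the base std of 0.02 and the helper name `lir_init_` are illustrative choices, not taken from the paper.

```python
import math
from torch import nn

def lir_init_(linear: nn.Linear, layer_index: int, base_std: float = 0.02) -> None:
    """Hypothetical LIR-style init: deeper layers get a smaller init std so that
    variance does not compound as activations flow through the residual stream."""
    std = base_std / math.sqrt(layer_index)  # assumed layer-index rescaling rule
    nn.init.normal_(linear.weight, mean=0.0, std=std)
    if linear.bias is not None:
        nn.init.zeros_(linear.bias)

# Example: apply to the projection layers of a toy 4-layer decoder stack.
layers = [nn.Linear(1024, 1024) for _ in range(4)]
for idx, layer in enumerate(layers, start=1):
    lir_init_(layer, layer_index=idx)
```

Scaling down the initialization of deeper layers is in the same spirit as the residual-branch scaling used in GPT-2-style models; the exact factor used by LIR may differ.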
Beyond performance gains, these techniques also help reduce extreme activation values, mitigating risks in quantization and low-precision training for LLMs.
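The exact TVR procedure is likewise not given on this page; the snippet below is a minimal sketch assuming TVR periodically rescales each weight matrix so that its empirical standard deviation returns to a fixed target. The `target_std` value, the rescaling interval, and the function name `tvr_rescale_` are all illustrative assumptions, not values from the paper.

```python
import torch
from torch import nn

@torch.no_grad()
def tvr_rescale_(model: nn.Module, target_std: float = 0.02) -> None:
    """Hypothetical TVR-style step: multiply each weight matrix by
    target_std / current_std, pulling its variance back to the target
    and limiting variance growth during pre-training."""
    for param in model.parameters():
        if param.dim() < 2:  # skip biases and norm gains
            continue
        current_std = param.std()
        if current_std > 0:
            param.mul_(target_std / current_std)

# Example: a toy model; during training one would call this every K optimizer steps.
toy = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024))
tvr_rescale_(toy, target_std=0.02)
```

Keeping weight variance near a fixed target throughout training is one plausible way to limit the growth of extreme activation values mentioned above.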
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- Scale-Distribution Decoupling: Enabling Stable and Effective Training of Large Language Models (2025)
- HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization (2025)
- AdaGC: Improving Training Stability for Large Language Model Pretraining (2025)
- Binary Neural Networks for Large Language Model: A Survey (2025)
- Peri-LN: Revisiting Layer Normalization in the Transformer Architecture (2025)
- A Good Start Matters: Enhancing Continual Learning with Data-Driven Weight Initialization (2025)
- Hyperspherical Normalization for Scalable Deep Reinforcement Learning (2025)