
Parallel Scaling Law for Language Models

Yet Another Scaling Law beyond Parameters and Inference-Time Scaling

Paper · Hugging Face · GitHub

Checkpoints

All the released checkpoints were trained on public datasets and are for academic use only.

✨ marks our recommended strong models.

Base models for scaling training data to 1T tokens

These models are competitive with existing small models, including SmolLM, Gemma, and Llama-3.2 (see Table 4 of the paper for details).

| Model | Description | Download |
| --- | --- | --- |
| ParScale-1.8B-P1 ✨ | Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1 |
| ParScale-1.8B-P2 ✨ | ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2 |
| ParScale-1.8B-P4 ✨ | ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4 |
| ParScale-1.8B-P8 ✨ | ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8 |
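
As a quick sanity check, below is a minimal loading sketch with 🤗 Transformers. The checkpoint name is just one repo from the table, and we assume the repos ship custom ParScale modeling code, hence `trust_remote_code=True`; adjust if your Transformers version supports the architecture natively.

```python
# Minimal sketch: load a ParScale base checkpoint and generate a short continuation.
# Assumption: the repo provides custom modeling code, so trust_remote_code=True is required.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ParScale/ParScale-1.8B-P2"  # any base checkpoint from the table above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

inputs = tokenizer("Parallel scaling improves model capability by", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```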

Instruct models for scaling training data to 1T tokens

We post-trained the base models above on SmolTalk-1M to enable conversational capabilities.

| Model | Description | Download |
| --- | --- | --- |
| ParScale-1.8B-P1-Inst ✨ | Baseline $P=1$ | 🤗 ParScale/ParScale-1.8B-P1-Inst |
| ParScale-1.8B-P2-Inst ✨ | ParScale $P=2$ | 🤗 ParScale/ParScale-1.8B-P2-Inst |
| ParScale-1.8B-P4-Inst ✨ | ParScale $P=4$ | 🤗 ParScale/ParScale-1.8B-P4-Inst |
| ParScale-1.8B-P8-Inst ✨ | ParScale $P=8$ | 🤗 ParScale/ParScale-1.8B-P8-Inst |
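
A minimal single-turn chat sketch for the instruct checkpoints follows; it assumes the tokenizer ships a chat template from the SmolTalk post-training and that `trust_remote_code=True` is needed, as for the base models.

```python
# Minimal sketch: single-turn chat with a ParScale instruct checkpoint.
# Assumptions: the tokenizer defines a chat template, and the repo needs trust_remote_code=True.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "ParScale/ParScale-1.8B-P4-Inst"  # any instruct checkpoint from the table above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

messages = [{"role": "user", "content": "Explain parallel scaling in one sentence."}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```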

Continual Pretraining of Qwen-2.5-3B

We froze the parameters of Qwen-2.5-3B and fine-tuned only the newly introduced parameters on Stack-V2-Python. Because the following models share the same backbone parameters as Qwen-2.5-3B, they have the potential for dynamic ParScale: switching $P$ at inference time to adjust model capability (see the illustrative sketch after the tables below).

| Model | Description | Download |
| --- | --- | --- |
| ParScale-Qwen-3B-P2-Python ✨ | ParScale $P=2$ | 🤗 ParScale/ParScale-Qwen-3B-P2-Python |
| ParScale-Qwen-3B-P4-Python ✨ | ParScale $P=4$ | 🤗 ParScale/ParScale-Qwen-3B-P4-Python |
| ParScale-Qwen-3B-P8-Python ✨ | ParScale $P=8$ | 🤗 ParScale/ParScale-Qwen-3B-P8-Python |

  • For full pretraining on Stack-V2-Python:

| Model | Description | Download |
| --- | --- | --- |
| ParScale-QwenInit-3B-P1-Python | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Python |
| ParScale-QwenInit-3B-P2-Python | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Python |
| ParScale-QwenInit-3B-P4-Python | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Python |
| ParScale-QwenInit-3B-P8-Python | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Python |

  • For full pretraining on Pile:

| Model | Description | Download |
| --- | --- | --- |
| ParScale-QwenInit-3B-P1-Pile | Baseline $P=1$ | 🤗 ParScale/ParScale-QwenInit-3B-P1-Pile |
| ParScale-QwenInit-3B-P2-Pile | ParScale $P=2$ | 🤗 ParScale/ParScale-QwenInit-3B-P2-Pile |
| ParScale-QwenInit-3B-P4-Pile | ParScale $P=4$ | 🤗 ParScale/ParScale-QwenInit-3B-P4-Pile |
| ParScale-QwenInit-3B-P8-Pile | ParScale $P=8$ | 🤗 ParScale/ParScale-QwenInit-3B-P8-Pile |
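
To illustrate the dynamic ParScale idea mentioned above, here is a deliberately simple sketch: it picks a whole $P$-specific checkpoint based on a coarse compute budget rather than hot-swapping only the introduced parameters, and the budget-to-$P$ mapping is made up purely for illustration.

```python
# Illustrative sketch of dynamic ParScale: choose a P at inference time based on a
# (hypothetical) compute budget and load the matching continual-pretraining checkpoint.
# Hot-swapping only the P-specific parameters over the shared frozen backbone would be
# more efficient, but is omitted here for simplicity.
from transformers import AutoModelForCausalLM

P_CHECKPOINTS = {
    2: "ParScale/ParScale-Qwen-3B-P2-Python",
    4: "ParScale/ParScale-Qwen-3B-P4-Python",
    8: "ParScale/ParScale-Qwen-3B-P8-Python",
}

def load_for_budget(budget: str):
    """Map a coarse compute budget to a parallel count P and load that checkpoint."""
    p = {"low": 2, "medium": 4, "high": 8}[budget]  # illustrative thresholds
    return AutoModelForCausalLM.from_pretrained(P_CHECKPOINTS[p], trust_remote_code=True)

model = load_for_budget("medium")  # loads the P=4 variant
```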

Checkpoints Used to Fit the Scaling Law

Download link: https://huggingface.co/ParScale/ParScale-{size}-{P}-{dataset}

  • {size}: model size, one of {0.7B, 0.9B, 1.3B, 1.8B, 3B, 4.7B}
  • {P}: number of parallel streams, one of {P1, P2, P4, P8}
  • {dataset}: training dataset, one of {Python, Pile}
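
For example, the repo id for a 1.3B, $P=4$ model trained on Pile can be assembled from the pattern above and fetched with `huggingface_hub` (the chosen values are arbitrary examples):

```python
# Minimal sketch: build a scaling-law checkpoint id from the naming pattern and download it.
from huggingface_hub import snapshot_download

size, P, dataset = "1.3B", "P4", "Pile"  # pick values from the sets listed above
repo_id = f"ParScale/ParScale-{size}-{P}-{dataset}"
local_dir = snapshot_download(repo_id=repo_id)
print(f"Downloaded {repo_id} to {local_dir}")
```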