tjingrant/sparsellm-1b-80p

This is a sparse causal language model based on LLaMA2-1B, with its linear layers pruned to 80% sparsity.

Model Details

  • Model Type: Sparse Causal Language Model
  • Base Model: LLaMA2-1B
  • Sparsity Configuration: 80% sparsity
  • Training Data: Fineweb-Edu dataset
  • Tokenizer: Same as the original LLaMA2 model
  • Perplexity: 25.77 (measured on Wikitext-103)
  • Parameter Counts:
    • Total Parameters: 1.20B
    • Total Linear Parameters: 1.14B
    • Non-zero Linear Parameters: 0.23B
    • Linear Layer Sparsity: 80.00% (a verification sketch follows this list)
    • Average Linear Parameters During Training: 0.45B (Average Density: 0.3984)
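
The reported linear-layer sparsity and non-zero linear parameter counts can be checked directly from the downloaded weights. Below is a minimal verification sketch (not part of the official release); it counts exact zeros in every torch.nn.Linear weight, so the result may differ slightly from the table depending on whether the lm_head is included in the authors' count.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        weight = module.weight
        total += weight.numel()                # all linear parameters
        zeros += (weight == 0).sum().item()    # pruned (exactly zero) entries

print(f"linear params: {total / 1e9:.2f}B")
print(f"non-zero linear params: {(total - zeros) / 1e9:.2f}B")
print(f"linear layer sparsity: {zeros / total:.2%}")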

Training Parameters

  • Training Steps: 25050
  • Batch Size: 8M tokens (4096 sequences × 2048 tokens per sequence)
  • Learning Rate: 0.0003
  • Total Training Tokens: 200,400,000,000 (200.4B)
  • Final Training Loss: 2.2275 ± 0.0211 (from last 1% of steps)
  • Pruning Start Step: 2500
  • Pruning End Step: 12600 (see the average-density sketch after this list)
  • Matching Dense Model: sparsellm-1b-80p-small-dense
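
The "Average Linear Parameters During Training" figure is the per-step count of active (non-zero) linear parameters averaged over the whole run, which is determined by the pruning schedule. The exact schedule shape is not stated on this card, so the sketch below uses a cubic ramp (as in gradual magnitude pruning) purely for illustration; its output will match the reported 0.45B / 0.3984 only if that ramp matches the schedule actually used.

def sparsity_at_step(step, start=2500, end=12600, final_sparsity=0.80):
    """Illustrative cubic ramp from dense at `start` to `final_sparsity` at `end`."""
    if step < start:
        return 0.0
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

total_steps = 25050
linear_params = 1.14e9  # total linear parameters from the table below

avg_density = sum(1.0 - sparsity_at_step(s) for s in range(total_steps)) / total_steps
print(f"average density: {avg_density:.4f}")
print(f"average linear params: {avg_density * linear_params / 1e9:.2f}B")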

Performance and Training Details

Here is the performance and parameter information for all models in this series:

| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|---|---|---|---|---|---|---|---|---|---|---|
| sparsellm-1b-20p | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| sparsellm-1b-40p | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| sparsellm-1b-60p | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| sparsellm-1b-80p | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| sparsellm-1b-20p-small-dense | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| sparsellm-1b-40p-small-dense | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| sparsellm-1b-60p-small-dense | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| sparsellm-1b-80p-small-dense | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |

Notes:

  • Perplexity is measured on Wikitext-103 (a reproduction sketch follows these notes)
  • Batch Size is given in tokens (samples × sequence length)
  • Total Tokens = Training Steps × Batch Size
  • Final Train Loss is computed from the last 1% of training steps (mean ± std)
  • Avg Linear Params is the average number of active parameters during training, computed from the pruning schedule
  • Rows 1-4 are the sparse models; rows 5-8 are the corresponding dense models, sized so that their parameter counts (approximately) match the sparse models' average parameter counts over pretraining.
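
The perplexity column can be reproduced along the following lines. This is a minimal sketch, assuming non-overlapping 2048-token windows over the Wikitext-103 test split; the authors' exact evaluation protocol (stride, context handling) is not specified here, so results may differ slightly.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-80p"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048  # matches the training sequence length
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1) - 1, window):
    input_ids = encodings.input_ids[:, begin: begin + window]
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss   # mean NLL over predicted tokens
    n = input_ids.size(1) - 1                            # tokens that receive a loss
    nlls.append(loss * n)
    n_tokens += n

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Wikitext-103 perplexity: {ppl.item():.2f}")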

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tjingrant/sparsellm-1b-80p")
model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
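
Note that the pruned weights appear to be stored as explicit zeros in ordinary dense tensors (the checkpoint loads through the standard LLaMA architecture as above), so memory use and generation speed match a dense 1.2B-parameter model unless you convert the weights to a sparse format or use sparsity-aware kernels.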

Citation

If you use this model in your research, please cite our paper:

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}