# tjingrant/sparsellm-1b-80p
This is a sparse causal language model based on the LLaMA2-1B architecture, trained to 80% sparsity in its linear layers.
## Model Details
- Model Type: Sparse Causal Language Model
- Base Model: LLaMA2-1B
- Sparsity Configuration: 80% sparsity
- Training Data: Trained on the Fineweb-Edu dataset
- Tokenizer: Same as the original LLaMA2 model
- Perplexity: 25.77 (measured on Wikitext-103)
- Parameter Counts (see the verification sketch after this list):
  - Total Parameters: 1.20B
  - Total Linear Parameters: 1.14B
  - Non-zero Linear Parameters: 0.23B
  - Linear Layer Sparsity: 80.00%
  - Average Linear Parameters During Training: 0.45B (Average Density: 0.3984)
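These figures can be checked directly on the downloaded checkpoint. Below is a minimal sketch, assuming the pruned weights are stored as explicit zeros in the model's `nn.Linear` modules (the checkpoint ships in the standard dense `transformers` format):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

# Count all parameters and, separately, parameters inside nn.Linear layers.
total_params = sum(p.numel() for p in model.parameters())
linear_params, nonzero_linear = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        linear_params += module.weight.numel()
        nonzero_linear += int(torch.count_nonzero(module.weight))

print(f"Total parameters:           {total_params / 1e9:.2f}B")
print(f"Linear parameters:          {linear_params / 1e9:.2f}B")
print(f"Non-zero linear parameters: {nonzero_linear / 1e9:.2f}B")
print(f"Linear layer sparsity:      {1 - nonzero_linear / linear_params:.2%}")
```

Per the figures above, this should report roughly 1.20B total, 1.14B linear, and 0.23B non-zero linear parameters, i.e. about 80% linear-layer sparsity.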
## Training Parameters
- Training Steps: 25050
- Batch Size: 8M tokens (4096 × 2048)
- Learning Rate: 0.0003
- Total Training Tokens: 200,400,000,000 (200.4B)
- Final Training Loss: 2.2275 ± 0.0211 (from last 1% of steps)
- Pruning Start Step: 2500
- Pruning End Step: 12600 (see the average-density sketch below)
- Matching Dense Model: sparsellm-1b-80p-small-dense
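The "Average Linear Parameters During Training" figure is the step-weighted average of the number of active (non-zero) linear parameters over the whole run, which is the quantity the accompanying paper's scaling law is built on. The exact pruning schedule between the start and end steps is not stated on this card, so the sketch below assumes, purely for illustration, a linear ramp in linear-layer density from 1.0 at the pruning start step to 0.20 at the pruning end step; it will therefore not reproduce the reported 0.3984 exactly.

```python
# Illustrative only: the true pruning schedule is not specified on this card.
TOTAL_STEPS = 25050
PRUNE_START, PRUNE_END = 2500, 12600
FINAL_DENSITY = 0.20       # 80% sparsity in linear layers
LINEAR_PARAMS = 1.14e9     # total linear parameters

def density_at(step: int) -> float:
    """Assumed linear ramp from fully dense down to the final density."""
    if step < PRUNE_START:
        return 1.0
    if step >= PRUNE_END:
        return FINAL_DENSITY
    frac = (step - PRUNE_START) / (PRUNE_END - PRUNE_START)
    return 1.0 - frac * (1.0 - FINAL_DENSITY)

avg_density = sum(density_at(s) for s in range(TOTAL_STEPS)) / TOTAL_STEPS
print(f"Average linear density: {avg_density:.4f}")
print(f"Average linear params:  {avg_density * LINEAR_PARAMS / 1e9:.2f}B")
```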
## Performance and Training Details
Here is the performance and parameter information for all models in this series:
Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
---|---|---|---|---|---|---|---|---|---|---|
sparsellm-1b-20p | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
sparsellm-1b-40p | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
sparsellm-1b-60p | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
sparsellm-1b-80p | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
sparsellm-1b-20p-small-dense | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
sparsellm-1b-40p-small-dense | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
sparsellm-1b-60p-small-dense | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
sparsellm-1b-80p-small-dense | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |
Notes:
- Perplexity is measured on Wikitext-103 (see the evaluation sketch after these notes)
- Batch Size is given in tokens (samples × sequence length)
- Total Tokens = Training Steps × Batch Size
- Final Train Loss is computed from the last 1% of training steps (mean ± std)
- Avg Linear Params is the average number of active parameters during training, computed from the pruning schedule
- The first four rows are the sparse models; the last four are their matching dense models, sized so that their parameter counts approximately match the sparse models' average parameter counts over pretraining.
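The Wikitext-103 perplexities can be reproduced approximately with a standard sliding-window evaluation. The exact protocol (context length, stride, preprocessing) behind the figures above is not given here, so the following is a minimal sketch assuming non-overlapping 2048-token windows over the raw test split:

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-80p"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

# Concatenate the raw test split and tokenize it once.
test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
input_ids = tokenizer("\n\n".join(test["text"]), return_tensors="pt").input_ids

max_length = 2048  # assumed context length (matches the 2048 sequence length above)
nlls, n_tokens = [], 0
with torch.no_grad():
    for start in range(0, input_ids.size(1) - 1, max_length):
        chunk = input_ids[:, start : start + max_length]
        if chunk.size(1) < 2:
            break
        # Labels equal inputs; the model shifts them internally for next-token loss.
        loss = model(chunk, labels=chunk).loss
        nlls.append(loss * (chunk.size(1) - 1))
        n_tokens += chunk.size(1) - 1

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Wikitext-103 perplexity: {ppl.item():.2f}")
```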
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the tokenizer (same as the original LLaMA2 tokenizer) and the sparse model.
tokenizer = AutoTokenizer.from_pretrained("tjingrant/sparsellm-1b-80p")
model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

# Generate a short continuation of a prompt.
inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
## Citation
If you use this model in your research, please cite our paper:
The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws
```bibtex
@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025},
}
```