tjingrant/sparsellm-1b-80p

This is a sparse causal language model based on LLaMA2-1B, with its linear layers pruned to 80% sparsity.

Model Details

  • Model Type: Sparse Causal Language Model
  • Base Model: LLaMA2-1B
  • Sparsity Configuration: 80% sparsity
  • Training Data: Fineweb-Edu dataset
  • Tokenizer: Same as the original LLaMA2 model
  • Perplexity: 25.77 (measured on Wikitext-103)
  • Parameter Counts:
    • Total Parameters: 1.20B
    • Total Linear Parameters: 1.14B
    • Non-zero Linear Parameters: 0.23B
    • Linear Layer Sparsity: 80.00% (a verification sketch follows this list)
    • Average Linear Parameters During Training: 0.45B (Average Density: 0.3984)
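
The reported linear-layer sparsity and non-zero linear parameter counts can be checked directly from the downloaded weights. Below is a minimal verification sketch (not part of the official release); it counts exact zeros in every torch.nn.Linear weight, so the result may differ slightly from the table depending on whether the lm_head is included in the authors' count.

import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

total, zeros = 0, 0
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        weight = module.weight
        total += weight.numel()                # all linear parameters
        zeros += (weight == 0).sum().item()    # pruned (exactly zero) entries

print(f"linear params: {total / 1e9:.2f}B")
print(f"non-zero linear params: {(total - zeros) / 1e9:.2f}B")
print(f"linear layer sparsity: {zeros / total:.2%}")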

Training Parameters

  • Training Steps: 25050
  • Batch Size: 8M tokens (4096 sequences × 2048 tokens per sequence)
  • Learning Rate: 0.0003
  • Total Training Tokens: 200,400,000,000 (200.4B)
  • Final Training Loss: 2.2275 ± 0.0211 (from last 1% of steps)
  • Pruning Start Step: 2500
  • Pruning End Step: 12600 (see the average-density sketch after this list)
  • Matching Dense Model: sparsellm-1b-80p-small-dense
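
The "Average Linear Parameters During Training" figure is the per-step count of active (non-zero) linear parameters averaged over the whole run, which is determined by the pruning schedule. The exact schedule shape is not stated on this card, so the sketch below uses a cubic ramp (as in gradual magnitude pruning) purely for illustration; its output will match the reported 0.45B / 0.3984 only if that ramp matches the schedule actually used.

def sparsity_at_step(step, start=2500, end=12600, final_sparsity=0.80):
    """Illustrative cubic ramp from dense at `start` to `final_sparsity` at `end`."""
    if step < start:
        return 0.0
    if step >= end:
        return final_sparsity
    frac = (step - start) / (end - start)
    return final_sparsity * (1.0 - (1.0 - frac) ** 3)

total_steps = 25050
linear_params = 1.14e9  # total linear parameters from the table below

avg_density = sum(1.0 - sparsity_at_step(s) for s in range(total_steps)) / total_steps
print(f"average density: {avg_density:.4f}")
print(f"average linear params: {avg_density * linear_params / 1e9:.2f}B")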

Performance and Training Details

Here is the performance and parameter information for all models in this series:

| Model | Total Params | Linear Params | Avg Linear Params | Non-Zero Linear | Sparsity | Batch Size | LR | Total Tokens | Final Train Loss | Perplexity |
|---|---|---|---|---|---|---|---|---|---|---|
| sparsellm-1b-20p | 1.20B | 1.14B | 1.02B | 0.91B | 20.00% | 8M | 3e-4 | 89.6B | 2.133 ± 0.022 | 19.58 |
| sparsellm-1b-40p | 1.20B | 1.14B | 0.87B | 0.68B | 40.00% | 8M | 3e-4 | 104.4B | 2.137 ± 0.013 | 19.93 |
| sparsellm-1b-60p | 1.20B | 1.14B | 0.69B | 0.46B | 60.00% | 8M | 3e-4 | 131.0B | 2.182 ± 0.017 | 20.80 |
| sparsellm-1b-80p | 1.20B | 1.14B | 0.45B | 0.23B | 80.00% | 8M | 3e-4 | 200.4B | 2.228 ± 0.021 | 25.77 |
| sparsellm-1b-20p-small-dense | 1.07B | 1.01B | 1.01B | 1.01B | 0.00% | 8M | 3e-4 | 89.6B | 2.139 ± 0.022 | 19.49 |
| sparsellm-1b-40p-small-dense | 0.88B | 0.82B | 0.82B | 0.82B | 0.00% | 8M | 3e-4 | 104.4B | 2.161 ± 0.024 | 21.40 |
| sparsellm-1b-60p-small-dense | 0.70B | 0.65B | 0.65B | 0.65B | 0.00% | 8M | 3e-4 | 131.0B | 2.209 ± 0.021 | 22.58 |
| sparsellm-1b-80p-small-dense | 0.46B | 0.42B | 0.42B | 0.42B | 0.00% | 8M | 3e-4 | 200.4B | 2.237 ± 0.028 | 24.57 |

Notes:

  • Perplexity is measured on Wikitext-103 (a reproduction sketch follows these notes)
  • Batch Size is given in tokens (samples × sequence length)
  • Total Tokens = Training Steps × Batch Size
  • Final Train Loss is computed from the last 1% of training steps (mean ± std)
  • Avg Linear Params is the average number of active parameters during training, computed from the pruning schedule
  • Rows 1-4 are the sparse models; rows 5-8 are the corresponding dense models, sized so that their parameter counts (approximately) match the sparse models' average parameter counts over pretraining.
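
The perplexity column can be reproduced along the following lines. This is a minimal sketch, assuming non-overlapping 2048-token windows over the Wikitext-103 test split; the authors' exact evaluation protocol (stride, context handling) is not specified here, so results may differ slightly.

import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "tjingrant/sparsellm-1b-80p"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

test = load_dataset("wikitext", "wikitext-103-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")

window = 2048  # matches the training sequence length
nlls, n_tokens = [], 0
for begin in range(0, encodings.input_ids.size(1) - 1, window):
    input_ids = encodings.input_ids[:, begin: begin + window]
    with torch.no_grad():
        loss = model(input_ids, labels=input_ids).loss   # mean NLL over predicted tokens
    n = input_ids.size(1) - 1                            # tokens that receive a loss
    nlls.append(loss * n)
    n_tokens += n

ppl = torch.exp(torch.stack(nlls).sum() / n_tokens)
print(f"Wikitext-103 perplexity: {ppl.item():.2f}")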

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tjingrant/sparsellm-1b-80p")
model = AutoModelForCausalLM.from_pretrained("tjingrant/sparsellm-1b-80p")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
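
Note that the pruned weights appear to be stored as explicit zeros in ordinary dense tensors (the checkpoint loads through the standard LLaMA architecture as above), so memory use and generation speed match a dense 1.2B-parameter model unless you convert the weights to a sparse format or use sparsity-aware kernels.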

Citation

If you use this model in your research, please cite our paper:

The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws

@inproceedings{jin2025the,
  title={The Journey Matters: Average Parameter Count over Pre-training Unifies Sparse and Dense Scaling Laws},
  author={Tian Jin and Ahmed Imtiaz Humayun and Utku Evci and Suvinay Subramanian and Amir Yazdanbakhsh and Dan Alistarh and Gintare Karolina Dziugaite},
  booktitle={The Thirteenth International Conference on Learning Representations},
  year={2025}
}