# rlvr-weak-supervision (collection)

Models from "When Can LLMs Learn to Reason with Weak Supervision?" — Llama-3.2-3B with continual pre-training and Thinking SFT.
## Llama-3.2-3B-CPT-Math

Llama-3.2-3B continually pre-trained on 52B math tokens from Nemotron-CC-Math-4plus (quality score ≥ 4, 45M documents).
Released as part of *When Can LLMs Learn to Reason with Weak Supervision?* (Rahman, Shen, Mordvina, Palangi, Gabriel, and Izmailov, 2026).
## Training details

| Setting | Value |
|---|---|
| Base model | meta-llama/Llama-3.2-3B |
| Data | Nemotron-CC-Math-4plus (52B tokens) |
| Epochs | 1 |
| Sequence length | 2,048 tokens |
| Effective batch size | 128 sequences (~262K tokens) |
| Learning rate | 2e-5, cosine decay, 5% warmup |
| Optimizer | AdamW, weight decay 0.01, gradient clipping 1.0 |
| Precision | BF16 + Flash Attention 2 |
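
For illustration only, here is a sketch of how these hyperparameters might map onto Hugging Face `TrainingArguments`. The per-device batch size and gradient-accumulation split are assumptions (only the 128-sequence effective batch is given above), and the actual training stack used in the paper is not specified on this card:

```python
from transformers import TrainingArguments

# Sketch only: the 8 x 16 split of the 128-sequence effective batch is an
# assumption. Flash Attention 2 is enabled at model load time (via
# attn_implementation="flash_attention_2"), not through TrainingArguments.
training_args = TrainingArguments(
    output_dir="llama-3.2-3b-cpt-math",  # hypothetical output path
    num_train_epochs=1,
    per_device_train_batch_size=8,       # assumed per-device size
    gradient_accumulation_steps=16,      # 8 * 16 = 128 sequences per step
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,                   # 5% warmup
    weight_decay=0.01,
    max_grad_norm=1.0,                   # gradient clipping
    bf16=True,
)
```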
## Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the continually pre-trained checkpoint and its tokenizer.
model = AutoModelForCausalLM.from_pretrained("pavelslab-nyu/Llama-3.2-3B-CPT-Math")
tokenizer = AutoTokenizer.from_pretrained("pavelslab-nyu/Llama-3.2-3B-CPT-Math")
```
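
A minimal generation sketch follows; the prompt is illustrative and `max_new_tokens` is an arbitrary choice, not a setting from the paper:

```python
import torch

prompt = "Problem: What is 12 * 17? Solution:"
inputs = tokenizer(prompt, return_tensors="pt")

# Greedy decoding on whatever device the model was loaded to.
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```

Since this is a continually pre-trained base model rather than an instruction-tuned one, plain text completion (no chat template) is the natural interface.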
## Citation

```bibtex
@article{rahman2026when,
  title   = {When Can LLMs Learn to Reason with Weak Supervision?},
  author  = {Rahman, Salman and Shen, Jingyan and Mordvina, Anna and
             Palangi, Hamid and Gabriel, Saadia and Izmailov, Pavel},
  journal = {Preprint},
  year    = {2026}
}
```