Part of the **rlvr-weak-supervision** collection: models from "When Can LLMs Learn to Reason with Weak Supervision?", covering Llama-3.2-3B with continual pre-training and Thinking SFT.
This model is Llama-3.2-3B-CPT-Math fine-tuned on 43.5K explicit reasoning traces from OpenThoughts-114k (math subset, thinking format).
Pipeline: Base → CPT (52B tokens) → Thinking SFT
Released as part of: When Can LLMs Learn to Reason with Weak Supervision? — Rahman, Shen, Mordvina, Palangi, Gabriel, Izmailov (2026)
| Setting | Value |
|---|---|
| Init | pavelslab-nyu/Llama-3.2-3B-CPT-Math |
| Data | OpenThoughts-114k math subset (43.5K examples) |
| Epochs | 3 |
| Sequence length | 8,192 tokens |
| Effective batch size | 256 sequences |
| Learning rate | 1.5e-5, cosine decay, 10% warmup |
| Optimizer | AdamW, weight decay 0.01 |
| Precision | BF16 + Flash Attention 2 |
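The table maps fairly directly onto standard Hugging Face training configuration. Below is a minimal sketch of how these hyperparameters could be expressed with `transformers.TrainingArguments`; the output directory, the per-device batch / gradient-accumulation split, and the logging and saving settings are assumptions, not the authors' released training code.

```python
from transformers import TrainingArguments

# Hypothetical reconstruction of the Thinking SFT hyperparameters listed above.
# Only the values taken from the table are sourced; the batch-size split and
# output/logging settings are assumptions.
training_args = TrainingArguments(
    output_dir="llama-3.2-3b-cpt-math-thinksft",  # hypothetical path
    num_train_epochs=3,                           # 3 epochs over the 43.5K examples
    per_device_train_batch_size=8,                # assumed split; together with
    gradient_accumulation_steps=32,               # accumulation gives 256 sequences
    learning_rate=1.5e-5,
    lr_scheduler_type="cosine",                   # cosine decay
    warmup_ratio=0.10,                            # 10% warmup
    weight_decay=0.01,                            # AdamW weight decay
    optim="adamw_torch",
    bf16=True,                                    # BF16 precision
    logging_steps=10,
    save_strategy="epoch",
)
```

The 8,192-token sequence length and Flash Attention 2 are not `TrainingArguments` fields; they would be handled when tokenizing the data and when loading the model (e.g. `attn_implementation="flash_attention_2"` in `from_pretrained`), respectively.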
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the Thinking SFT checkpoint and its tokenizer
model_id = "pavelslab-nyu/Llama-3.2-3B-CPT-Math-ThinkSFT"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
```
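A short generation sketch follows. The prompt, decoding settings, and the assumption that the tokenizer ships a chat template producing the thinking-format output are illustrative, not taken from the model card.

```python
import torch

# Example prompt; the chat template and "thinking" output format are assumptions
# based on the OpenThoughts-style SFT data described above.
messages = [{"role": "user", "content": "What is the sum of the first 100 positive integers?"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    output = model.generate(input_ids, max_new_tokens=1024, do_sample=False)

# Print only the newly generated tokens (the reasoning trace and final answer).
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```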
```bibtex
@article{rahman2026when,
  title   = {When Can LLMs Learn to Reason with Weak Supervision?},
  author  = {Rahman, Salman and Shen, Jingyan and Mordvina, Anna and
             Palangi, Hamid and Gabriel, Saadia and Izmailov, Pavel},
  journal = {Preprint},
  year    = {2026}
}
```