nanoDiff 150M: base
A masked diffusion language model trained with the LLaDA recipe, from the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".
The larger sibling of nanodiff-50m-base. This is the base model: pretrained on raw web text only, with no instruction tuning. It continues documents; it does not answer questions.
TL;DR
| Property | Value |
|---|---|
| Architecture | Masked diffusion LM (LLaDA recipe), bidirectional Transformer |
| Parameters | ~203M total · ~152M non-embedding (tied input/output embeddings) |
| Training data | FineWeb-Edu, ~3B tokens |
| Validation | NLL-bound 3.78 nats/token · perplexity 43.8 |
| Hardware | NVIDIA DGX-Spark (GB10), ~25 h |
For reference, the 50M sibling reaches a validation NLL-bound of 3.92 nats/token (perplexity ~50). Note that the two models were trained on different token budgets (50M: 2B tokens, 150M: 3B tokens), so the gap is not a clean capacity-only comparison.
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: `fc0833c`
- Config: `pretrain/configs/150m.py`
- Reproduce: `python scripts/prepare_data.py --out-dir data/fineweb_edu_10b --num-tokens 10_000_000_000`, then `python pretrain/train.py --config pretrain/configs/150m.py`
Architecture
A decoder-style Transformer with bidirectional self-attention. This is the one architectural change from an autoregressive GPT, since a diffusion LM denoises all positions at once.
| Hyperparameter | Value |
|---|---|
| Layers | 12 |
| Attention heads | 16 |
| Embedding dim (`n_embd`) | 1024 |
| Context length (`block_size`) | 512 |
| Positional encoding | RoPE |
| Vocabulary | 50304 (GPT-2 BPE 50257 + [MASK]@50257 + padding) |
| Embeddings | tied (input = output) |
| Total parameters | ~203M (~152M non-embedding) |
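The parameter figures can be sanity-checked from the hyperparameters in the table. The sketch below is a back-of-the-envelope estimate, not the project's own counting code. It assumes a standard GPT-style block with a 4x MLP expansion, no bias terms, and it ignores normalization parameters; none of these details are confirmed by this card.

```python
# Rough parameter count from the hyperparameter table.
# Assumptions (not confirmed by the model card): standard GPT-style block,
# 4x MLP expansion, no bias terms, normalization parameters ignored.
n_layer = 12
n_embd = 1024
vocab_size = 50304

attn_params = 4 * n_embd * n_embd        # Q, K, V and output projections
mlp_params = 2 * n_embd * (4 * n_embd)   # up- and down-projection
non_embedding = n_layer * (attn_params + mlp_params)

# Tied embeddings: the token matrix is counted once. RoPE adds no parameters.
embedding = vocab_size * n_embd
total = non_embedding + embedding

print(f"non-embedding: {non_embedding / 1e6:.1f}M")
print(f"embedding:     {embedding / 1e6:.1f}M")
print(f"total:         {total / 1e6:.1f}M")
```

Under these assumptions the estimate lands within about 1M parameters of the stated ~152M non-embedding and ~203M total; the remainder would come from whatever the simplifications leave out.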
Training objective
Masked (absorbing-state) diffusion, following
LLaDA: each token is independently replaced
by [MASK] with probability t ~ U(0, 1); the model predicts the clean tokens
at all masked positions in one pass; the loss is a 1/t-weighted cross-entropy
over masked positions (an upper bound on the negative log-likelihood).
Time-free parameterization: t is not fed to the model.
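The objective described above can be sketched in a few lines of plain Python. This is an illustrative toy, not the code in `pretrain/train.py`: the vocabulary size, the `uniform_model` stand-in, the division by sequence length, and the lower clip on t are all assumptions made for the example.

```python
import math
import random

# Toy sizes for illustration only; the real model uses a 50304-entry
# vocabulary with [MASK] at id 50257.
VOCAB_SIZE = 8
MASK_ID = 8


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token by MASK_ID independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def masked_diffusion_loss(log_probs, tokens, noisy, t):
    """1/t-weighted cross-entropy over masked positions.

    log_probs[i][v] is the model's log-probability of token v at position i.
    Dividing by the sequence length is an assumed normalization.
    """
    nll = sum(-log_probs[i][tok]
              for i, (tok, cur) in enumerate(zip(tokens, noisy))
              if cur == MASK_ID)
    return nll / (t * len(tokens))


def uniform_model(noisy):
    """Hypothetical stand-in for the network: uniform over the vocabulary.

    It receives only the noisy tokens, never t (time-free parameterization).
    """
    return [[-math.log(VOCAB_SIZE)] * VOCAB_SIZE for _ in noisy]


rng = random.Random(0)
tokens = [rng.randrange(VOCAB_SIZE) for _ in range(64)]
t = rng.uniform(1e-3, 1.0)  # t ~ U(0, 1); the lower clip is an assumption to avoid dividing by ~0
noisy = mask_tokens(tokens, t, rng)
loss = masked_diffusion_loss(uniform_model(noisy), tokens, noisy, t)
print(f"t={t:.3f}  masked={noisy.count(MASK_ID)}/{len(tokens)}  loss={loss:.3f} nats/token")
```

With the uniform stand-in, every masked position costs log(VOCAB_SIZE) nats, and the 1/t weight compensates for only about a fraction t of the positions being masked, so the loss averages to roughly log(VOCAB_SIZE) per token regardless of t.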
Training setup
| Setting | Value |
|---|---|
| Dataset | FineWeb-Edu, GPT-2 BPE, uint16 memmap |
| Tokens seen | ~3.0B (46,000 iters × batch 128 × 512 ctx) |
| Optimizer | AdamW (β1 0.9, β2 0.95, weight decay 0.1, grad clip 1.0) |
| LR schedule | WSD: warmup 1000 iters, stable, linear decay over the final 15000 iters |
| Peak / min LR | 8e-4 → 1e-5 |
| Batch size | 128 sequences |
| Precision | bf16 autocast |
| Compilation | torch.compile |
| Hardware | NVIDIA DGX-Spark (GB10), single device |
| Wall-clock | ~25 hours |
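The warmup-stable-decay (WSD) schedule and the token budget in the table can be written out as follows. This is a sketch under stated assumptions, not the schedule code from the repository: the function name `wsd_lr` is hypothetical, and the warmup is assumed to be linear from zero, which the table does not specify.

```python
MAX_ITERS = 46_000
WARMUP_ITERS = 1_000
DECAY_ITERS = 15_000
PEAK_LR = 8e-4
MIN_LR = 1e-5


def wsd_lr(it):
    """Warmup-stable-decay learning rate at iteration `it` (0-based)."""
    decay_start = MAX_ITERS - DECAY_ITERS
    if it < WARMUP_ITERS:
        # Assumed: linear warmup from 0 up to the peak.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    if it < decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Linear decay from the peak to the minimum over the final DECAY_ITERS.
    frac = min(1.0, (it - decay_start) / DECAY_ITERS)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


tokens_seen = MAX_ITERS * 128 * 512  # iters x batch size x context length
print(f"tokens seen: {tokens_seen:,}")
for it in (0, 999, 20_000, 31_000, 38_500, 46_000):
    print(f"iter {it:>6}: lr = {wsd_lr(it):.2e}")
```

The stable phase runs from iteration 1,000 to 31,000, so roughly two thirds of training happens at the peak learning rate before the final decay.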
Results
Offline evaluation (eval.py, 500 batches) on the FineWeb-Edu validation split:
- NLL-bound: 3.78 nats/token
- Perplexity: ~43.8
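Perplexity is the exponential of the per-token NLL in nats, so the two figures above carry the same information. Because the reported NLL is an upper bound, the perplexity is an upper bound as well. A quick check, using only the numbers quoted in this card:

```python
import math

# Per-token NLL bounds (nats) quoted in this card.
nll_bounds = {"150M": 3.78, "50M": 3.92}

perplexities = {name: math.exp(nll) for name, nll in nll_bounds.items()}
for name, ppl in perplexities.items():
    print(f"{name}: perplexity = {ppl:.1f}")
```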
Intended use
A research / learning artifact, not a product.
- Suitable as a base model for supervised fine-tuning, or for studying masked-diffusion training and sampling dynamics.
- Not an instruction-following model. Prompt it document-style ("The history of Rome is"), not question-style.
Sampling: important
Small diffusion LMs collapse into repetition loops under the default low-confidence remasking sampler. Use a frequency repetition penalty:
python sample.py --ckpt nanodiff-150m-base.pt \
--prompt "The history of Rome is" --rep-penalty 3.0
chat.py and sample.py default rep_penalty to 3.0.
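For intuition, one common form of frequency penalty lowers the logit of each token in proportion to how often it already appears in the sequence. The sketch below shows that generic form only; it is an assumption, not a description of how `sample.py` implements `--rep-penalty`, and the function name `apply_frequency_penalty` is hypothetical.

```python
from collections import Counter


def apply_frequency_penalty(logits, committed_tokens, rep_penalty=3.0):
    """Subtract rep_penalty * count from the logit of every already-used token.

    logits: list of floats indexed by token id, for one masked position.
    committed_tokens: token ids already unmasked in the sequence.
    This is a generic frequency penalty, assumed for illustration.
    """
    counts = Counter(committed_tokens)
    return [logit - rep_penalty * counts.get(tok, 0)
            for tok, logit in enumerate(logits)]


# Token 0 was already used twice and token 2 once, so both are pushed down
# and the unused token 1 becomes the most likely choice.
penalized = apply_frequency_penalty([2.0, 1.0, 0.5], committed_tokens=[0, 0, 2])
print(penalized)
```

A penalty of this kind makes each repeat of a token progressively less likely, which is the lever that breaks the repetition loops described above.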
Limitations
- Confabulates. At ~150M params on ~3B tokens, factual recall is unreliable. It learned English structure, not a dependable world model.
- Coherence degrades on harder prompts.
- English only; FineWeb-Edu domain.
- Base model: no alignment, no safety tuning.
Citation
Built on the LLaDA recipe:
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}