nanoDiff 50M (base)
A masked diffusion language model trained with the LLaDA recipe, from the nanoDiff project: a minimal, hackable "nanoGPT for diffusion LLMs" built for learning how diffusion LMs train.
This is the base model, pretrained on raw web text only. It has no instruction tuning: it continues documents, it does not answer questions. See Intended use below.
TL;DR
| Architecture | Masked diffusion LM (LLaDA recipe), bidirectional Transformer |
| Parameters | ~88M total · ~50M non-embedding (tied input/output embeddings) |
| Training data | FineWeb-Edu, |
| Validation | NLL-bound 3.92 nats/token · perplexity ~50.6 |
| Hardware | NVIDIA DGX-Spark (GB10), ~12 h wall-clock |
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: `de6c9a2`
- Config: `pretrain/configs/50m.py`
- Reproduce: `python scripts/prepare_data.py --out-dir data/fineweb_edu --num-tokens 2_000_000_000` then `python pretrain/train.py --config pretrain/configs/50m.py`
Architecture
A decoder-style Transformer with bidirectional self-attention. This is the one architectural change from an autoregressive GPT: a diffusion LM denoises all positions at once, so attention is not causally masked.
| Hyperparameter | Value |
|---|---|
| Layers | 7 |
| Attention heads | 12 |
| Embedding dim (`n_embd`) | 768 |
| Context length (`block_size`) | 1024 |
| Positional encoding | RoPE |
| Vocabulary | 50304 (GPT-2 BPE 50257 + [MASK]@50257 + padding to a multiple of 64) |
| Embeddings | tied (input = output) |
| Total parameters | ~88M (50M non-embedding) |
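
The parameter and vocabulary figures in the table can be roughly reproduced from the hyperparameters. The sketch below assumes a standard GPT-style block (four `d × d` attention projections plus a 4x-wide two-layer MLP) and ignores biases and normalization weights; the actual nanoDiff block layout may differ slightly. The helper names are hypothetical and not part of the repository.

```python
# Back-of-the-envelope parameter count for the architecture table.
# Assumption: standard GPT block (Q, K, V, output projections + 4x MLP),
# no biases, norm weights ignored. RoPE adds no learned parameters.

def padded_vocab(base_vocab: int, extra_tokens: int, multiple: int) -> int:
    """Round (base_vocab + extra_tokens) up to the next multiple."""
    raw = base_vocab + extra_tokens
    return -(-raw // multiple) * multiple  # ceiling division


def count_params(n_layer: int, n_embd: int, vocab_size: int) -> dict:
    attn = 4 * n_embd * n_embd        # Q, K, V and output projections
    mlp = 2 * n_embd * (4 * n_embd)   # up-projection and down-projection
    non_embedding = n_layer * (attn + mlp)
    embedding = vocab_size * n_embd   # counted once: input/output weights are tied
    return {
        "non_embedding": non_embedding,
        "embedding": embedding,
        "total": non_embedding + embedding,
    }


vocab = padded_vocab(50257, 1, 64)    # GPT-2 BPE + [MASK], padded to a multiple of 64
counts = count_params(n_layer=7, n_embd=768, vocab_size=vocab)
print(vocab)                          # 50304
print(counts)                         # ~49.5M non-embedding, ~88.2M total
```

Under these assumptions the embedding matrix accounts for roughly 38.6M of the ~88M total, which is why the model is named after its non-embedding count.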
Training objective
Masked (absorbing-state) diffusion, following LLaDA:
- Forward process: each token is independently replaced by `[MASK]` with probability `t ~ U(0, 1)`.
- Model: predicts the clean tokens at all masked positions in one pass. Time-free parameterization: `t` is not fed to the model.
- Loss: cross-entropy over masked positions, weighted by `1/t` and normalized by sequence length (an upper bound on the negative log-likelihood).
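The three steps above can be sketched in plain Python. This is an illustration only, not the code in `pretrain/train.py`: the function names are hypothetical, the "model" is a stand-in that assigns uniform probability to every token, and the lower clamp on `t` is an assumption added here to avoid dividing by zero.

```python
import math
import random

MASK_ID = 50257      # [MASK] id from the vocabulary table
VOCAB_SIZE = 50304


def mask_tokens(tokens, t, rng):
    """Forward process: replace each token by MASK_ID independently with probability t."""
    is_masked = [rng.random() < t for _ in tokens]
    noisy = [MASK_ID if m else tok for tok, m in zip(tokens, is_masked)]
    return noisy, is_masked


def diffusion_loss(target_log_probs, is_masked, t):
    """Cross-entropy over masked positions, weighted by 1/t, divided by sequence length.

    target_log_probs[i] is the model's log-probability of the clean token at
    position i; unmasked positions do not contribute.
    """
    masked_nll = sum(-lp for lp, m in zip(target_log_probs, is_masked) if m)
    return masked_nll / (t * len(is_masked))


rng = random.Random(0)
tokens = [rng.randrange(50257) for _ in range(16)]   # toy "clean" sequence
t = max(rng.random(), 1e-3)                          # t ~ U(0, 1); clamp is an assumption
noisy, is_masked = mask_tokens(tokens, t, rng)

# Stand-in model: uniform distribution over the vocabulary at every position.
uniform_lp = [-math.log(VOCAB_SIZE)] * len(tokens)
loss = diffusion_loss(uniform_lp, is_masked, t)
print(f"t={t:.3f} masked={sum(is_masked)}/{len(tokens)} loss={loss:.3f}")
```

On average a fraction `t` of the positions is masked, so the `1/t` weight cancels that fraction in expectation. For the uniform stand-in the expected loss is therefore `log(VOCAB_SIZE)`, regardless of `t`.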
Training setup
| Dataset | FineWeb-Edu, tokenized with GPT-2 BPE, uint16 memmap |
| Tokens seen | ~2.1B (16,000 iters × batch 128 × 1024 ctx) |
| Optimizer | AdamW (β1 0.9, β2 0.95, weight decay 0.1, grad clip 1.0) |
| LR schedule | WSD (warmup-stable-decay): warmup 500 iters, stable, linear decay over the final 3000 iters |
| Peak / min LR | 1.2e-3 → 1e-5 |
| Batch size | 128 sequences (no gradient accumulation) |
| Precision | bf16 autocast |
| Compilation | torch.compile (default mode) |
| Hardware | NVIDIA DGX-Spark (GB10), single device |
| Wall-clock | ~12 hours |
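
The schedule and token budget in the table can be written down as a short sketch. The constants come from the table; the function name and the exact shape of the warmup (linear from near zero) are assumptions, and the real implementation in the training script may differ in such details.

```python
MAX_ITERS = 16_000
WARMUP_ITERS = 500
DECAY_ITERS = 3_000
PEAK_LR = 1.2e-3
MIN_LR = 1e-5


def wsd_lr(it: int) -> float:
    """Warmup-stable-decay learning rate at 0-based iteration `it`."""
    if it < WARMUP_ITERS:
        # Assumption: linear warmup that reaches PEAK_LR at the last warmup step.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    decay_start = MAX_ITERS - DECAY_ITERS
    if it < decay_start:
        return PEAK_LR                      # stable phase
    # Linear decay from PEAK_LR down to MIN_LR over the final DECAY_ITERS steps.
    frac = min(1.0, (it - decay_start) / DECAY_ITERS)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


tokens_seen = MAX_ITERS * 128 * 1024        # iters x batch size x context length
print(tokens_seen)                          # 2097152000, i.e. ~2.1B
for it in (0, 499, 5_000, 13_000, 14_500, 16_000):
    print(it, f"{wsd_lr(it):.6f}")
```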
Results
Offline evaluation (eval.py, 500 batches) on the FineWeb-Edu validation split:
- NLL-bound: 3.92 nats/token
- Perplexity: ~50.6
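The two numbers are related by perplexity = exp(NLL). A quick check, assuming the reported 3.92 is rounded to two decimals, suggests the ~50.6 figure comes from the unrounded loss rather than from exp(3.92) itself:

```python
import math

nll_bound = 3.92                       # nats/token, as reported (rounded)
ppl_from_rounded = math.exp(nll_bound)

# Any NLL in [3.915, 3.925) rounds to 3.92, so the perplexity can lie in this range.
ppl_low, ppl_high = math.exp(3.915), math.exp(3.925)

print(f"exp(3.92) = {ppl_from_rounded:.2f}")
print(f"range consistent with rounding: {ppl_low:.2f} to {ppl_high:.2f}")
```

Because the loss is an upper bound on the negative log-likelihood, the perplexity derived from it is an upper bound as well.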
Intended use
This is a research / learning artifact, not a product.
- Good for: a base model to supervised-fine-tune (the LLaDA SFT recipe masks only the response tokens).
- Good for: studying masked-diffusion training and sampling dynamics.
- Not an instruction-following model. Prompt it with document-shaped text ("The capital of France is", "A recipe for bread:"), not questions.
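For the fine-tuning use case, the "mask only the response tokens" idea mentioned above can be sketched as follows. This is a hypothetical illustration, not code from the nanoDiff repository: the function name, the `prompt_len` argument, and the toy token ids are all assumptions.

```python
import random

MASK_ID = 50257  # [MASK] id used by this model's vocabulary


def mask_response_only(tokens, prompt_len, t, rng):
    """Mask each response token with probability t; prompt tokens are never masked."""
    noisy, is_masked = [], []
    for i, tok in enumerate(tokens):
        masked = i >= prompt_len and rng.random() < t
        noisy.append(MASK_ID if masked else tok)
        is_masked.append(masked)
    return noisy, is_masked


rng = random.Random(0)
prompt = [11, 12, 13, 14]            # toy prompt token ids
response = [21, 22, 23, 24, 25, 26]  # toy response token ids
noisy, is_masked = mask_response_only(prompt + response, len(prompt), t=0.5, rng=rng)
print(noisy)
```

The loss would then be computed over the masked response positions only, so the model learns to denoise the response conditioned on a clean prompt.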
Sampling (important)
Small diffusion LMs collapse into repetition loops ("the capital of France is the capital of France is…") under the default low-confidence remasking sampler. This is a logit-level bias, not model weakness. Use a frequency repetition penalty when sampling:
python sample.py --ckpt nanodiff-50m-base.pt \
--prompt "The capital of France is" --rep-penalty 3.0
`rep_penalty` subtracts `penalty × token_count` from each token's logits. `chat.py` and `sample.py` default it to 3.0. With it, the same weights produce varied, fluent English.
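The penalty described above amounts to a per-token subtraction on the logits. A minimal sketch, assuming the count is taken over the tokens already present in the sequence (the source does not say exactly which tokens `sample.py` counts) and using a hypothetical function name:

```python
from collections import Counter


def apply_rep_penalty(logits, seen_tokens, penalty=3.0):
    """Subtract penalty * count(token) from each token's logit.

    logits: one score per vocabulary id.
    seen_tokens: token ids already in the sequence (assumption: prompt plus
    positions that have been unmasked so far).
    """
    counts = Counter(seen_tokens)
    return [logit - penalty * counts[tok] for tok, logit in enumerate(logits)]


logits = [1.0, 2.0, 3.0, 0.5]        # toy 4-token vocabulary
seen = [2, 2, 0]                     # token 2 appeared twice, token 0 once
print(apply_rep_penalty(logits, seen))
```

Because the penalty grows linearly with the count, a token that keeps reappearing is pushed down further each time, which is what breaks the repetition loop.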
Limitations
- Confabulates freely. At 50M non-embedding params on 2.1B tokens, factual recall is unreliable ("founded by Louis XIV in 1515"). It learned English syntax, not a world model.
- Coherence degrades on harder prompts.
- English only; FineWeb-Edu domain.
- Base model: no alignment, no safety tuning.
Citation
Built on the LLaDA recipe:
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}