nanoDiff 350M base
A masked diffusion language model trained with the LLaDA recipe, from the nanoDiff project: a minimal, hackable "nanoGPT for diffusion LLMs".
The next rung in the scaling ladder after nanodiff-50m-base and nanodiff-150m-base. This is the base model: pretrained on raw web text only, with no instruction tuning. It continues documents; it does not answer questions.
TL;DR
| Architecture | Masked diffusion LM (LLaDA recipe), bidirectional Transformer |
| Parameters | ~382M total · ~317M non-embedding (tied input/output embeddings) |
| Training data | FineWeb-Edu, ~10B tokens |
| Validation | NLL-bound 3.38 nats/token · perplexity 29.3 |
| LAMBADA accuracy | 36.95% |
| Hardware | NVIDIA DGX-Spark (GB10), ~6 days |
Scaling family
This run completes a 4-point capacity scaling sweep on the same FineWeb-Edu shard, same masked-diffusion recipe, same WSD schedule structure:
| Model | Tokens | Val NLL | Perplexity | LAMBADA acc |
|---|---|---|---|---|
| 50M | 2B | 3.92 | 50.6 | 19.83% |
| 50M (matched-token control) | 3B | 3.91 | 50.1 | - |
| 150M | 3B | 3.78 | 43.8 | 21.89% |
| 350M | 10B | 3.38 | 29.3 | 36.95% |
The 150M→350M step is the largest gain in the family: -0.40 nats, a ~33% perplexity reduction. This is consistent with continuing to scale capacity and training tokens together. At ~20 tokens/param the Chinchilla-optimal budget for a 350M model is roughly 7B tokens, so the 10B-token run overshoots it modestly.
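The table figures can be cross-checked with a few lines of arithmetic. This is a minimal sketch that assumes perplexity is exp(NLL) with NLL in nats/token and uses the rounded values from the table; the variable names are illustrative.

```python
import math

# Rounded values copied from the scaling table.
nll_150m, nll_350m = 3.78, 3.38
ppl_150m, ppl_350m = 43.8, 29.3

# Perplexity is exp(NLL) when NLL is measured in nats/token.
ppl_from_nll = math.exp(nll_350m)

# Gain of the 150M -> 350M step.
delta_nats = nll_350m - nll_150m
ppl_reduction = 1.0 - ppl_350m / ppl_150m

# Tokens per parameter for the 350M run (~382M total parameters, ~10B tokens).
tokens_per_param = 10e9 / 382e6

print(f"exp(NLL) for 350M: {ppl_from_nll:.1f}")
print(f"NLL change:        {delta_nats:+.2f} nats")
print(f"PPL reduction:     {ppl_reduction:.1%}")
print(f"tokens/param:      {tokens_per_param:.1f}")
```

Because the table values are rounded, exp(3.38) will not reproduce 29.3 to the last digit; agreement to within a few tenths is what this check is for.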
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: 30fa4dd
- Config: pretrain/configs/350m.py
- Reproduce:

      python scripts/prepare_data.py --out-dir data/fineweb_edu_10b \
          --num-tokens 10_000_000_000
      python pretrain/train.py --config pretrain/configs/350m.py
Architecture
A decoder-style Transformer with bidirectional self-attention: the one architectural change from an autoregressive GPT, since a diffusion LM denoises all positions at once.
| Hyperparameter | Value |
|---|---|
| Layers | 16 |
| Attention heads | 20 |
| Embedding dim (n_embd) | 1280 |
| Head dim | 64 |
| Context length (block_size) | 512 |
| Positional encoding | RoPE |
| Vocabulary | 50304 (GPT-2 BPE 50257 + [MASK]@50257 + padding) |
| Embeddings | tied (input = output) |
| Total parameters | ~382M (~317M non-embedding) |
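
As a rough sanity check, the parameter counts can be estimated from the hyperparameters above. The sketch below assumes a standard GPT-style block (about 4·d² attention weights plus 8·d² MLP weights) and ignores biases and norms; the actual block layout in nanoDiff may differ, so treat the result as an approximation rather than the repo's own count.

```python
# Hyperparameters from the architecture table.
n_layer, n_head, n_embd, vocab_size = 16, 20, 1280, 50304

# Tied embeddings: one (vocab_size x n_embd) matrix shared by input and output.
# RoPE adds no learned positional parameters.
embedding_params = vocab_size * n_embd

# ASSUMPTION: ~4*d^2 attention weights (Q, K, V, output projection) and
# ~8*d^2 MLP weights (4x hidden expansion) per block; biases/norms ignored.
block_params = 12 * n_embd ** 2
non_embedding_params = n_layer * block_params

total_params = embedding_params + non_embedding_params
head_dim = n_embd // n_head

print(f"embedding:     {embedding_params / 1e6:.1f}M")
print(f"non-embedding: {non_embedding_params / 1e6:.1f}M")
print(f"total:         {total_params / 1e6:.1f}M")
print(f"head dim:      {head_dim}")
```

Under these assumptions the estimate lands within about 1% of the ~317M non-embedding and ~382M total figures reported in the TL;DR.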
Training objective
Masked (absorbing-state) diffusion, following
LLaDA: each token is independently replaced
by [MASK] with probability t ~ U(0, 1); the model predicts the clean tokens
at all masked positions in one pass; the loss is a 1/t-weighted cross-entropy
over masked positions (an upper bound on the negative log-likelihood).
Time-free parameterization: t is not fed to the model.
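The objective described above can be sketched in plain Python. This is an illustration under stated assumptions, not the repo's training code: `uniform_model` is a hypothetical stand-in for the network, the loss is assumed to be normalized by sequence length, and t is kept slightly away from 0 to avoid dividing by a tiny number.

```python
import math
import random

MASK_ID = 50257     # [MASK] id from the vocabulary table
VOCAB_SIZE = 50304


def mask_tokens(tokens, t, rng):
    """Forward process: mask each token independently with probability t."""
    is_masked = [rng.random() < t for _ in tokens]
    noisy = [MASK_ID if m else tok for tok, m in zip(tokens, is_masked)]
    return noisy, is_masked


def masked_diffusion_loss(true_log_probs, is_masked, t):
    """1/t-weighted cross-entropy over masked positions.

    true_log_probs[i] is the log-probability the model assigns to the clean
    token at position i. ASSUMPTION: the sum is normalized by sequence length.
    """
    ce = sum(-lp for lp, m in zip(true_log_probs, is_masked) if m)
    return ce / (t * len(true_log_probs))


def uniform_model(noisy_tokens):
    """Hypothetical stand-in for the network; note that it never receives t."""
    return [-math.log(VOCAB_SIZE)] * len(noisy_tokens)


rng = random.Random(0)
tokens = [rng.randrange(50257) for _ in range(512)]
t = 1e-3 + (1 - 1e-3) * rng.random()  # ASSUMPTION: keep t away from 0
noisy, is_masked = mask_tokens(tokens, t, rng)
loss = masked_diffusion_loss(uniform_model(noisy), is_masked, t)
print(f"t={t:.3f} masked={sum(is_masked)} loss={loss:.3f} nats/token")
```

With the uniform stand-in, the loss averages to ln(50304), about 10.8 nats/token, because the expected number of masked positions is t times the sequence length and the 1/t weight cancels it. A trained model is scored the same way, only with sharper predictions.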
Training setup
| Dataset | FineWeb-Edu, GPT-2 BPE, uint16 memmap |
| Tokens seen | ~10.0B (76,300 iters × batch 64 × 4 grad-accum × 512 ctx) |
| Optimizer | AdamW (β1 0.9, β2 0.95, weight decay 0.1, grad clip 1.0) |
| LR schedule | WSD: warmup 1000 iters, stable, linear decay over the final 25000 iters |
| Peak / min LR | 6e-4 → 1e-5 |
| Batch size | 64 sequences × 4 grad-accum = effective 256 sequences/step |
| Precision | bf16 autocast |
| Compilation | torch.compile(mode="default") |
| Hardware | NVIDIA DGX-Spark (GB10), single device |
| Wall-clock | ~6 days |
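
The token budget and the WSD (warmup-stable-decay) schedule from the table can be written out as follows. This is a sketch using the listed values; the linear shape of the warmup is an assumption, and `wsd_lr` is an illustrative name rather than a function from the repo.

```python
MAX_ITERS = 76_300
WARMUP_ITERS = 1_000
DECAY_ITERS = 25_000
PEAK_LR, MIN_LR = 6e-4, 1e-5


def wsd_lr(it):
    """Warmup-stable-decay learning rate at iteration `it` (0-indexed)."""
    decay_start = MAX_ITERS - DECAY_ITERS
    if it < WARMUP_ITERS:
        # ASSUMPTION: linear warmup from ~0 up to the peak.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    if it < decay_start:
        # Stable phase: hold the peak learning rate.
        return PEAK_LR
    # Linear decay from peak to min over the final DECAY_ITERS iterations.
    frac = min(1.0, (it - decay_start) / DECAY_ITERS)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


# Tokens per optimizer step: micro-batch x grad-accum x context length.
tokens_per_step = 64 * 4 * 512
tokens_seen = MAX_ITERS * tokens_per_step

for it in (0, 999, 30_000, 51_300, 63_800, 76_300):
    print(f"iter {it:>6}: lr = {wsd_lr(it):.2e}")
print(f"tokens seen: {tokens_seen:,}")
```

The stable phase here runs from iteration 1,000 to 51,300, which leaves exactly 25,000 iterations for the decay.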
Results
Offline evaluation (eval.py, 500 batches) on the FineWeb-Edu validation split:
- NLL-bound: 3.38 nats/token
- Perplexity: ~29.3
LAMBADA last-word prediction (single-pass diffusion scoring, full 5153-example test split):
- Accuracy: 36.95% (1904/5153)
- Perplexity (last-word): 55.3
Notable asymmetry vs the 150M: val PPL improved 33% (43.8 → 29.3), but LAMBADA PPL improved 84% (358 → 55.3) and LAMBADA accuracy jumped +15.06 pp (21.89 → 36.95). The larger jump on the discourse-dependent benchmark suggests the added capacity is being spent on long-range representations rather than sharper local statistics: a capability signal, not just a fitting one.
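One plausible reading of "single-pass diffusion scoring" is sketched below: mask every token of the final word at once, run one forward pass, and count the example as correct only if all masked positions are recovered. This is an assumption about the procedure, not a copy of the repo's evaluation script, and `predict_fn` is a hypothetical callable standing in for the model.

```python
MASK_ID = 50257  # [MASK] id from the vocabulary table


def score_last_word(context_ids, target_ids, predict_fn):
    """Return True if one forward pass recovers every token of the last word.

    predict_fn is a hypothetical callable: it takes a list of token ids and
    returns the model's argmax token id at every position.
    """
    # Mask all target tokens at once; the context stays visible.
    noisy = list(context_ids) + [MASK_ID] * len(target_ids)
    predicted = predict_fn(noisy)
    # Compare predictions only at the masked (target) positions.
    return predicted[len(context_ids):] == list(target_ids)


def accuracy(examples, predict_fn):
    """Fraction of (context_ids, target_ids) pairs scored as correct."""
    hits = sum(score_last_word(c, t, predict_fn) for c, t in examples)
    return hits / len(examples)


def toy_predict(ids):
    """Toy predictor for illustration: fills every [MASK] with token 7."""
    return [7 if tok == MASK_ID else tok for tok in ids]


examples = [([1, 2, 3], [7]), ([4, 5], [7, 7]), ([1], [8])]
print(f"toy accuracy: {accuracy(examples, toy_predict):.3f}")
print(f"reported:     {1904 / 5153:.2%}")
```

Requiring all tokens of a multi-token word to match is the stricter of the possible conventions; a per-token accuracy would give a higher number.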
Intended use
A research / learning artifact, not a product.
- Use it as a base model for supervised fine-tuning, or to study masked-diffusion training and sampling dynamics one step above toy scale.
- Do not use it as an instruction-following model. Prompt it document-style ("The history of Rome is"), not question-style.
Sampling (important)
Small diffusion LMs collapse into repetition loops under the default
low-confidence remasking sampler. Use the frequency repetition penalty (on by
default in chat.py / sample.py):
python sample.py --ckpt nanodiff-350m-base.pt \
--prompt "The history of Rome is" --rep-penalty 3.0
chat.py and sample.py enable a number of speed defaults (compile, reduced
steps, persistent inductor cache). See the repo
README for the full table.
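To make the interaction between low-confidence remasking and a repetition penalty concrete, here is a toy single denoising step in plain Python. It is a sketch under assumptions: the penalty is taken to subtract `penalty * count` from the logit of each already-committed token, which may not match what `--rep-penalty` does in the repo, and `logits_fn` is a hypothetical callable standing in for the model.

```python
import math
from collections import Counter

MASK_ID = 50257  # [MASK] id from the vocabulary table


def penalize(logits, committed_tokens, penalty):
    """ASSUMED penalty form: subtract penalty * (times the token already appears)."""
    counts = Counter(committed_tokens)
    return {tok: logit - penalty * counts[tok] for tok, logit in logits.items()}


def best_with_confidence(logits):
    """Return (argmax token, its softmax probability)."""
    m = max(logits.values())
    z = sum(math.exp(v - m) for v in logits.values())
    best = max(logits, key=logits.get)
    return best, math.exp(logits[best] - m) / z


def remask_step(seq, logits_fn, n_commit, penalty=3.0):
    """Predict all masked positions, commit the n_commit most confident
    predictions, and leave the remaining positions as [MASK]."""
    committed = [tok for tok in seq if tok != MASK_ID]
    candidates = []
    for pos, tok in enumerate(seq):
        if tok == MASK_ID:
            logits = penalize(logits_fn(seq, pos), committed, penalty)
            best, conf = best_with_confidence(logits)
            candidates.append((conf, pos, best))
    # Highest-confidence predictions are kept; low-confidence ones stay masked.
    candidates.sort(reverse=True)
    out = list(seq)
    for _conf, pos, best in candidates[:n_commit]:
        out[pos] = best
    return out


def toy_logits(seq, pos):
    """Toy model that strongly prefers repeating token 1."""
    return {1: 2.0, 2: 1.0} if pos == 2 else {1: 5.0, 2: 0.0}


start = [1, 1, MASK_ID, MASK_ID]
print(remask_step(start, toy_logits, n_commit=1, penalty=0.0))
print(remask_step(start, toy_logits, n_commit=1, penalty=3.0))
```

In this toy case the unpenalized step commits a third copy of token 1, while the penalized step commits token 2 instead, which is the kind of loop-breaking behaviour the penalty is meant to provide.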
Limitations
- Confabulates. At ~350M params on ~10B tokens, factual recall is unreliable. It learned English structure, not a dependable world model.
- Coherence degrades on harder prompts.
- English only; FineWeb-Edu domain.
- Base model: no alignment, no safety tuning.
- Training was still improving when it ended: the val curve hadn't flattened. Extending to ~15B tokens would likely give another 0.05-0.10 nats; this checkpoint is not at its capacity ceiling.
Citation
Built on the LLaDA recipe:
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}