nanoDiff 350M — base

A masked diffusion language model trained with the LLaDA recipe, from the nanoDiff project — a minimal, hackable "nanoGPT for diffusion LLMs".

The next rung in the scaling ladder after nanodiff-50m-base and nanodiff-150m-base. This is the base model — pretrained on raw web text only, no instruction tuning. It continues documents; it does not answer questions.

TL;DR


Architecture	Masked diffusion LM (LLaDA recipe), bidirectional Transformer
Parameters	~382M total · ~317M non-embedding (tied input/output embeddings)
Training data	FineWeb-Edu, ~10B tokens
Validation	NLL-bound 3.38 nats/token · perplexity 29.3
LAMBADA accuracy	36.95%
Hardware	NVIDIA DGX-Spark (GB10), ~6 days

Scaling family

This run completes a 4-point capacity scaling sweep on the same FineWeb-Edu shard, same masked-diffusion recipe, same WSD schedule structure:

Model	Tokens	Val NLL	Perplexity	LAMBADA acc
50M	2B	3.92	50.6	19.83%
50M (matched-token control)	3B	3.91	50.1	—
150M	3B	3.78	43.8	21.89%
350M	10B	3.38	29.3	36.95%

The 150M→350M step is the largest gain in the family: -0.40 nats, a ~33% perplexity reduction (43.8 → 29.3), consistent with scaling capacity and training tokens together. At ~20 tokens/param the Chinchilla-optimal budget for a 350M model is ~7B tokens, so this 10B-token run overshoots it slightly.
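The perplexity column is the exponential of the Val NLL column (nats/token). A quick check recomputed from the rounded table values, so the last digit may differ from figures derived from unrounded logs:

```python
import math

# Validation NLL bounds (nats/token) copied from the scaling table.
val_nll = {"50M": 3.92, "150M": 3.78, "350M": 3.38}

# Perplexity is exp(NLL) when NLL is measured in nats per token.
ppl = {name: math.exp(nll) for name, nll in val_nll.items()}

# Gap between the two largest models, in nats and as a relative PPL drop.
delta_nats = val_nll["350M"] - val_nll["150M"]
ppl_reduction = 1.0 - ppl["350M"] / ppl["150M"]

for name, value in ppl.items():
    print(f"{name}: perplexity ~ {value:.1f}")
print(f"150M -> 350M: {delta_nats:+.2f} nats, {ppl_reduction:.0%} lower perplexity")
```

Because perplexity is exponential in NLL, a fixed gap in nats always maps to the same relative perplexity drop, whatever the starting point.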

Provenance

Code: https://github.com/BY571/nanoDiff
Trained at commit: 30fa4dd
Config: pretrain/configs/350m.py

Reproduce:

python scripts/prepare_data.py --out-dir data/fineweb_edu_10b \
    --num-tokens 10_000_000_000
python pretrain/train.py --config pretrain/configs/350m.py

Architecture

A decoder-style Transformer with bidirectional self-attention — the one architectural change from an autoregressive GPT, since a diffusion LM denoises all positions at once.

Hyperparameter	Value
Layers	16
Attention heads	20
Embedding dim (`n_embd`)	1280
Head dim	64
Context length (`block_size`)	512
Positional encoding	RoPE
Vocabulary	50304 (GPT-2 BPE 50257 + `[MASK]`@50257 + padding)
Embeddings	tied (input = output)
Total parameters	~382M (~317M non-embedding)
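The parameter counts can be roughly reproduced from the table. This is a back-of-the-envelope sketch that assumes a standard block layout (four d×d attention projections plus an MLP with 4× hidden expansion) and ignores norms and biases. The actual block layout in the repo may differ, which would explain the small gap to the stated figures.

```python
# Values from the architecture table.
n_layer, n_head, n_embd, vocab_size = 16, 20, 1280, 50304

head_dim = n_embd // n_head

# Assumption: each block has 4*d^2 attention weights (Q, K, V, output)
# and 8*d^2 MLP weights (4x hidden expansion); norms and biases ignored.
per_block = 4 * n_embd**2 + 8 * n_embd**2
non_embedding = n_layer * per_block

# Tied embeddings: the token matrix is counted once for input and output.
embedding = vocab_size * n_embd
total = non_embedding + embedding

print(f"head dim:       {head_dim}")
print(f"non-embedding:  {non_embedding / 1e6:.1f}M")
print(f"embedding:      {embedding / 1e6:.1f}M")
print(f"total:          {total / 1e6:.1f}M")
```

The estimate lands within about 1% of the ~317M / ~382M figures in the table.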

Training objective

Masked (absorbing-state) diffusion, following LLaDA: each token is independently replaced by [MASK] with probability t ~ U(0, 1); the model predicts the clean tokens at all masked positions in one pass; the loss is a 1/t-weighted cross-entropy over masked positions (an upper bound on the negative log-likelihood). Time-free parameterization — t is not fed to the model.
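A minimal pure-Python sketch of this objective, not the repo's training code. It assumes one t per sequence (as in LLaDA), a small clamp `eps` to keep 1/t finite, and normalization by sequence length. `uniform_log_prob` is a hypothetical stand-in for the network. For a uniform predictor the expected loss is exactly log(vocab size), which makes a convenient sanity check:

```python
import math
import random

MASK_ID = 50257      # [MASK] id from the vocabulary row above
VOCAB_SIZE = 50304


def forward_mask(tokens, t, rng):
    """Replace each token by MASK_ID independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]


def uniform_log_prob(noisy_tokens, position, target):
    """Hypothetical stand-in for the network: uniform over the vocabulary."""
    return -math.log(VOCAB_SIZE)


def masked_diffusion_loss(tokens, log_prob_fn, rng, eps=1e-3):
    """One Monte Carlo sample of the 1/t-weighted masked cross-entropy."""
    # Assumption: one t per sequence, clamped away from 0 to keep 1/t finite.
    t = eps + (1.0 - eps) * rng.random()
    noisy = forward_mask(tokens, t, rng)
    # Cross-entropy is summed over masked positions only.
    ce = sum(
        -log_prob_fn(noisy, i, tok)
        for i, tok in enumerate(tokens)
        if noisy[i] == MASK_ID
    )
    # Weight by 1/t and normalize by sequence length (nats per token).
    return ce / (t * len(tokens))


rng = random.Random(0)
tokens = [rng.randrange(50257) for _ in range(512)]  # never equals MASK_ID
losses = [masked_diffusion_loss(tokens, uniform_log_prob, rng) for _ in range(2000)]
mean_loss = sum(losses) / len(losses)
print(f"mean loss {mean_loss:.3f} vs log(V) {math.log(VOCAB_SIZE):.3f}")
```

The 1/t weight compensates for the fact that a fraction t of positions is masked on average, so every noise level contributes equally to the per-token bound.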

Training setup


Dataset	FineWeb-Edu, GPT-2 BPE, uint16 memmap
Tokens seen	~10.0B (76,300 iters × batch 64 × 4 grad-accum × 512 ctx)
Optimizer	AdamW (β1 0.9, β2 0.95, weight decay 0.1, grad clip 1.0)
LR schedule	WSD — warmup 1000, stable, linear decay over the final 25000 iters
Peak / min LR	6e-4 → 1e-5
Batch size	64 sequences × 4 grad-accum = effective 256 sequences/step
Precision	bf16 autocast
Compilation	`torch.compile(mode="default")`
Hardware	NVIDIA DGX-Spark (GB10), single device
Wall-clock	~6 days (~6.85 sec/iter steady-state)
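The token budget, wall-clock estimate, and LR schedule follow from the table values. The sketch below is illustrative only: the linear warmup from zero and the exact iteration indexing are assumptions, and the schedule in pretrain/train.py may differ in detail.

```python
# Values from the training-setup table.
MAX_ITERS = 76_300
WARMUP_ITERS = 1_000
DECAY_ITERS = 25_000
PEAK_LR, MIN_LR = 6e-4, 1e-5
BATCH, GRAD_ACCUM, BLOCK_SIZE = 64, 4, 512
SEC_PER_ITER = 6.85


def wsd_lr(it):
    """Warmup-stable-decay learning rate at iteration `it` (0-indexed)."""
    decay_start = MAX_ITERS - DECAY_ITERS
    if it < WARMUP_ITERS:
        # Assumption: linear warmup from 0 to the peak.
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    if it < decay_start:
        return PEAK_LR
    # Linear decay from peak to min over the final DECAY_ITERS iterations.
    frac = min(1.0, (it - decay_start) / DECAY_ITERS)
    return PEAK_LR + frac * (MIN_LR - PEAK_LR)


tokens_per_step = BATCH * GRAD_ACCUM * BLOCK_SIZE
tokens_seen = MAX_ITERS * tokens_per_step
wall_clock_days = MAX_ITERS * SEC_PER_ITER / 86_400

print(f"tokens/step: {tokens_per_step:,}")
print(f"tokens seen: {tokens_seen / 1e9:.2f}B")
print(f"wall clock:  {wall_clock_days:.2f} days")
for it in (0, 1_000, 51_300, 63_800, 76_300):
    print(f"iter {it:>6}: lr = {wsd_lr(it):.2e}")
```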

Results

Offline evaluation (eval.py, 500 batches) on the FineWeb-Edu validation split:

NLL-bound: 3.38 nats/token
Perplexity: ~29.3

LAMBADA last-word prediction (single-pass diffusion scoring, full 5153-example test split):

Accuracy: 36.95% (1904/5153)
Perplexity (last-word): 55.3

Notable asymmetry vs. the 150M model: validation PPL improved 33% (43.8 → 29.3), but LAMBADA PPL improved ~85% (358 → 55.3) and LAMBADA accuracy rose 15.06 pp (21.89 → 36.95). The larger jump on the discourse-dependent benchmark suggests the added capacity is going into long-range representations rather than sharper local statistics, which would be a capability signal and not just a better fit.
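The comparison can be recomputed from the reported numbers. `relative_drop` is a helper name introduced here for illustration only:

```python
def relative_drop(before, after):
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


val_ppl_drop = relative_drop(43.8, 29.3)       # validation perplexity
lambada_ppl_drop = relative_drop(358.0, 55.3)  # LAMBADA last-word perplexity
accuracy = 1904 / 5153                         # correct / total examples
accuracy_gain_pp = 36.95 - 21.89               # percentage points vs the 150M

print(f"val PPL drop:      {val_ppl_drop:.1%}")
print(f"LAMBADA PPL drop:  {lambada_ppl_drop:.1%}")
print(f"LAMBADA accuracy:  {accuracy:.2%}")
print(f"accuracy gain:     {accuracy_gain_pp:.2f} pp")
```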

Intended use

A research / learning artifact, not a product.

✅ A base model to supervised-fine-tune, or to study masked-diffusion training and sampling dynamics at the next-larger-than-toy scale.
❌ Not an instruction-following model. Prompt it document-style ("The history of Rome is"), not question-style.

Sampling — important

Small diffusion LMs collapse into repetition loops under the default low-confidence remasking sampler. Use the frequency repetition penalty (on by default in chat.py / sample.py):

python sample.py --ckpt nanodiff-350m-base.pt \
    --prompt "The history of Rome is" --rep-penalty 3.0

chat.py and sample.py enable a number of speed defaults (compile, reduced steps, persistent inductor cache). See the repo README for the full table.
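To illustrate why the penalty helps, here is a toy sketch of one low-confidence remasking step with a frequency penalty. It is not the code in sample.py. The penalty form (subtracting `rep_penalty` times the number of occurrences already in the sequence from each token's logit) is an assumption, and `MASK`, `penalize`, and `denoise_step` are names made up for this example.

```python
import math
from collections import Counter

MASK = -1  # hypothetical placeholder id for a masked position


def penalize(logits, counts, rep_penalty):
    """Subtract rep_penalty * (times already emitted) from each token's logit."""
    return [x - rep_penalty * counts.get(tok, 0) for tok, x in enumerate(logits)]


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def denoise_step(seq, logits_per_pos, num_to_commit, rep_penalty):
    """Commit the most confident masked positions; leave the rest masked."""
    counts = Counter(tok for tok in seq if tok != MASK)
    candidates = []
    for pos, tok in enumerate(seq):
        if tok != MASK:
            continue
        probs = softmax(penalize(logits_per_pos[pos], counts, rep_penalty))
        best = max(range(len(probs)), key=probs.__getitem__)
        candidates.append((probs[best], pos, best))
    # Low-confidence remasking: only the highest-confidence predictions survive.
    candidates.sort(reverse=True)
    out = list(seq)
    for _, pos, tok in candidates[:num_to_commit]:
        out[pos] = tok
    return out


# Toy example: vocabulary of 3 tokens, token 0 already emitted twice.
seq = [0, 0, MASK, MASK]
logits = [[], [], [2.0, 1.5, 0.0], [1.0, 0.2, 0.9]]
without_penalty = denoise_step(seq, logits, num_to_commit=1, rep_penalty=0.0)
with_penalty = denoise_step(seq, logits, num_to_commit=1, rep_penalty=3.0)
print("no penalty:  ", without_penalty)
print("with penalty:", with_penalty)
```

Without the penalty the sampler commits a third copy of token 0. With it, the repeated token is pushed down far enough that a different token wins the position.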

Limitations

Confabulates. At ~350M params on ~10B tokens, factual recall is unreliable. It learned English structure, not a dependable world model.
Coherence degrades on harder prompts.
English only; FineWeb-Edu domain.
Base model — no alignment, no safety tuning.
The training was still improving when it ended — the val curve hadn't flattened. Extending to ~15B tokens would likely give another 0.05-0.10 nats; this checkpoint is not at its capacity ceiling.

Citation

Built on the LLaDA recipe:

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}
