nanoDiff 50M - base

A masked diffusion language model trained with the LLaDA recipe, from the nanoDiff project: a minimal, hackable "nanoGPT for diffusion LLMs" built for learning how diffusion LMs train.

This is the base model, pretrained on raw web text only. It has no instruction tuning: it continues documents; it does not answer questions. See Intended use below.

TL;DR

Architecture    Masked diffusion LM (LLaDA recipe), bidirectional Transformer
Parameters      ~88M total · ~50M non-embedding (tied input/output embeddings)
Training data   FineWeb-Edu, 2.1B tokens (1 epoch)
Validation      NLL-bound 3.92 nats/token · perplexity ~50.6
Hardware        NVIDIA DGX-Spark (GB10), ~12 h wall-clock

Provenance

  • Code: https://github.com/BY571/nanoDiff
  • Trained at commit: de6c9a2
  • Config: pretrain/configs/50m.py
  • Reproduce: python scripts/prepare_data.py --out-dir data/fineweb_edu --num-tokens 2_000_000_000 then python pretrain/train.py --config pretrain/configs/50m.py

Architecture

A decoder-style Transformer with bidirectional self-attention. This is the one architectural change from an autoregressive GPT: a diffusion LM denoises all positions at once, so attention is not causally masked.

Hyperparameter               Value
Layers                       7
Attention heads              12
Embedding dim (n_embd)       768
Context length (block_size)  1024
Positional encoding          RoPE
Vocabulary                   50304 (GPT-2 BPE 50257 + [MASK]@50257 + padding to a multiple of 64)
Embeddings                   tied (input = output)
Total parameters             ~88M (~50M non-embedding)
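
The parameter totals in the table can be sanity-checked with a few lines of arithmetic. This is a rough sketch, not the project's own counting code: it assumes bias-free linear layers and a 4x MLP expansion (neither is stated in this card), and it ignores the small normalization weights.

```python
# Rough parameter count for the configuration in the table above.
# Assumptions (not stated in the model card): bias-free linear layers,
# a 4x MLP expansion, and no learned positional table (RoPE has no weights).

def round_up(n: int, multiple: int) -> int:
    """Smallest multiple of `multiple` that is >= n."""
    return ((n + multiple - 1) // multiple) * multiple

n_layer, n_embd = 7, 768
gpt2_vocab = 50257
vocab_size = round_up(gpt2_vocab + 1, 64)  # +1 for [MASK] at id 50257

attn_params = 4 * n_embd * n_embd          # Q, K, V and output projections
mlp_params = 2 * n_embd * (4 * n_embd)     # up- and down-projection
non_embedding = n_layer * (attn_params + mlp_params)
embedding = vocab_size * n_embd            # counted once: weights are tied

print(vocab_size)                 # 50304
print(non_embedding)              # 49545216
print(non_embedding + embedding)  # 88178688
```

Under these assumptions the numbers land on the ~50M non-embedding and ~88M total figures quoted above.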

Training objective

Masked (absorbing-state) diffusion, following LLaDA:

  • Forward process: a masking ratio t ~ U(0, 1) is sampled, then each token is independently replaced by [MASK] with probability t.
  • Model: predicts the clean tokens at all masked positions in one pass. The parameterization is time-free: t is not fed to the model.
  • Loss: cross-entropy over masked positions, weighted by 1/t and normalized by sequence length. This is an upper bound on the negative log-likelihood.
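
The three steps above can be sketched in plain Python. This is an illustration of the objective as described here, not the training code from the repository: `predict_probs` is a hypothetical stand-in for the model, the toy vocabulary has 4 tokens, and the small `eps` that keeps t away from zero is an assumption added to avoid dividing by zero.

```python
import math
import random

MASK_ID = 50257  # [MASK] id, as listed in the vocabulary entry

def forward_mask(tokens, t, rng):
    """Replace each token by MASK_ID independently with probability t."""
    return [MASK_ID if rng.random() < t else tok for tok in tokens]

def masked_diffusion_loss(tokens, noisy, predict_probs, t):
    """1/t-weighted cross-entropy over masked positions, divided by length.

    `predict_probs(noisy)` is a hypothetical model stand-in. It returns, for
    every position, a dict mapping token id -> probability. It never sees t
    (time-free parameterization).
    """
    probs = predict_probs(noisy)
    total = 0.0
    for i, (clean, seen) in enumerate(zip(tokens, noisy)):
        if seen == MASK_ID:                  # only masked positions count
            total += -math.log(probs[i][clean])
    return total / (t * len(tokens))

def uniform_model(seq):
    """Toy model: uniform distribution over a 4-token vocabulary."""
    return [{v: 0.25 for v in range(4)} for _ in seq]

rng = random.Random(0)
eps = 1e-3                                   # assumption: keep t above zero
tokens = [0, 1, 2, 3, 0, 1, 2, 3]
t = eps + (1 - eps) * rng.random()           # masking ratio, roughly U(0, 1)
noisy = forward_mask(tokens, t, rng)
loss = masked_diffusion_loss(tokens, noisy, uniform_model, t)
```

Because the expected number of masked positions is t times the sequence length, the 1/t weight makes the uniform toy model score log(4) nats per token in expectation, whatever t is drawn.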

Training setup

Dataset        FineWeb-Edu, tokenized with GPT-2 BPE, uint16 memmap
Tokens seen    ~2.1B (16,000 iters × batch 128 × 1024 ctx)
Optimizer      AdamW (β1 0.9, β2 0.95, weight decay 0.1, grad clip 1.0)
LR schedule    WSD: warmup 500 iters, stable, linear decay over the final 3000 iters
Peak / min LR  1.2e-3 → 1e-5
Batch size     128 sequences (no gradient accumulation)
Precision      bf16 autocast
Compilation    torch.compile (default mode)
Hardware       NVIDIA DGX-Spark (GB10), single device
Wall-clock     ~12 hours
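
The schedule and token budget listed above can be written out as a short sketch. The function name `wsd_lr` is hypothetical, and two details are assumptions rather than facts from this card: the warmup is taken to be linear, and the decay is taken to reach the minimum LR exactly at the final iteration.

```python
def wsd_lr(it, max_iters=16_000, warmup=500, decay=3_000,
           peak_lr=1.2e-3, min_lr=1e-5):
    """Warmup-stable-decay learning rate at iteration `it` (0-based)."""
    if it < warmup:                          # assumed linear warmup to peak
        return peak_lr * (it + 1) / warmup
    decay_start = max_iters - decay
    if it < decay_start:                     # stable plateau at peak LR
        return peak_lr
    frac = min(1.0, (it - decay_start) / decay)   # 0 -> 1 across the decay
    return peak_lr + frac * (min_lr - peak_lr)    # linear peak -> min

tokens_seen = 16_000 * 128 * 1024            # iters x batch x context
print(tokens_seen)                           # 2097152000
```

With these values the plateau covers iterations 500 through 12,999, and the linear decay covers the last 3,000.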

Results

Offline evaluation (eval.py, 500 batches) on the FineWeb-Edu validation split:

  • NLL-bound: 3.92 nats/token
  • Perplexity: ~50.6
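
The two numbers are linked by perplexity = exp(NLL per token). Since the NLL figure is an upper bound, so is the perplexity. A quick check, as a sketch that uses only the rounded values printed above:

```python
import math

nll_bound = 3.92                   # nats/token, as reported (rounded)
ppl_from_rounded = math.exp(nll_bound)

reported_ppl = 50.6
implied_nll = math.log(reported_ppl)

print(f"{ppl_from_rounded:.1f}")   # 50.4
print(f"{implied_nll:.3f}")        # 3.924
```

The small gap between 50.4 and the reported ~50.6 is consistent with the perplexity having been computed from an unrounded bound near 3.924.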

Intended use

This is a research / learning artifact, not a product.

  • ✅ A base model for supervised fine-tuning (the LLaDA SFT recipe masks only the response tokens).
  • ✅ Studying masked-diffusion training and sampling dynamics.
  • ❌ Not an instruction-following model. Prompt it with document-shaped text ("The capital of France is", "A recipe for bread:"), not questions.
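
The response-only masking mentioned in the first bullet can be sketched as follows. This is a minimal illustration under stated assumptions, not code from the repository: `mask_response_only` and `prompt_len` are hypothetical names, and the token ids are made up.

```python
import random

MASK_ID = 50257  # [MASK] id, as listed in the vocabulary entry

def mask_response_only(tokens, prompt_len, t, rng):
    """Mask each response token with probability t; leave the prompt intact."""
    return [
        MASK_ID if i >= prompt_len and rng.random() < t else tok
        for i, tok in enumerate(tokens)
    ]

rng = random.Random(0)
tokens = [11, 12, 13, 21, 22, 23, 24]   # 3 prompt tokens + 4 response tokens
noisy = mask_response_only(tokens, prompt_len=3, t=0.5, rng=rng)
```

The loss would then be taken over the masked response positions only, so the prompt acts purely as conditioning.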

Sampling (important)

Small diffusion LMs collapse into repetition loops ("the capital of France is the capital of France is…") under the default low-confidence remasking sampler. This is a logit-level bias, not model weakness. Use a frequency repetition penalty when sampling:

python sample.py --ckpt nanodiff-50m-base.pt \
    --prompt "The capital of France is" --rep-penalty 3.0

rep_penalty subtracts penalty × token_count from each token's logit. Both chat.py and sample.py default it to 3.0. With the penalty enabled, the same weights produce varied, fluent English.
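
The penalty rule can be illustrated with a small sketch. It is not the code from sample.py: `apply_rep_penalty` is a hypothetical helper, the vocabulary has 5 tokens, and counting only non-[MASK] tokens already in the sequence is an assumption.

```python
from collections import Counter

MASK_ID = 50257  # [MASK] id, as listed in the vocabulary entry

def apply_rep_penalty(logits, sequence, penalty=3.0):
    """Return a copy of `logits` with penalty * count subtracted per token id.

    `logits` is a list indexed by token id for one position; `sequence` is the
    current token list. Assumption: [MASK] placeholders are not counted.
    """
    counts = Counter(tok for tok in sequence if tok != MASK_ID)
    return [logit - penalty * counts[tok] for tok, logit in enumerate(logits)]

# Toy 5-token vocabulary: token 2 already appears twice, token 4 once.
logits = [0.0, 1.5, 5.0, 0.5, 4.0]
sequence = [2, 4, 2, MASK_ID, MASK_ID]
penalized = apply_rep_penalty(logits, sequence)
print(penalized)  # [0.0, 1.5, -1.0, 0.5, 1.0]
```

In this toy case the highest logit moves from the already repeated token 2 to the unused token 1, which is the effect that breaks repetition loops.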

Limitations

  • Confabulates freely. At 50M non-embedding params on 2.1B tokens, factual recall is unreliable ("founded by Louis XIV in 1515"). It learned English syntax, not a world model.
  • Coherence degrades on harder prompts.
  • English only; FineWeb-Edu domain.
  • Base model: no alignment, no safety tuning.

Citation

Built on the LLaDA recipe:

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}