nanoDiff 150M: SFT (Alpaca-cleaned)

An instruction-tuned masked diffusion language model: the nanoDiff 150M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.

The larger sibling of nanodiff-50m-sft-alpaca. From the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".

TL;DR

Base model: Sebasdi/nanodiff-150m-base
Fine-tuned on: yahma/alpaca-cleaned (~51k examples)
Method: LLaDA Algorithm 2 (masked-diffusion SFT, response-only masking)
Result: SFT response NLL bound ~1.27 · LAMBADA accuracy 15.74%
Parameters: ~203M total · ~152M non-embedding
Hardware: NVIDIA DGX-Spark (GB10), ~80 min

What this is

The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.

SFT changed only behavior, not knowledge. At 150M params and ~3B tokens of pretraining, the model still confabulates; see Limitations.

Base model

Sebasdi/nanodiff-150m-base: a masked diffusion LM, ~203M total / ~152M non-embedding params, pretrained on ~3B tokens of FineWeb-Edu (validation perplexity ~44, LAMBADA accuracy 21.89%).

Fine-tuning data

yahma/alpaca-cleaned: 51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training and 1,000 were held out for validation.

SFT method: LLaDA Algorithm 2

A small change from pretraining:

  • the forward (masking) process masks only the response span, so the prompt stays clean as pure conditioning;
  • the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
  • <|endoftext|> padding on short responses stays in the loss, which is what teaches the model to end its answer.
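
The three points above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the repository's training code: MASK_ID, mask_response, and sft_loss are hypothetical names, and the per-token negative log-likelihoods are assumed to come from the model's forward pass on the noised sequence.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the mask token


def mask_response(tokens, prompt_len, t, rng):
    """Forward process: mask each response token with probability t.

    Prompt positions (index < prompt_len) are never masked.
    Returns the noised sequence and the list of masked positions.
    """
    noisy = list(tokens)
    masked = []
    for i in range(prompt_len, len(tokens)):
        if rng.random() < t:
            noisy[i] = MASK_ID
            masked.append(i)
    return noisy, masked


def sft_loss(token_nll, masked, t, response_len):
    """1/t-weighted NLL summed over masked response positions,
    normalized by the (padded) response length."""
    return sum(token_nll[i] for i in masked) / (t * response_len)


# Toy example: 3 prompt tokens followed by 4 response tokens,
# the last of which plays the role of <|endoftext|> padding.
tokens = [11, 12, 13, 21, 22, 23, 0]
noisy, masked = mask_response(tokens, prompt_len=3, t=0.5, rng=random.Random(0))
```

Because the padding position sits inside the response span, it can be masked and scored like any other response token, which is how the end-of-answer signal enters the loss.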

Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
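
A small sketch of that encoding, for illustration only. The function names are hypothetical, the exact layout of the ### Input: block is an assumption (the card only says such a block exists), and truncating over-long responses before padding is likewise assumed.

```python
def build_prompt(instruction, input_text=""):
    """Format one Alpaca example as the conditioning prompt."""
    if input_text:
        # Assumed layout for the optional input block.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def pad_response(response_ids, length, eot_id):
    """Truncate to `length` (assumed), then right-pad with the end-of-text id."""
    ids = list(response_ids)[:length]
    return ids + [eot_id] * (length - len(ids))


prompt = build_prompt("Give one tip for staying healthy.")
padded = pad_response([5, 6], length=4, eot_id=0)  # eot_id=0 is a placeholder
```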

Training setup

Initialized from: nanodiff-150m-base
Examples: 50,760 train / 1,000 val
Prompt / response length: 256 / 256 tokens
Optimizer: AdamW
LR schedule: cosine, warmup 100, 7e-5 → 1e-5
Batch size: 64
Iterations: 5,000
Precision: bf16, torch.compile
Hardware: NVIDIA DGX-Spark (GB10)
Wall-clock: ~80 minutes

Apart from the model architecture and the (1/√width-rescaled) LR, the setup mirrors the 50M-SFT recipe exactly, so the comparison is methodologically clean.
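
The learning-rate schedule can be written out as a short function. This is a sketch using the numbers listed above; the linear shape of the warmup and the 0-indexed iteration counter are assumptions, since only "cosine, warmup 100" is stated.

```python
import math

MAX_LR, MIN_LR = 7e-5, 1e-5
WARMUP_ITERS, MAX_ITERS = 100, 5000


def lr_at(it):
    """Learning rate at iteration `it` (0-indexed)."""
    if it < WARMUP_ITERS:
        # Assumed: linear warmup up to MAX_LR.
        return MAX_LR * (it + 1) / WARMUP_ITERS
    if it >= MAX_ITERS:
        return MIN_LR
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))  # decays from 1 to 0
    return MIN_LR + cosine * (MAX_LR - MIN_LR)
```

If the batch size of 64 counts examples per iteration, 5,000 iterations process 320,000 examples, which is about 6.3 passes over the 50,760 training examples.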

Comparison vs the 50M sibling

LAMBADA (last-word prediction, single-pass diffusion scoring, full 5153-example test split):

Model                    | Best val NLL bound (SFT) | LAMBADA acc | LAMBADA PPL
nanodiff-50m-sft-alpaca  | 1.38                     | 14.32%      | 3344
nanodiff-150m-sft-alpaca | 1.27                     | 15.74%      | 1606

And for context, on the unaligned base models:

Model              | LAMBADA acc | LAMBADA PPL
nanodiff-50m-base  | 19.83%      | 834
nanodiff-150m-base | 21.89%      | 358

The alignment tax (LAMBADA accuracy drop from base → SFT) is −5.51 pp for the 50M and −6.15 pp for the 150M, i.e., almost identical in relative terms (about 28% of base accuracy lost) at both scales. So scaling capacity helps both the base and the SFT model, but does not shrink the alignment tax proportionally. The perplexity blow-up post-SFT is also similar at both scales (roughly 4.0× for the 50M and 4.5× for the 150M), which suggests the SFT distribution shift is, to first order, capacity-independent.
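
The arithmetic behind these figures can be reproduced directly from the two tables. The snippet below is only a worked calculation; the dictionary and function names are illustrative.

```python
# (LAMBADA accuracy in %, LAMBADA perplexity), copied from the tables above.
base = {"50M": (19.83, 834), "150M": (21.89, 358)}
sft = {"50M": (14.32, 3344), "150M": (15.74, 1606)}


def alignment_tax(size):
    """Return (absolute drop in pp, relative drop in %, PPL ratio)."""
    base_acc, base_ppl = base[size]
    sft_acc, sft_ppl = sft[size]
    drop_pp = base_acc - sft_acc
    return drop_pp, 100.0 * drop_pp / base_acc, sft_ppl / base_ppl


for size in ("50M", "150M"):
    drop, rel, ppl_ratio = alignment_tax(size)
    print(f"{size}: -{drop:.2f} pp, {rel:.1f}% relative, PPL x{ppl_ratio:.2f}")
# 50M: -5.51 pp, 27.8% relative, PPL x4.01
# 150M: -6.15 pp, 28.1% relative, PPL x4.49
```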

Provenance

  • Code: https://github.com/BY571/nanoDiff
  • Trained at commit: df2c6a7
  • Config: sft/configs/150m_alpaca.py
  • Reproduce: python scripts/prepare_sft_data.py --out-dir data/alpaca_sft then python sft/train.py --config sft/configs/150m_alpaca.py

How to use

hf download Sebasdi/nanodiff-150m-sft-alpaca nanodiff-150m-sft-alpaca.pt --local-dir checkpoints/

# interactive; note the --sft flag
python chat.py --ckpt checkpoints/nanodiff-150m-sft-alpaca.pt --sft

In --sft mode chat.py wraps each turn in the instruction template, treats it as an independent single-turn instruction, and truncates the answer at the model's <|endoftext|> end-marker. Prompt it with instructions:

  • "Give one tip for staying healthy."
  • "Write a sentence about the ocean."
  • "Explain what photosynthesis is."
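
The truncation step described for --sft mode amounts to cutting the decoded output at the first end marker. The sketch below assumes the cut happens on decoded text; chat.py may instead operate on token ids, and truncate_at_eot is a hypothetical name.

```python
EOT = "<|endoftext|>"


def truncate_at_eot(decoded):
    """Keep only the text before the first end-of-text marker."""
    return decoded.split(EOT, 1)[0].strip()


answer = truncate_at_eot("Eat well and sleep enough." + EOT + EOT + EOT)
```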

The repetition penalty (--rep-penalty 3.0, on by default) is required: small diffusion LMs collapse into repetition loops without it.
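
The card does not say how chat.py applies the penalty. For intuition, here is the widely used CTRL-style formulation, shown as an assumption rather than as the project's implementation; apply_repetition_penalty is a hypothetical name.

```python
def apply_repetition_penalty(logits, seen_ids, penalty=3.0):
    """Make tokens that were already generated less likely.

    Positive logits are divided by `penalty` and non-positive logits are
    multiplied by it, so the score always moves toward "less likely".
    """
    out = list(logits)
    for tok in set(seen_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_ids=[0, 1, 1])
```

Tokens that have not appeared yet keep their original logits, so the penalty only discourages repeats.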

Example

>>> Give one tip for staying healthy.
<<< One tip for staying healthy is to eat well-balanced foods that are high in
    essential nutrients such as vitamins and minerals. This will help you stay
    fit over the longer term by boosting your body's natural immune system
    function while maintaining a steady weight loss rate throughout the day or
    week if you're not getting enough sleep at all! A balanced diet can have a
    wide range of benefits, including mental health, physical fitness, and
    overall health.

(Sample generated with --gen-length 96 --steps 96 --rep-penalty 3.0 --seed 1337. The 50M SFT, on the same prompt, gave a stilted "step #1, step #2, step #3" list; the 150M's longer coherent prose is the qualitative improvement that accompanies the LAMBADA gain.)
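
In LLaDA-style samplers, the generation length and the number of steps together determine how many masked positions are committed at each denoising step. The sketch below assumes a uniform schedule; whether chat.py uses exactly this schedule is an assumption, and tokens_per_step is a hypothetical name.

```python
def tokens_per_step(gen_length, steps):
    """Split `gen_length` masked positions evenly across `steps`.

    Any remainder is spread over the earliest steps.
    """
    base, remainder = divmod(gen_length, steps)
    return [base + (1 if i < remainder else 0) for i in range(steps)]


schedule = tokens_per_step(96, 96)  # the sample's settings: one token per step
```

Under this assumption, using fewer steps than the generation length would commit several tokens per step, trading quality for speed.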

Limitations

  • Confabulates. A 150M model pretrained on ~3B tokens has no reliable world model, so facts are often wrong (the example above is roughly right about diet but trails off; "Write a sentence about the ocean" confidently asserts the ocean is "over 100 miles in length", which is meaningless). SFT taught it to answer, not to know.
  • Grammar is mostly correct but coherence degrades on harder instructions.
  • Single-turn only: trained on single-turn Alpaca; no multi-turn chat / conversation history.
  • English only; no alignment or safety tuning.
  • The SFT distribution shift makes this model worse at FineWeb-Edu-style text continuation than its base (LAMBADA perplexity blows up ~4×). Use the base model for completion tasks and this SFT model for instruction-following.

A learning artifact demonstrating the full base → SFT pipeline at the next rung up from the 50M, not a usable assistant.

Citation

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}