nanoDiff 150M - SFT (Alpaca-cleaned)
An instruction-tuned masked diffusion language model: the nanoDiff 150M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.
The larger sibling of nanodiff-50m-sft-alpaca. From the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".
TL;DR
| | |
|---|---|
| Base model | Sebasdi/nanodiff-150m-base |
| Fine-tuned on | yahma/alpaca-cleaned (~51k examples) |
| Method | LLaDA Algorithm 2: masked-diffusion SFT, response-only masking |
| Result | SFT response NLL-bound ~1.27 · LAMBADA accuracy 15.74% |
| Parameters | ~203M total · ~152M non-embedding |
| Hardware | NVIDIA DGX-Spark (GB10), ~80 min |
What this is
The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.
SFT changed only behavior, not knowledge. At 150M params and ~3B tokens of pretraining, the model still confabulates (see Limitations).
Base model
Sebasdi/nanodiff-150m-base: a masked diffusion LM, ~203M total / ~152M non-embedding params, pretrained on ~3B tokens of FineWeb-Edu (validation perplexity ~44, LAMBADA accuracy 21.89%).
Fine-tuning data
yahma/alpaca-cleaned: ~51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training and 1,000 were held out for validation.
SFT method: LLaDA Algorithm 2
A small change from pretraining:
- the forward (masking) process masks only the response span, so the prompt stays clean as pure conditioning;
- the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
- <|endoftext|> padding on short responses stays in the loss, which is what teaches the model to end its answer.
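The three points above can be sketched in plain Python. This is an illustrative sketch, not the project's training code: MASK_ID, mask_response, and sft_loss are hypothetical names, token_logprobs stands in for the log-probability the model assigns to the true token at each position, and the lower bound on t is an assumption made only to avoid dividing by zero.

```python
import random

MASK_ID = -1  # hypothetical id for the mask token


def mask_response(tokens, prompt_len, t, rng):
    """Forward process: mask each response token with probability t.

    Prompt positions (index < prompt_len) are never masked, so the prompt
    acts purely as conditioning. Padding inside the response is treated
    like any other response token.
    """
    noisy, is_masked = [], []
    for i, tok in enumerate(tokens):
        m = i >= prompt_len and rng.random() < t
        noisy.append(MASK_ID if m else tok)
        is_masked.append(m)
    return noisy, is_masked


def sft_loss(token_logprobs, is_masked, prompt_len, t):
    """1/t-weighted NLL over masked response tokens, divided by response length."""
    response_len = len(is_masked) - prompt_len
    nll = sum(-lp for lp, m in zip(token_logprobs, is_masked) if m)
    return nll / (t * response_len)


rng = random.Random(0)
tokens = [11, 12, 13, 21, 22, 23, 0, 0]  # 3 prompt tokens, 5 response tokens (0 = padding)
t = rng.uniform(1e-3, 1.0)               # noise level, sampled once per example
noisy, is_masked = mask_response(tokens, prompt_len=3, t=t, rng=rng)
```

In a real training step the noisy sequence would be fed to the model and token_logprobs read off its output; only positions where is_masked is true contribute to the loss.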
Each example is encoded as [prompt | response], where
prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an
### Input: block when the example has an input field).
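A minimal sketch of the prompt construction follows. The no-input template is the one quoted above; the exact layout of the ### Input: block is an assumption (the usual Alpaca ordering, placed between instruction and response), and build_prompt is a hypothetical helper name rather than a function from the repository.

```python
def build_prompt(instruction, input_text=""):
    """Format one Alpaca example into the SFT prompt string.

    The "### Input:" layout is an ASSUMPTION; only the no-input
    template is taken from the model card.
    """
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


prompt = build_prompt("Give one tip for staying healthy.")
```

The response tokens are then appended after this prompt, and only they are subject to masking during SFT.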
Training setup
| | |
|---|---|
| Initialized from | nanodiff-150m-base |
| Examples | 50,760 train / 1,000 val |
| Prompt / response length | 256 / 256 tokens |
| Optimizer | AdamW |
| LR schedule | cosine, warmup 100, 7e-5 → 1e-5 |
| Batch size | 64 |
| Iterations | 5,000. Mirrors the 50M-SFT recipe exactly so the comparison is methodologically clean: only the model architecture and the (1/√width-rescaled) LR differ. |
| Precision | bf16, torch.compile |
| Hardware | NVIDIA DGX-Spark (GB10) |
| Wall-clock | ~80 minutes |
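The LR schedule row can be read as the following function. This is a sketch under assumptions: the table gives only "cosine, warmup 100, 7e-5 → 1e-5", so the linear shape of the warmup and the choice to finish the decay exactly at iteration 5,000 are guesses, and lr_at is a hypothetical name.

```python
import math

WARMUP_ITERS = 100
MAX_ITERS = 5_000
PEAK_LR = 7e-5
MIN_LR = 1e-5


def lr_at(it):
    """Learning rate at iteration `it` (linear warmup is an assumption)."""
    if it < WARMUP_ITERS:
        return PEAK_LR * (it + 1) / WARMUP_ITERS
    progress = (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * min(progress, 1.0)))
    return MIN_LR + cosine * (PEAK_LR - MIN_LR)


# Sample the schedule at a few points across training.
schedule = [lr_at(it) for it in (0, 100, 2_550, 5_000)]
```

The rate climbs to the peak over the first 100 iterations, then follows a half cosine down to the floor.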
Comparison vs the 50M sibling
LAMBADA (last-word prediction, single-pass diffusion scoring, full 5153-example test split):
| Model | Best val (SFT) | LAMBADA acc | LAMBADA PPL |
|---|---|---|---|
| nanodiff-50m-sft-alpaca | 1.38 | 14.32% | 3344 |
| nanodiff-150m-sft-alpaca | 1.27 | 15.74% | 1606 |
And for context, on the unaligned base models:
| Model | LAMBADA acc | LAMBADA PPL |
|---|---|---|
| nanodiff-50m-base | 19.83% | 834 |
| nanodiff-150m-base | 21.89% | 358 |
The alignment tax (LAMBADA accuracy drop from base → SFT) is −5.51 pp
for the 50M and −6.15 pp for the 150M, i.e., almost identical in relative
terms (about 28% of base accuracy lost) at both scales. So scaling capacity helps
both the base and the SFT model, but does not shrink the alignment tax
proportionally. The perplexity blow-up post-SFT is also similar at both scales
(about 4.0× for the 50M and 4.5× for the 150M), so the SFT distribution shift is, to first order, capacity-independent.
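The arithmetic behind these figures can be reproduced from the two tables. The numbers below are copied from them; alignment_tax is a hypothetical helper name used only for illustration.

```python
# (base, SFT) LAMBADA numbers copied from the two tables above.
acc = {"50M": (19.83, 14.32), "150M": (21.89, 15.74)}
ppl = {"50M": (834, 3344), "150M": (358, 1606)}


def alignment_tax(base_acc, sft_acc):
    """Return (drop in percentage points, drop as a fraction of base accuracy)."""
    drop = base_acc - sft_acc
    return drop, drop / base_acc


for name in acc:
    drop_pp, rel = alignment_tax(*acc[name])
    base_ppl, sft_ppl = ppl[name]
    print(f"{name}: -{drop_pp:.2f} pp, {rel:.1%} of base accuracy, "
          f"PPL ratio {sft_ppl / base_ppl:.2f}")
```

Both relative drops land within half a point of 28%, while the perplexity ratio is somewhat larger for the 150M than for the 50M.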
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: df2c6a7
- Config: sft/configs/150m_alpaca.py
- Reproduce: python scripts/prepare_sft_data.py --out-dir data/alpaca_sft then python sft/train.py --config sft/configs/150m_alpaca.py
How to use
hf download Sebasdi/nanodiff-150m-sft-alpaca nanodiff-150m-sft-alpaca.pt --local-dir checkpoints/
# interactive; note the --sft flag
python chat.py --ckpt checkpoints/nanodiff-150m-sft-alpaca.pt --sft
In --sft mode chat.py wraps each turn in the instruction template, treats
it as an independent single-turn instruction, and truncates the answer at the
model's <|endoftext|> end-marker. Prompt it with instructions:
- "Give one tip for staying healthy."
- "Write a sentence about the ocean."
- "Explain what photosynthesis is."
The repetition penalty (--rep-penalty 3.0, on by default) is required: small
diffusion LMs collapse into repetition loops without it.
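For intuition, here is one common way a repetition penalty is applied to logits. It is an assumption that chat.py uses this particular rule: the sketch shows the widely used CTRL-style penalty, and apply_repetition_penalty and seen_token_ids are hypothetical names. In a diffusion sampler, "seen" would presumably mean tokens already committed to the response.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Return a copy of `logits` with already-seen tokens made less likely.

    CTRL-style rule: positive logits are divided by `penalty`, negative
    logits are multiplied by it, so a penalized token always loses score.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]  # toy vocabulary of four tokens
penalized = apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0])
```

Tokens 0 and 1 have their scores reduced, while tokens 2 and 3, which have not appeared yet, are left untouched.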
Example
>>> Give one tip for staying healthy.
<<< One tip for staying healthy is to eat well-balanced foods that are high in
essential nutrients such as vitamins and minerals. This will help you stay
fit over the longer term by boosting your body's natural immune system
function while maintaining a steady weight loss rate throughout the day or
week if you're not getting enough sleep at all! A balanced diet can have a
wide range of benefits, including mental health, physical fitness, and
overall health.
(Sample generated at --gen-length 96 --steps 96 --rep-penalty 3.0 --seed 1337.
The 50M SFT, on the same prompt, gave a stilted "step #1, step #2, step #3"
list; the 150M's longer coherent prose is the qualitative improvement that
goes alongside the LAMBADA gain.)
Limitations
- Confabulates. A 150M model pretrained on ~3B tokens has no reliable world model, so facts are often wrong (the example above is roughly right about diet but trails off; "Write a sentence about the ocean" confidently asserts the ocean is "over 100 miles in length", which is meaningless). SFT taught it to answer, not to know.
- Grammar is mostly correct but coherence degrades on harder instructions.
- Single-turn only: trained on single-turn Alpaca; no multi-turn chat / conversation history.
- English only; no alignment or safety tuning.
- The SFT distribution shift makes this model worse at FineWeb-Edu-style text continuation than its base (LAMBADA perplexity rises about 4.5×, from 358 to 1606). Use the base model for completion tasks and this SFT model for instruction-following.
A learning artifact demonstrating the full base → SFT pipeline at the next rung up from the 50M, not a usable assistant.
Citation
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}