nanoDiff 350M, SFT (Alpaca-cleaned)
An instruction-tuned masked diffusion language model: the nanoDiff 350M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.
The largest sibling in a family of SFT checkpoints alongside nanodiff-50m-sft-alpaca and nanodiff-150m-sft-alpaca. From the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".
TL;DR
| Base model | Sebasdi/nanodiff-350m-base |
| Fine-tuned on | yahma/alpaca-cleaned (~51k examples) |
| Method | LLaDA Algorithm 2, masked-diffusion SFT with response-only masking |
| Result | SFT response NLL-bound ~1.07 · LAMBADA accuracy 25.13% |
| Parameters | ~382M total · ~317M non-embedding |
| Hardware | NVIDIA DGX-Spark (GB10), ~3 hours |
What this is
The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.
SFT changed only behavior, not knowledge. At 350M params and ~10B tokens of pretraining, the model still confabulates, though less than the 50M and 150M siblings. See Limitations.
Base model
Sebasdi/nanodiff-350m-base: a masked diffusion LM, ~382M total / ~317M non-embedding params, pretrained on ~10B tokens of FineWeb-Edu (validation perplexity ~29, LAMBADA 36.95%).
Fine-tuning data
yahma/alpaca-cleaned: ~51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training, 1,000 held out for validation.
SFT method (LLaDA Algorithm 2)
A small change from pretraining:
- the forward (masking) process masks only the response span; the prompt stays clean as pure conditioning;
- the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
- <|endoftext|> padding on short responses stays in the loss; that is what teaches the model to end its answer.
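The three rules above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the project's training code: `MASK_ID`, `mask_response`, and `sft_loss` are hypothetical names, `t` is the masking ratio for one example, and per-token negative log-likelihoods are passed in as a list instead of being computed by a model.

```python
import random

MASK_ID = -1  # hypothetical id standing in for the mask token


def mask_response(tokens, prompt_len, t, rng):
    """Mask each response token independently with probability t.

    Prompt positions (indices below prompt_len) are never masked, so the
    prompt acts as clean conditioning. End-of-text padding inside the
    response is treated like any other response token.
    """
    noisy = list(tokens)
    masked_positions = []
    for i in range(prompt_len, len(tokens)):
        if rng.random() < t:
            noisy[i] = MASK_ID
            masked_positions.append(i)
    return noisy, masked_positions


def sft_loss(token_nll, masked_positions, prompt_len, t):
    """1/t-weighted NLL summed over masked response tokens,
    normalized by the response length."""
    response_len = len(token_nll) - prompt_len
    masked_sum = sum(token_nll[i] for i in masked_positions)
    return masked_sum / (t * response_len)


# Toy example: 3 prompt tokens followed by 5 response tokens.
rng = random.Random(0)
tokens = [11, 12, 13, 21, 22, 23, 24, 25]
noisy, masked = mask_response(tokens, prompt_len=3, t=0.5, rng=rng)
loss = sft_loss([2.0] * len(tokens), masked, prompt_len=3, t=0.5)
```

Dividing by `t` compensates for the fact that a small masking ratio exposes only a few tokens to the loss, so each example contributes on a comparable scale regardless of the sampled ratio.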
Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
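As a rough sketch, the prompt template could be built as follows. The function name `build_prompt` is hypothetical, and the exact layout of the `### Input:` block (placed between instruction and response, followed by a blank line) is an assumption based on the usual Alpaca format; the repository's data-preparation script is the authority.

```python
def build_prompt(instruction, input_text=""):
    """Build the SFT prompt for one single-turn example."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        # Assumed layout for the optional input block (standard Alpaca style).
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"


print(build_prompt("Explain what photosynthesis is."))
```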
Training setup
| Initialized from | nanodiff-350m-base |
| Examples | 50,760 train / 1,000 val |
| Prompt / response length | 256 / 256 tokens |
| Optimizer | AdamW |
| LR schedule | cosine, warmup 100, 6e-5 → 1e-5 |
| Batch size | 64 |
| Iterations | 5,000. Mirrors the 50M-SFT and 150M-SFT recipes exactly so the family comparison is methodologically clean. Only the model architecture and the (1/√width-rescaled) LR differ. |
| Precision | bf16, torch.compile |
| Hardware | NVIDIA DGX-Spark (GB10) |
| Wall-clock | ~3 hours |
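The learning-rate schedule in the table can be sketched as below. The table only states "cosine, warmup 100, 6e-5 → 1e-5" over 5,000 iterations; the linear warmup shape and the choice to decay over the full run are assumptions, and `lr_at` is a hypothetical helper name.

```python
import math


def lr_at(step, max_lr=6e-5, min_lr=1e-5, warmup=100, total=5000):
    """Learning rate at a given iteration: linear warmup, then cosine decay."""
    if step < warmup:
        # Assumed: linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / (total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for step in (0, 100, 2500, 5000):
    print(step, lr_at(step))
```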
Family comparison
SFT validation loss across the three SFT runs (same dataset, schedule, batch, iter count):
| Model | Best val (SFT) | Δ from prev |
|---|---|---|
| nanodiff-50m-sft-alpaca | 1.38 | — |
| nanodiff-150m-sft-alpaca | 1.27 | -0.11 |
| nanodiff-350m-sft-alpaca | 1.07 | -0.20 |
LAMBADA last-word prediction (single-pass diffusion scoring, full 5153-example test split):
| Model | SFT LAMBADA acc | SFT LAMBADA PPL |
|---|---|---|
| nanodiff-50m-sft-alpaca | 14.32% | 3344 |
| nanodiff-150m-sft-alpaca | 15.74% | 1606 |
| nanodiff-350m-sft-alpaca | 25.13% | 286.6 |
For context, the unaligned base models:
| Model | Base LAMBADA acc | Base LAMBADA PPL |
|---|---|---|
| nanodiff-50m-base | 19.83% | 834 |
| nanodiff-150m-base | 21.89% | 358 |
| nanodiff-350m-base | 36.95% | 55.3 |
The alignment tax grows with scale
Going base → SFT costs:
| Model | Base acc | SFT acc | Δ pp | Δ relative |
|---|---|---|---|---|
| 50M | 19.83% | 14.32% | -5.51 | -27.8% |
| 150M | 21.89% | 15.74% | -6.15 | -28.1% |
| 350M | 36.95% | 25.13% | -11.82 | -32.0% |
At 50M and 150M the relative alignment tax looked constant (~28%), which suggested a capacity-independent SFT distribution shift. At 350M the tax is larger, both in absolute pp (-11.82) and slightly in relative % (-32.0%). The PPL blowup ratio (SFT PPL / base PPL) also grew across scales: 4.0× → 4.5× → 5.2×.
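The deltas and blowup ratios quoted above follow directly from the LAMBADA tables. A short script to recompute them (the numbers are copied from the tables; `alignment_tax` is a hypothetical helper name):

```python
# (base acc %, SFT acc %, base PPL, SFT PPL), copied from the tables above
results = {
    "50M": (19.83, 14.32, 834.0, 3344.0),
    "150M": (21.89, 15.74, 358.0, 1606.0),
    "350M": (36.95, 25.13, 55.3, 286.6),
}


def alignment_tax(base_acc, sft_acc, base_ppl, sft_ppl):
    """Return (delta in pp, relative delta in %, SFT/base PPL ratio)."""
    delta_pp = sft_acc - base_acc
    relative = 100.0 * delta_pp / base_acc
    ppl_ratio = sft_ppl / base_ppl
    return round(delta_pp, 2), round(relative, 1), round(ppl_ratio, 1)


for name, row in results.items():
    print(name, alignment_tax(*row))
# 50M  (-5.51, -27.8, 4.0)
# 150M (-6.15, -28.1, 4.5)
# 350M (-11.82, -32.0, 5.2)
```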
A plausible mechanism is "more to lose": a base model with more world-modeling capability has more to drift away from when SFT shifts the distribution toward the Alpaca instruction format. The 50M base was already weak on LAMBADA (19.83% / PPL 834), so SFT had little to degrade. The 350M base had real capability (36.95% / PPL 55.3), so the post-SFT drop is meaningfully larger. With only three model sizes, this remains an observation rather than an established trend.
Scaling still wins in absolute terms (25.13% > 15.74% > 14.32%), but the cost of alignment grows with scale. Worth knowing for SFT strategy at larger sizes.
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: 0513e6d
- Config: sft/configs/350m_alpaca.py
- Reproduce:
    python scripts/prepare_sft_data.py --out-dir data/alpaca_sft
    python sft/train.py --config sft/configs/350m_alpaca.py
How to use
hf download Sebasdi/nanodiff-350m-sft-alpaca nanodiff-350m-sft-alpaca.pt --local-dir checkpoints/
# interactive (note the --sft flag)
python chat.py --ckpt checkpoints/nanodiff-350m-sft-alpaca.pt --sft
In --sft mode chat.py wraps each turn in the instruction template, treats
it as an independent single-turn instruction, and truncates the answer at the
model's <|endoftext|> end-marker. Prompt it with instructions:
- "Give one tip for staying healthy."
- "Write a sentence about the ocean."
- "Explain what photosynthesis is."
The repetition penalty (--rep-penalty 3.0, on by default) is required; small
diffusion LMs collapse into repetition loops without it.
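For intuition, a repetition penalty is commonly applied to logits as in the sketch below. This is the widely used CTRL-style formulation, shown as an assumption: how chat.py implements --rep-penalty, and which positions count as "already generated" during diffusion sampling, is not described here. `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Down-weight tokens that already appear in the output.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so a seen token always becomes less likely.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out


# Tokens 0 and 1 were already produced; token 2 is untouched.
print(apply_repetition_penalty([3.0, -1.0, 2.0], seen_token_ids=[0, 1]))
# [1.0, -3.0, 2.0]
```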
Limitations
- Confabulates. A 350M model pretrained on ~10B tokens has better factual recall than the 150M, but is still unreliable. SFT taught it to answer, not to know.
- Coherence degrades on harder prompts, though noticeably less than the 150M.
- Single-turn only: trained on single-turn Alpaca, no multi-turn chat or conversation history.
- English only; no alignment or safety tuning beyond the SFT recipe.
- The SFT distribution shift makes this model worse at FineWeb-Edu-style text continuation than its base (LAMBADA PPL blows up from 55.3 to 286.6, ~5.2×). Use the base model for completion tasks, this SFT model for instruction-following.
A learning artifact demonstrating the full base → SFT pipeline at the largest scale in the nanoDiff family. Not a usable assistant.
Citation
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}