nanoDiff 350M, SFT (Alpaca-cleaned)

An instruction-tuned masked diffusion language model: the nanoDiff 350M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.

This is the largest of three SFT checkpoints, alongside nanodiff-50m-sft-alpaca and nanodiff-150m-sft-alpaca. It comes from the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".

TL;DR

Base model: Sebasdi/nanodiff-350m-base
Fine-tuned on: yahma/alpaca-cleaned (~51k examples)
Method: LLaDA Algorithm 2, masked-diffusion SFT with response-only masking
Result: SFT response NLL-bound ~1.07 · LAMBADA accuracy 25.13%
Parameters: ~382M total · ~317M non-embedding
Hardware: NVIDIA DGX-Spark (GB10), ~3 hours

What this is

The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.

SFT changed only behavior, not knowledge. At 350M params and ~10B tokens of pretraining, the model still confabulates, though less than the 50M and 150M siblings. See Limitations.

Base model

Sebasdi/nanodiff-350m-base: a masked diffusion LM, ~382M total / ~317M non-embedding params, pretrained on ~10B tokens of FineWeb-Edu (validation perplexity ~29, LAMBADA 36.95%).

Fine-tuning data

yahma/alpaca-cleaned: ~51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training, 1,000 held out for validation.

SFT method (LLaDA Algorithm 2)

A small change from pretraining:

  • the forward (masking) process masks only the response span; the prompt stays clean as pure conditioning;
  • the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
  • <|endoftext|> padding on short responses stays in the loss; that is what teaches the model to end its answer.
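The three points above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the project's training code: the `<mask>` token name, the helper names, and the toy token lists are assumptions, and the per-token log-probabilities stand in for a real model's output.

```python
import math
import random

MASK = "<mask>"          # hypothetical mask-token name
EOS = "<|endoftext|>"    # end marker, also used as padding

def pad_response(response, length):
    """Pad (or truncate) a response to a fixed length with EOS tokens."""
    return (list(response) + [EOS] * length)[:length]

def mask_response(prompt, response, t, rng):
    """Forward process: mask each response token with probability t.
    The prompt is returned untouched, so it acts as clean conditioning."""
    flags = [rng.random() < t for _ in response]
    noisy = [MASK if f else tok for f, tok in zip(flags, response)]
    return list(prompt) + noisy, flags

def sft_loss(true_token_logprobs, flags, t):
    """1/t-weighted cross-entropy summed over masked response positions,
    divided by the response length (EOS padding included)."""
    nll = -sum(lp for lp, f in zip(true_token_logprobs, flags) if f)
    return nll / (t * len(flags))

rng = random.Random(0)
prompt = ["### Instruction:", "Say", "hi", "### Response:"]
response = pad_response(["Hi", "there"], 6)
t = 0.5  # in training, t is typically drawn at random per example
noisy, flags = mask_response(prompt, response, t, rng)
# Stand-in for model output: a uniform guess over a 4-token vocabulary.
loss = sft_loss([math.log(0.25)] * len(response), flags, t)
```

Because the EOS padding positions are part of `response`, they can be masked and scored like any other token, which is how the end-of-answer behavior gets a training signal.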

Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
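As a rough illustration, the prompt side of that encoding could be built as follows. The function name `build_prompt` is hypothetical, and the exact layout of the `### Input:` block is an assumption borrowed from the common Alpaca template; only the no-input form is stated above.

```python
def build_prompt(instruction, input_text=""):
    """Build the conditioning prompt for one Alpaca-style example."""
    if input_text:
        # Assumed layout for examples that carry an input field.
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"

example = build_prompt("Give one tip for staying healthy.")
```

The response tokens are then appended after the final newline, and only that span is ever masked.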

Training setup

Initialized from: nanodiff-350m-base
Examples: 50,760 train / 1,000 val
Prompt / response length: 256 / 256 tokens
Optimizer: AdamW
LR schedule: cosine, warmup 100, 6e-5 → 1e-5
Batch size: 64
Iterations: 5,000
Precision: bf16, torch.compile
Hardware: NVIDIA DGX-Spark (GB10)
Wall-clock: ~3 hours (1.69 sec/iter)

The recipe mirrors the 50M-SFT and 150M-SFT runs exactly so the family comparison is methodologically clean. Only the model architecture and the (1/√width-rescaled) LR differ.
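For readers who want to reproduce the schedule, here is a minimal sketch of a warmup-plus-cosine learning rate with the numbers from the setup above. The function name `lr_at`, the linear shape of the warmup, and the assumption that the decay reaches 1e-5 exactly at iteration 5,000 are illustrative choices, not confirmed details of the training script.

```python
import math

def lr_at(it, max_lr=6e-5, min_lr=1e-5, warmup=100, total=5000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if it < warmup:
        return max_lr * (it + 1) / warmup
    progress = min(1.0, (it - warmup) / (total - warmup))
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Back-of-envelope figures implied by the setup.
epochs = 5000 * 64 / 50760        # passes over the training split, about 6.3
step_hours = 5000 * 1.69 / 3600   # time spent in training steps, about 2.35 h
```

The step time alone accounts for roughly 2.35 hours; the gap to the reported wall-clock is presumably evaluation, compilation, and checkpointing overhead, though that breakdown is a guess.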

Family comparison

SFT validation loss across the three SFT runs (same dataset, schedule, batch, iter count):

| Model | Best val (SFT) | Δ from prev |
|---|---|---|
| nanodiff-50m-sft-alpaca | 1.38 | n/a |
| nanodiff-150m-sft-alpaca | 1.27 | -0.11 |
| nanodiff-350m-sft-alpaca | 1.07 | -0.20 |

LAMBADA last-word prediction (single-pass diffusion scoring, full 5153-example test split):

| Model | SFT LAMBADA acc | SFT LAMBADA PPL |
|---|---|---|
| nanodiff-50m-sft-alpaca | 14.32% | 3344 |
| nanodiff-150m-sft-alpaca | 15.74% | 1606 |
| nanodiff-350m-sft-alpaca | 25.13% | 286.6 |

For context, the unaligned base models:

| Model | Base LAMBADA acc | Base LAMBADA PPL |
|---|---|---|
| nanodiff-50m-base | 19.83% | 834 |
| nanodiff-150m-base | 21.89% | 358 |
| nanodiff-350m-base | 36.95% | 55.3 |

The alignment tax grows with scale

Going base → SFT costs:

| Model | Base acc | SFT acc | Δ pp | Δ relative |
|---|---|---|---|---|
| 50M | 19.83% | 14.32% | -5.51 | -27.8% |
| 150M | 21.89% | 15.74% | -6.15 | -28.1% |
| 350M | 36.95% | 25.13% | -11.82 | -32.0% |

At 50M and 150M the relative alignment tax looked constant (~28%), which suggested a capacity-independent SFT distribution shift. At 350M the tax is larger, both in absolute pp (-11.82) and slightly in relative % (-32.0%). The PPL blowup ratio (SFT PPL / base PPL) also grew across scales: 4.0× → 4.5× → 5.2×.
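The deltas and blowup ratios quoted here follow directly from the LAMBADA numbers reported in this card. The short script below recomputes them; the dictionary layout and the function name `alignment_tax` are only illustrative.

```python
# LAMBADA accuracy (%) and perplexity, copied from the tables in this card.
results = {
    "50M":  {"base_acc": 19.83, "sft_acc": 14.32, "base_ppl": 834.0, "sft_ppl": 3344.0},
    "150M": {"base_acc": 21.89, "sft_acc": 15.74, "base_ppl": 358.0, "sft_ppl": 1606.0},
    "350M": {"base_acc": 36.95, "sft_acc": 25.13, "base_ppl": 55.3,  "sft_ppl": 286.6},
}

def alignment_tax(r):
    """Absolute drop (pp), relative drop (%), and SFT/base perplexity ratio."""
    delta_pp = r["sft_acc"] - r["base_acc"]
    return {
        "delta_pp": round(delta_pp, 2),
        "delta_rel_pct": round(100 * delta_pp / r["base_acc"], 1),
        "ppl_ratio": round(r["sft_ppl"] / r["base_ppl"], 1),
    }

tax = {name: alignment_tax(r) for name, r in results.items()}
for name, row in tax.items():
    print(name, row)
```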

A plausible mechanism is "more to lose": a base model with more world-modeling capability has more to drift away from when SFT shifts the distribution toward the Alpaca instruction format. The 50M base was already weak on LAMBADA (19.83% / PPL 834), so SFT had little absolute accuracy to erase. The 350M base had real capability (36.95% / PPL 55.3), so the post-SFT drop is meaningfully larger.

Scaling still wins in absolute terms (25.13% > 15.74% > 14.32%), but the cost of alignment grows with scale. Worth knowing for SFT strategy at larger sizes.

Provenance

How to use

hf download Sebasdi/nanodiff-350m-sft-alpaca nanodiff-350m-sft-alpaca.pt --local-dir checkpoints/

# interactive (note the --sft flag)
python chat.py --ckpt checkpoints/nanodiff-350m-sft-alpaca.pt --sft

In --sft mode chat.py wraps each turn in the instruction template, treats it as an independent single-turn instruction, and truncates the answer at the model's <|endoftext|> end-marker. Prompt it with instructions:
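The behavior described above amounts to a thin wrapper around generation. The sketch below is a guess at its shape rather than a copy of chat.py: `run_turn`, `truncate_at_eos`, and the `generate` callable are hypothetical names, and the stand-in generator simply returns a canned string.

```python
EOS = "<|endoftext|>"

def truncate_at_eos(generated):
    """Keep only the text before the first end marker."""
    return generated.split(EOS, 1)[0].rstrip()

def run_turn(instruction, generate):
    """Wrap one instruction in the template and return the trimmed answer.
    No history is passed in, so every turn is independent."""
    prompt = f"### Instruction:\n{instruction}\n\n### Response:\n"
    return truncate_at_eos(generate(prompt))

# Stand-in for the model: records prompts and returns a padded answer.
seen_prompts = []
def fake_generate(prompt):
    seen_prompts.append(prompt)
    return "Drink enough water." + EOS * 3 + " leftover tokens"

first = run_turn("Give one tip for staying healthy.", fake_generate)
second = run_turn("Write a sentence about the ocean.", fake_generate)
```

Since each call builds its prompt from scratch, the second turn never sees the first, which matches the single-turn training data.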

  • "Give one tip for staying healthy."
  • "Write a sentence about the ocean."
  • "Explain what photosynthesis is."

The repetition penalty (--rep-penalty 3.0, on by default) is required; small diffusion LMs collapse into repetition loops without it.
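For intuition, here is one common way a repetition penalty is applied to logits, in the style popularized by CTRL and used by several generation libraries. Whether chat.py uses exactly this formulation is an assumption; the function name and the toy logits are illustrative only.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Make already-generated tokens less likely.
    Positive logits are divided by the penalty, negative ones multiplied,
    so the score moves down in both cases."""
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out

# Tokens 0 and 1 were already produced; token 2 is untouched.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1, 1])
```

With a penalty of 3.0 the previously dominant token 0 drops from a logit of 2.0 to about 0.67, which is enough to let other candidates win and break a loop.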

Limitations

  • Confabulates. A 350M model pretrained on ~10B tokens has better factual recall than the 150M, but is still unreliable. SFT taught it to answer, not to know.
  • Coherence degrades on harder prompts, though noticeably less than the 150M.
  • Single-turn only: trained on single-turn Alpaca, no multi-turn chat or conversation history.
  • English only; no alignment or safety tuning beyond the SFT recipe.
  • The SFT distribution shift makes this model worse at FineWeb-Edu-style text continuation than its base (LAMBADA PPL blows up from 55.3 to 286.6, ~5.2×). Use the base model for completion tasks and this SFT model for instruction following.

A learning artifact demonstrating the full base → SFT pipeline at the largest scale in the nanoDiff family. Not a usable assistant.

Citation

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}