nanoDiff 350M, SFT (Alpaca-cleaned)

An instruction-tuned masked diffusion language model: the nanoDiff 350M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.

This is the largest of three SFT checkpoints in the family, alongside nanodiff-50m-sft-alpaca and nanodiff-150m-sft-alpaca. It comes from the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".

TL;DR


Base model	Sebasdi/nanodiff-350m-base
Fine-tuned on	yahma/alpaca-cleaned (~51.8k examples)
Method	LLaDA Algorithm 2, masked-diffusion SFT with response-only masking
Result	SFT validation loss (response NLL bound) ~1.07 · LAMBADA accuracy 25.13%
Parameters	~382M total · ~317M non-embedding
Hardware	NVIDIA DGX-Spark (GB10), ~3 hours

What this is

The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.

SFT changed only behavior, not knowledge. At 350M params and ~10B tokens of pretraining, the model still confabulates, though less than the 50M and 150M siblings. See Limitations.

Base model

Sebasdi/nanodiff-350m-base: a masked diffusion LM, ~382M total / ~317M non-embedding params, pretrained on ~10B tokens of FineWeb-Edu (validation perplexity ~29, LAMBADA 36.95%).

Fine-tuning data

yahma/alpaca-cleaned: ~51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training, 1,000 held out for validation.

SFT method (LLaDA Algorithm 2)

A small change from pretraining:

the forward (masking) process masks only the response span; the prompt stays clean as pure conditioning;
the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
<|endoftext|> padding on short responses stays in the loss; that is what teaches the model to end its answer.

Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
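The encoding and masking rules above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not the code in the nanoDiff repository: the function names, the `MASK_ID` value, and the exact layout of the `### Input:` block (taken here from the usual Alpaca template) are all hypothetical, and tokenization is skipped, so the response is represented directly as a list of token ids.

```python
import random

MASK_ID = -1  # hypothetical mask-token id, chosen only for this sketch


def build_prompt(instruction, input_text=""):
    """Instruction template; the Input block layout is an assumption."""
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


def mask_response(prompt_ids, response_ids, t, rng):
    """Mask each response token with probability t; the prompt stays clean."""
    noised = list(prompt_ids)
    is_masked = [False] * len(prompt_ids)
    for tok in response_ids:
        hit = rng.random() < t
        noised.append(MASK_ID if hit else tok)
        is_masked.append(hit)
    return noised, is_masked


def sft_loss(token_nll, is_masked, t, response_len):
    """1/t-weighted NLL over masked positions, normalized by response length."""
    total = sum(nll for nll, m in zip(token_nll, is_masked) if m)
    return total / (t * response_len)


# Toy example: 3 prompt tokens, 4 response tokens (padding would count too).
rng = random.Random(0)
prompt_ids = [11, 12, 13]
response_ids = [21, 22, 23, 24]
noised, is_masked = mask_response(prompt_ids, response_ids, t=0.5, rng=rng)
```

Because `response_ids` would include the `<|endoftext|>` padding, those positions can be masked and scored like any other response token, which is how the end-of-answer signal enters the loss.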

Training setup


Initialized from	nanodiff-350m-base
Examples	50,760 train / 1,000 val
Prompt / response length	256 / 256 tokens
Optimizer	AdamW
LR schedule	cosine, warmup 100, 6e-5 → 1e-5
Batch size	64
Iterations	5,000 (mirrors the 50M-SFT and 150M-SFT recipes exactly, so the family comparison is methodologically clean; only the model architecture and the 1/√width-rescaled LR differ)
Precision	bf16, `torch.compile`
Hardware	NVIDIA DGX-Spark (GB10)
Wall-clock	~3 hours (~1.69 sec/iter)
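For readers who want to see the shape of the learning-rate schedule in the table, here is a minimal sketch. The peak, floor, warmup length, and iteration count come from the table; the linear warmup shape and the choice to decay over the full 5,000 iterations are assumptions, and `lr_at` is a hypothetical name rather than a function from the repository.

```python
import math


def lr_at(it, max_lr=6e-5, min_lr=1e-5, warmup=100, total=5000):
    """Linear warmup (assumed) followed by cosine decay from max_lr to min_lr."""
    if it < warmup:
        return max_lr * (it + 1) / warmup
    if it >= total:
        return min_lr
    progress = (it - warmup) / (total - warmup)
    coeff = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + coeff * (max_lr - min_lr)


# Sample the schedule at a few iterations.
samples = {it: lr_at(it) for it in (0, 100, 2550, 5000)}
```

The midpoint of the decay (iteration 2,550 here) sits halfway between the peak and the floor, which is a quick sanity check on any cosine schedule.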

Family comparison

SFT validation loss across the three SFT runs (same dataset, schedule, batch, iter count):

Model	Best val (SFT)	Δ from prev
nanodiff-50m-sft-alpaca	1.38	—
nanodiff-150m-sft-alpaca	1.27	-0.11
nanodiff-350m-sft-alpaca	1.07	-0.20

LAMBADA last-word prediction (single-pass diffusion scoring, full 5153-example test split):

Model	SFT LAMBADA acc	SFT LAMBADA PPL
nanodiff-50m-sft-alpaca	14.32%	3344
nanodiff-150m-sft-alpaca	15.74%	1606
nanodiff-350m-sft-alpaca	25.13%	286.6

For context, the unaligned base models:

Model	Base LAMBADA acc	Base LAMBADA PPL
nanodiff-50m-base	19.83%	834
nanodiff-150m-base	21.89%	358
nanodiff-350m-base	36.95%	55.3

The alignment tax grows with scale

Going base → SFT costs:

Model	Base acc	SFT acc	Δ pp	Δ relative
50M	19.83%	14.32%	-5.51	-27.8%
150M	21.89%	15.74%	-6.15	-28.1%
350M	36.95%	25.13%	-11.82	-32.0%

At 50M and 150M the relative alignment tax looked constant (~28%), which suggested a capacity-independent SFT distribution shift. At 350M the tax is larger, both in absolute pp (-11.82) and slightly in relative % (-32.0%). The PPL blowup ratio (SFT PPL / base PPL) also grew across scales: 4.0× → 4.5× → 5.2×.
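The figures in this section follow directly from the LAMBADA tables above. The short check below recomputes them; the dictionary layout and the `alignment_tax` name are just for illustration.

```python
# LAMBADA accuracy (%) and perplexity, copied from the tables above.
results = {
    "50M": {"base_acc": 19.83, "sft_acc": 14.32, "base_ppl": 834.0, "sft_ppl": 3344.0},
    "150M": {"base_acc": 21.89, "sft_acc": 15.74, "base_ppl": 358.0, "sft_ppl": 1606.0},
    "350M": {"base_acc": 36.95, "sft_acc": 25.13, "base_ppl": 55.3, "sft_ppl": 286.6},
}


def alignment_tax(r):
    """Return (delta in pp, relative delta in %, SFT/base PPL ratio), rounded."""
    delta_pp = r["sft_acc"] - r["base_acc"]
    relative = 100.0 * delta_pp / r["base_acc"]
    ppl_ratio = r["sft_ppl"] / r["base_ppl"]
    return round(delta_pp, 2), round(relative, 1), round(ppl_ratio, 1)


for name, r in results.items():
    print(name, alignment_tax(r))
```

The relative drop divides the percentage-point change by the base accuracy, so it measures how much of the base model's LAMBADA ability was lost rather than the raw gap.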

A plausible mechanism is "more to lose": a base model with more world-modeling capability has more to drift away from when SFT shifts the distribution toward the Alpaca instruction format. The 50M base was already weak on LAMBADA (19.83% / PPL 834), so SFT had little to degrade in absolute terms. The 350M base had real capability (36.95% / PPL 55.3), so the post-SFT drop is meaningfully larger. With only three model sizes, this remains a hypothesis rather than a tested result.

Scaling still wins in absolute terms (25.13% > 15.74% > 14.32%), but the cost of alignment grows with scale. Worth knowing for SFT strategy at larger sizes.

Provenance

Code: https://github.com/BY571/nanoDiff
Trained at commit: 0513e6d
Config: sft/configs/350m_alpaca.py

Reproduce:

python scripts/prepare_sft_data.py --out-dir data/alpaca_sft
python sft/train.py --config sft/configs/350m_alpaca.py

How to use

hf download Sebasdi/nanodiff-350m-sft-alpaca nanodiff-350m-sft-alpaca.pt --local-dir checkpoints/

# interactive (note the --sft flag)
python chat.py --ckpt checkpoints/nanodiff-350m-sft-alpaca.pt --sft

In --sft mode chat.py wraps each turn in the instruction template, treats it as an independent single-turn instruction, and truncates the answer at the model's <|endoftext|> end-marker. Prompt it with instructions:

"Give one tip for staying healthy."
"Write a sentence about the ocean."
"Explain what photosynthesis is."
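The single-turn behavior described above can be pictured with a small sketch. It is not the code in chat.py: `answer_turn` and the `generate` callable are hypothetical stand-ins, and the truncation is done on strings here for simplicity, whereas the real script may work on token ids.

```python
EOT = "<|endoftext|>"


def answer_turn(user_text, generate):
    """Wrap one turn in the instruction template and cut at the end marker.

    `generate` is a hypothetical callable mapping a prompt string to raw
    model output; no conversation history is passed between turns.
    """
    prompt = f"### Instruction:\n{user_text}\n\n### Response:\n"
    raw = generate(prompt)
    return raw.split(EOT, 1)[0].strip()


# Stand-in generator so the sketch runs without a checkpoint.
def fake_generate(prompt):
    return "Drink enough water." + EOT + EOT


reply = answer_turn("Give one tip for staying healthy.", fake_generate)
```

Since each call builds a fresh prompt from the current turn only, earlier turns cannot influence the answer, which matches the single-turn limitation noted below.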

The repetition penalty (--rep-penalty 3.0, on by default) is required; small diffusion LMs collapse into repetition loops without it.
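The model card does not say how the penalty is implemented. As an illustration only, the sketch below uses the common CTRL-style formulation, in which logits of already-generated tokens are divided by the penalty when positive and multiplied by it when negative. The function name and the list-based logits are assumptions, and nanoDiff's actual rule may differ.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """CTRL-style penalty (assumed): make already-seen tokens less likely."""
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] = out[tok] / penalty
        else:
            out[tok] = out[tok] * penalty
    return out


# Tokens 0 and 1 were already generated; token 2 was not.
penalized = apply_repetition_penalty([2.0, -1.0, 0.5], seen_token_ids=[0, 1, 0])
```

Under this formulation a penalty of 1.0 leaves the logits unchanged, and larger values push repeated tokens further down regardless of the sign of their logit.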

Limitations

Confabulates. A 350M model pretrained on ~10B tokens has better factual recall than the 150M, but is still unreliable. SFT taught it to answer, not to know.
Coherence degrades on harder prompts, though noticeably less than the 150M.
Single-turn only: trained on single-turn Alpaca, no multi-turn chat or conversation history.
English only; no alignment or safety tuning beyond the SFT recipe.
The SFT distribution shift makes this model worse at FineWeb-Edu-style text continuation than its base (LAMBADA PPL blows up from 55.3 to 286.6, ~5.2×). Use the base model for completion tasks, this SFT model for instruction-following.

A learning artifact demonstrating the full base → SFT pipeline at the largest scale in the nanoDiff family. Not a usable assistant.

Citation

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}
