nanoDiff 50M — SFT (Alpaca-cleaned)

An instruction-tuned masked diffusion language model: the nanoDiff 50M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.

From the nanoDiff project — a minimal, hackable "nanoGPT for diffusion LLMs".

TL;DR


Base model	Sebasdi/nanodiff-50m-base
Fine-tuned on	yahma/alpaca-cleaned (~51k examples)
Method	LLaDA Algorithm 2 — masked-diffusion SFT, response-only masking
Result	SFT response NLL-bound ~1.38
Hardware	NVIDIA DGX-Spark (GB10), ~45 min

What this is

The base model is a document continuer — give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.

SFT changed only behavior, not knowledge. This is still a 50M model and it confabulates freely — see Limitations.

Base model

Sebasdi/nanodiff-50m-base — a masked diffusion LM, ~88M total / ~50M non-embedding params, pretrained on ~2B tokens of FineWeb-Edu (validation perplexity ~50).

Fine-tuning data

yahma/alpaca-cleaned: 51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). Of these, 50,760 were used for training and 1,000 were held out for validation.
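A hold-out split of this size could be produced along the following lines. This is only a sketch: the seeded shuffle, the seed value, and the function name `split_indices` are assumptions for illustration, and the project's data-preparation script may split differently.

```python
import random


def split_indices(n_total=51760, n_val=1000, seed=0):
    """Shuffle example indices with a fixed seed and hold out the last n_val."""
    idx = list(range(n_total))
    random.Random(seed).shuffle(idx)
    return idx[:-n_val], idx[-n_val:]


train_idx, val_idx = split_indices()
print(len(train_idx), len(val_idx))
```

Fixing the seed keeps the validation set identical across runs, so validation losses from different runs stay comparable.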

SFT method — LLaDA Algorithm 2

A small change from pretraining:

the forward (masking) process masks only the response span — the prompt stays clean as pure conditioning;
the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
<|endoftext|> padding on short responses stays in the loss — that is what teaches the model to end its answer.

Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
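The steps above can be sketched end to end in plain Python. This is a minimal illustration under stated assumptions, not the project's training code: `MASK_ID`, `EOS_ID`, the helper names, the layout of the `### Input:` block (taken from the usual Alpaca convention), and the lower bound on `t` are all hypothetical, and the model's log-probabilities are supplied by hand instead of coming from a network.

```python
import math
import random

MASK_ID = -1  # hypothetical mask-token id, used only in this sketch
EOS_ID = 0    # hypothetical id standing in for <|endoftext|>


def build_prompt(instruction, input_text=None):
    """Render the instruction template; the "### Input:" layout is an assumption."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"


def pad_response(response_ids, length, eos_id=EOS_ID):
    """Truncate or right-pad a tokenized response to `length` with the EOS id."""
    ids = list(response_ids)[:length]
    return ids + [eos_id] * (length - len(ids))


def mask_response(prompt_ids, response_ids, t, rng):
    """Mask each response token independently with probability t; keep the prompt clean."""
    noisy = list(prompt_ids)
    is_masked = [False] * len(prompt_ids)
    for tok in response_ids:
        hit = rng.random() < t
        noisy.append(MASK_ID if hit else tok)
        is_masked.append(hit)
    return noisy, is_masked


def sft_loss(true_token_log_probs, is_masked, t, response_len):
    """Sum -log p / t over masked positions, then divide by the response length."""
    total = sum(-lp / t for lp, m in zip(true_token_log_probs, is_masked) if m)
    return total / response_len


rng = random.Random(0)
prompt_ids = [11, 12, 13]                        # stand-in for the tokenized prompt
response_ids = pad_response([21, 22], length=4)  # EOS padding stays in the loss
t = rng.uniform(1e-3, 1.0)                       # masking level; lower bound is an assumption
noisy, is_masked = mask_response(prompt_ids, response_ids, t, rng)
# A real model would score `noisy`; here every true token gets probability 0.5.
log_probs = [math.log(0.5)] * len(noisy)
loss = sft_loss(log_probs, is_masked, t, len(response_ids))
```

Because the padded response is masked and scored like any other response token, the end marker contributes to the loss, which matches the reason given above for keeping the padding in.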

Training setup


Initialized from	nanodiff-50m-base
Examples	50,760 train / 1,000 val
Prompt / response length	256 / 256 tokens
Optimizer	AdamW
LR schedule	cosine, warmup 100, 1e-4 → 1e-5
Batch size	64
Iterations	5,000 (the instruction format is learned within ~200 iterations, but the loss keeps falling slowly for several thousand more before flattening near 1.38)
Precision	bf16, `torch.compile`
Hardware	NVIDIA DGX-Spark (GB10)
Wall-clock	~45 minutes
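The learning-rate row of the table can be read as the following schedule. This is a sketch under assumptions: the warmup is taken to be linear, the cosine decay is taken to span the remaining iterations up to 5,000, and `lr_at` is a hypothetical name. The last line estimates the number of passes over the training split, assuming the batch size counts examples.

```python
import math


def lr_at(step, max_lr=1e-4, min_lr=1e-5, warmup=100, total=5000):
    """Linear warmup to max_lr, then cosine decay to min_lr at `total` steps."""
    if step < warmup:
        return max_lr * (step + 1) / warmup
    progress = min(1.0, (step - warmup) / max(1, total - warmup))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


# Passes over the training split implied by the table: iterations * batch / examples.
epochs = 5000 * 64 / 50760
print(f"{epochs:.2f} epochs")
```

Under these assumptions the run covers roughly 6.3 epochs of the 50,760 training examples.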

Provenance

Code: https://github.com/BY571/nanoDiff
Trained at commit: c1a65ed
Config: sft/configs/50m_alpaca.py
Reproduce: python scripts/prepare_sft_data.py --out-dir data/alpaca_sft then python sft/train.py --config sft/configs/50m_alpaca.py

How to use

hf download Sebasdi/nanodiff-50m-sft-alpaca nanodiff-50m-sft-alpaca.pt --local-dir checkpoints/

# interactive — note the --sft flag
python chat.py --ckpt checkpoints/nanodiff-50m-sft-alpaca.pt --sft

In --sft mode chat.py wraps each turn in the instruction template, treats it as an independent single-turn instruction, and truncates the answer at the model's <|endoftext|> end-marker. Prompt it with instructions:

"Give three tips for staying healthy."
"Write a sentence about the ocean."
"Explain what photosynthesis is."

The repetition penalty (--rep-penalty 3.0, on by default) is required — small diffusion LMs collapse into repetition loops without it.
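As a rough illustration of the two decode-time steps mentioned in this section, the sketch below applies a repetition penalty to a list of logits and truncates generated text at the end marker. The penalty rule shown is the common form that divides positive logits and multiplies negative ones; whether chat.py implements --rep-penalty this way is an assumption, and both function names are hypothetical.

```python
EOS = "<|endoftext|>"


def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Make already-generated tokens less likely (for penalty > 1).

    Positive logits are divided by the penalty and non-positive logits are
    multiplied by it, so a seen token's score never increases.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        out[tok] = out[tok] / penalty if out[tok] > 0 else out[tok] * penalty
    return out


def truncate_at_eos(text, eos=EOS):
    """Keep only the text before the first end marker, trimmed of whitespace."""
    return text.split(eos, 1)[0].strip()


penalized = apply_repetition_penalty([3.0, -1.0, 2.0], seen_token_ids=[0, 1])
answer = truncate_at_eos("Drink water.<|endoftext|><|endoftext|>")
```

Here tokens 0 and 1 have already been generated, so their logits drop while token 2 is left unchanged.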

Example

>>> Give one tip for staying healthy.
<<< One tip ... is by following the following steps: #1. Get enough rest and
    Exercise ... #2. engage in regular exercise ... #3. Take breaks ...

Limitations

Confabulates. A 50M model pretrained on ~2B tokens has no reliable world model — facts are often wrong. SFT taught it to answer, not to know.
Grammar is imperfect and coherence degrades on harder instructions.
Single-turn only — trained on single-turn Alpaca; no multi-turn chat / conversation history.
English only; no alignment or safety tuning.

A learning artifact demonstrating the full base → SFT pipeline at tiny scale — not a usable assistant.

Citation

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}
