nanoDiff 50M - SFT (Alpaca-cleaned)

An instruction-tuned masked diffusion language model: the nanoDiff 50M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.

From the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".

TL;DR

  • Base model: Sebasdi/nanodiff-50m-base
  • Fine-tuned on: yahma/alpaca-cleaned (~51k examples)
  • Method: LLaDA Algorithm 2 (masked-diffusion SFT, response-only masking)
  • Result: SFT response NLL-bound ~1.38
  • Hardware: NVIDIA DGX-Spark (GB10), ~45 min

What this is

The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format, so given an instruction it produces an on-topic answer.

SFT changed only behavior, not knowledge. This is still a 50M model and it confabulates freely; see Limitations.

Base model

Sebasdi/nanodiff-50m-base: a masked diffusion LM with ~88M total / ~50M non-embedding params, pretrained on ~2B tokens of FineWeb-Edu (validation perplexity ~50).

Fine-tuning data

yahma/alpaca-cleaned: ~51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training and 1,000 were held out for validation.

SFT method: LLaDA Algorithm 2

A small change from pretraining:

  • the forward (masking) process masks only the response span, so the prompt stays clean as pure conditioning;
  • the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
  • <|endoftext|> padding on short responses stays in the loss, which is what teaches the model to end its answer.
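The three points above can be sketched in plain Python. This is a minimal illustration, not the nanoDiff training code: the names `mask_response` and `sft_loss`, the string `MASK` token, and the constant per-token NLL are all hypothetical stand-ins, and a real implementation would work on token-id tensors and sample t per example.

```python
import math
import random

MASK = "<mask>"  # hypothetical mask token; a real model uses a reserved token id


def mask_response(prompt, response, t, rng):
    """Forward process: mask each response token independently with probability t.

    The prompt is returned untouched, so it acts as clean conditioning.
    """
    is_masked = [rng.random() < t for _ in response]
    noisy = [MASK if m else tok for tok, m in zip(response, is_masked)]
    return prompt + noisy, is_masked


def sft_loss(token_nll, is_masked, t):
    """1/t-weighted NLL summed over masked response tokens, divided by response length.

    token_nll[i] stands for the model's -log p(response[i] | noisy sequence).
    """
    total = sum(nll for nll, m in zip(token_nll, is_masked) if m)
    return total / (t * len(is_masked))


rng = random.Random(0)
prompt = ["### Instruction:", "Say", "hi", "### Response:"]
# Short responses are padded with the end-of-text token, which stays in the loss.
response = ["Hi", "there", "<|endoftext|>", "<|endoftext|>"]

t = 0.5  # masking ratio; illustrative value only
noisy, is_masked = mask_response(prompt, response, t, rng)

# Stand-in for a model: pretend every target token gets probability 0.25.
token_nll = [-math.log(0.25)] * len(response)
print(noisy)
print(sft_loss(token_nll, is_masked, t))
```

Because the padding tokens are part of `response`, they can be masked and scored like any other token, which is how the end-of-answer behavior enters the objective.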

Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
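A small sketch of the prompt template follows. The Instruction/Response layout is quoted from the line above; the exact placement and spacing of the `### Input:` block is an assumption (it follows the usual Alpaca layout), and `build_prompt` is a hypothetical helper name, not a function from the repository.

```python
def build_prompt(instruction, input_text=""):
    """Format one Alpaca-style example into the SFT prompt string.

    The layout of the "### Input:" block is assumed, not confirmed by the source.
    """
    if input_text:
        return (
            f"### Instruction:\n{instruction}\n\n"
            f"### Input:\n{input_text}\n\n"
            "### Response:\n"
        )
    return f"### Instruction:\n{instruction}\n\n### Response:\n"


print(build_prompt("Give three tips for staying healthy."))
print(build_prompt("Summarize the text.", "The ocean covers most of the planet."))
```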

Training setup

  • Initialized from: nanodiff-50m-base
  • Examples: 50,760 train / 1,000 val
  • Prompt / response length: 256 / 256 tokens
  • Optimizer: AdamW
  • LR schedule: cosine, warmup 100, 1e-4 → 1e-5
  • Batch size: 64
  • Iterations: 5,000. The instruction format is learned within ~200 iterations, but the loss keeps descending slowly for several thousand more before flattening near 1.38.
  • Precision: bf16, torch.compile
  • Hardware: NVIDIA DGX-Spark (GB10)
  • Wall-clock: ~45 minutes
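The learning-rate schedule in the training setup can be sketched as a function of the iteration index. The peak, floor, warmup length, and iteration count are taken from the setup; the linear shape of the warmup and the function name `get_lr` are assumptions, since the source only says "cosine, warmup 100".

```python
import math

MAX_LR = 1e-4     # peak learning rate (from the training setup)
MIN_LR = 1e-5     # final learning rate (from the training setup)
WARMUP = 100      # warmup iterations (from the training setup)
MAX_ITERS = 5000  # total iterations (from the training setup)


def get_lr(it):
    """Linear warmup to MAX_LR (assumed), then cosine decay to MIN_LR at MAX_ITERS."""
    if it < WARMUP:
        return MAX_LR * (it + 1) / WARMUP
    if it >= MAX_ITERS:
        return MIN_LR
    progress = (it - WARMUP) / (MAX_ITERS - WARMUP)
    return MIN_LR + 0.5 * (MAX_LR - MIN_LR) * (1.0 + math.cos(math.pi * progress))


for it in (0, 99, 100, 2550, 5000):
    print(it, get_lr(it))
```

In a PyTorch training loop, a function like this would typically be applied by writing its value into each optimizer parameter group's `lr` before the step.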

Provenance

  • Code: https://github.com/BY571/nanoDiff
  • Trained at commit: c1a65ed
  • Config: sft/configs/50m_alpaca.py
  • Reproduce: python scripts/prepare_sft_data.py --out-dir data/alpaca_sft then python sft/train.py --config sft/configs/50m_alpaca.py

How to use

hf download Sebasdi/nanodiff-50m-sft-alpaca nanodiff-50m-sft-alpaca.pt --local-dir checkpoints/

# interactive; note the --sft flag
python chat.py --ckpt checkpoints/nanodiff-50m-sft-alpaca.pt --sft

In --sft mode chat.py wraps each turn in the instruction template, treats it as an independent single-turn instruction, and truncates the answer at the model's <|endoftext|> end-marker. Prompt it with instructions:

  • "Give three tips for staying healthy."
  • "Write a sentence about the ocean."
  • "Explain what photosynthesis is."
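The truncation step that --sft mode applies to the generated text can be illustrated as follows. This is only a sketch of the idea; `truncate_at_eos` is a hypothetical name, and chat.py may implement the cut on token ids rather than on the decoded string.

```python
EOS = "<|endoftext|>"


def truncate_at_eos(text, eos=EOS):
    """Keep only the text before the first end-of-text marker."""
    return text.split(eos, 1)[0].strip()


# A fixed-length response window is usually filled with end markers after the answer.
raw = "Drink enough water every day." + EOS + EOS + EOS
print(truncate_at_eos(raw))
```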

The repetition penalty (--rep-penalty 3.0, on by default) is required: small diffusion LMs collapse into repetition loops without it.
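For readers unfamiliar with repetition penalties, here is one common formulation (the CTRL-style rule), written over a plain list of logits. It is an assumption that nanoDiff's --rep-penalty works along these lines; the source does not describe the exact rule, and `apply_repetition_penalty` is a hypothetical name.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Return a copy of logits with already-used tokens made less likely.

    Positive logits are divided by the penalty and non-positive ones are
    multiplied by it, so a penalized score never moves up.
    """
    out = list(logits)
    for tok in set(seen_token_ids):
        if out[tok] > 0:
            out[tok] /= penalty
        else:
            out[tok] *= penalty
    return out


logits = [2.0, -1.0, 0.5, 3.0]
# Tokens 0 and 1 already appear in the output; tokens 2 and 3 do not.
print(apply_repetition_penalty(logits, seen_token_ids=[0, 1, 0]))
```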

Example

>>> Give one tip for staying healthy.
<<< One tip ... is by following the following steps: #1. Get enough rest and
    Exercise ... #2. engage in regular exercise ... #3. Take breaks ...

Limitations

  • Confabulates. A 50M model pretrained on ~2B tokens has no reliable world model, so facts are often wrong. SFT taught it to answer, not to know.
  • Grammar is imperfect and coherence degrades on harder instructions.
  • Single-turn only: trained on single-turn Alpaca, with no multi-turn chat or conversation history.
  • English only; no alignment or safety tuning.

A learning artifact demonstrating the full base → SFT pipeline at tiny scale, not a usable assistant.

Citation

@article{nie2025llada,
  title   = {Large Language Diffusion Models},
  author  = {Nie, Shen and others},
  journal = {arXiv preprint arXiv:2502.09992},
  year    = {2025}
}