nanoDiff 50M: SFT (Alpaca-cleaned)
An instruction-tuned masked diffusion language model: the nanoDiff 50M base, supervised-fine-tuned on Alpaca-cleaned so it follows instructions instead of merely continuing text.
From the nanoDiff project, a minimal, hackable "nanoGPT for diffusion LLMs".
TL;DR
| Base model | Sebasdi/nanodiff-50m-base |
| Fine-tuned on | yahma/alpaca-cleaned (~51k examples) |
| Method | LLaDA Algorithm 2: masked-diffusion SFT, response-only masking |
| Result | SFT response NLL-bound ~1.38 |
| Hardware | NVIDIA DGX-Spark (GB10), ~45 min |
What this is
The base model is a document continuer: give it text and it continues it. This fine-tuned version has additionally learned the instruction → response format: given an instruction, it produces an on-topic answer.
SFT changed only behavior, not knowledge. This is still a 50M model and it confabulates freely; see Limitations.
Base model
Sebasdi/nanodiff-50m-base: a masked diffusion LM, ~88M total / ~50M non-embedding params, pretrained on ~2B tokens of FineWeb-Edu (validation perplexity ~50).
Fine-tuning data
yahma/alpaca-cleaned: 51,760 single-turn instruction/response examples (the community-curated Alpaca, with the original's factual and formatting errors fixed). 50,760 were used for training and 1,000 were held out for validation.
SFT method: LLaDA Algorithm 2
A small change from pretraining:
- the forward (masking) process masks only the response span, so the prompt stays clean as pure conditioning;
- the 1/t-weighted cross-entropy loss is computed over masked response tokens, normalized by the response length;
- <|endoftext|> padding on short responses stays in the loss; that is what teaches the model to end its answer.
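The masking and loss rules above can be sketched in plain Python. This is a framework-free illustration, not the repository's training code: `MASK_ID`, `mask_response`, and `sft_loss` are hypothetical names, and `token_logprobs` stands in for whatever the model would predict for the true token at each position.

```python
import math
import random

MASK_ID = -1  # hypothetical mask token id, for illustration only


def mask_response(prompt_ids, response_ids, t, rng):
    """Mask each response token independently with probability t.

    The prompt is returned unchanged, so it acts as clean conditioning.
    """
    noisy_response = [MASK_ID if rng.random() < t else tok for tok in response_ids]
    return prompt_ids + noisy_response


def sft_loss(token_logprobs, noisy_ids, prompt_len, t):
    """1/t-weighted negative log-likelihood over masked response positions,
    normalized by the response length.

    token_logprobs[i] is the log-probability assigned to the true token at
    position i (a stand-in for a real model's output).
    """
    response_len = len(noisy_ids) - prompt_len
    masked_nll = sum(
        -token_logprobs[i]
        for i in range(prompt_len, len(noisy_ids))
        if noisy_ids[i] == MASK_ID
    )
    return masked_nll / (t * response_len)


rng = random.Random(0)
prompt = [11, 12, 13]
response = [21, 22, 23, 24]        # in practice this would include end-marker padding
t = rng.uniform(1e-3, 1.0)         # one masking ratio per example
noisy = mask_response(prompt, response, t, rng)
fake_logprobs = [math.log(0.5)] * len(noisy)  # pretend the model is 50% sure everywhere
loss = sft_loss(fake_logprobs, noisy, len(prompt), t)
```

Because padding positions are part of the response span, they are masked and scored like any other response token, which is how the end-of-answer signal enters the loss.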
Each example is encoded as [prompt | response], where prompt = "### Instruction:\n{instruction}\n\n### Response:\n" (with an ### Input: block when the example has an input field).
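The prompt template can be written as a small helper. This is a minimal sketch: `build_prompt` is a hypothetical name, and the placement of the optional Input block between the instruction and the response header is an assumption based on the usual Alpaca layout, since only its header is named here.

```python
def build_prompt(instruction, input_text=""):
    """Format one example's prompt in the instruction template."""
    prompt = f"### Instruction:\n{instruction}\n\n"
    if input_text:
        # Assumed layout for the optional block (standard Alpaca ordering).
        prompt += f"### Input:\n{input_text}\n\n"
    return prompt + "### Response:\n"


example_prompt = build_prompt("Write a sentence about the ocean.")
```

The response tokens are then appended after the final "### Response:\n" line, so everything before that point is conditioning only.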
Training setup
| Initialized from | nanodiff-50m-base |
| Examples | 50,760 train / 1,000 val |
| Prompt / response length | 256 / 256 tokens |
| Optimizer | AdamW |
| LR schedule | cosine, warmup 100, 1e-4 → 1e-5 |
| Batch size | 64 |
| Iterations | 5,000 (the instruction format is learned within ~200 iterations, but the loss keeps descending slowly for several thousand more before flattening near 1.38) |
| Precision | bf16, torch.compile |
| Hardware | NVIDIA DGX-Spark (GB10) |
| Wall-clock | ~45 minutes |
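The learning-rate schedule in the table can be sketched as follows. The numbers come from the table; the linear shape of the warmup and the choice to end the cosine decay exactly at iteration 5,000 are assumptions, and `lr_at` is a hypothetical helper rather than the function used in sft/train.py.

```python
import math

WARMUP_ITERS = 100
MAX_ITERS = 5_000
MAX_LR = 1e-4
MIN_LR = 1e-5


def lr_at(it):
    """Linear warmup to MAX_LR, then cosine decay to MIN_LR at MAX_ITERS."""
    if it < WARMUP_ITERS:
        return MAX_LR * (it + 1) / WARMUP_ITERS
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (it - WARMUP_ITERS) / (MAX_ITERS - WARMUP_ITERS))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + cosine * (MAX_LR - MIN_LR)
```

If the batch size counts examples, 5,000 iterations at batch size 64 process 320,000 examples, which is roughly 6.3 passes over the 50,760 training examples.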
Provenance
- Code: https://github.com/BY571/nanoDiff
- Trained at commit: c1a65ed
- Config: sft/configs/50m_alpaca.py
- Reproduce: python scripts/prepare_sft_data.py --out-dir data/alpaca_sft, then python sft/train.py --config sft/configs/50m_alpaca.py
How to use
hf download Sebasdi/nanodiff-50m-sft-alpaca nanodiff-50m-sft-alpaca.pt --local-dir checkpoints/
# interactive; note the --sft flag
python chat.py --ckpt checkpoints/nanodiff-50m-sft-alpaca.pt --sft
In --sft mode chat.py wraps each turn in the instruction template, treats
it as an independent single-turn instruction, and truncates the answer at the
model's <|endoftext|> end-marker. Prompt it with instructions:
- "Give three tips for staying healthy."
- "Write a sentence about the ocean."
- "Explain what photosynthesis is."
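The truncation step described for --sft mode can be illustrated on decoded text. This is a sketch under assumptions: `truncate_at_end_marker` is a hypothetical helper, and chat.py may instead cut at the token-id level.

```python
END_MARKER = "<|endoftext|>"


def truncate_at_end_marker(text, marker=END_MARKER):
    """Return the text before the first end marker, stripped of outer whitespace."""
    head, _, _ = text.partition(marker)
    return head.strip()


answer = truncate_at_end_marker("The ocean is vast.<|endoftext|><|endoftext|>")
```

Text without a marker passes through unchanged apart from the whitespace strip, so the helper is safe to apply to every generated answer.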
The repetition penalty (--rep-penalty 3.0, on by default) is required: small
diffusion LMs collapse into repetition loops without it.
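For readers unfamiliar with repetition penalties, here is a minimal sketch of the common CTRL-style formulation. How chat.py applies --rep-penalty is not shown in this card, so the formulation and the name `apply_repetition_penalty` are assumptions for illustration.

```python
def apply_repetition_penalty(logits, seen_token_ids, penalty=3.0):
    """Make already-generated tokens less likely (CTRL-style penalty).

    Positive logits are divided by the penalty and non-positive logits are
    multiplied by it, so a penalized token's score never increases.
    """
    penalized = list(logits)
    for tok in set(seen_token_ids):
        if penalized[tok] > 0:
            penalized[tok] /= penalty
        else:
            penalized[tok] *= penalty
    return penalized


# Tokens 0 and 1 were already generated; token 2 was not.
scores = apply_repetition_penalty([3.0, -1.0, 2.0], seen_token_ids=[0, 1])
```

With a penalty of 3.0, a previously generated token needs a much higher raw score than a fresh one to be chosen again, which is what breaks the loops.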
Example
>>> Give one tip for staying healthy.
<<< One tip ... is by following the following steps: #1. Get enough rest and
Exercise ... #2. engage in regular exercise ... #3. Take breaks ...
Limitations
- Confabulates. A 50M model pretrained on ~2B tokens has no reliable world model, so facts are often wrong. SFT taught it to answer, not to know.
- Grammar is imperfect and coherence degrades on harder instructions.
- Single-turn only: trained on single-turn Alpaca; no multi-turn chat / conversation history.
- English only; no alignment or safety tuning.
A learning artifact demonstrating the full base → SFT pipeline at tiny scale, not a usable assistant.
Citation
@article{nie2025llada,
title = {Large Language Diffusion Models},
author = {Nie, Shen and others},
journal = {arXiv preprint arXiv:2502.09992},
year = {2025}
}