SmolLM3-3B-summarize-sft-lora

I'm working through the Hugging Face Smol Fine-Tuning Language Models course. This is the artifact from Unit 1, Exercise 3: a LoRA adapter that turns SmolLM3-3B-Base into a model that reads a document and writes a one-sentence summary in chat format. The base model has never seen a conversation; the adapter teaches that on top. It's the SFT step of a longer arc (Base → SFT → DPO → publish), and the next unit layers preference alignment on this checkpoint.

What the adapter learns

Three things at once, in 1,425 optimizer steps. The base model is a pretrained language model in the pure sense. It knows English and it continues text, but it can't tell a user prompt from an assistant response. The SFT here teaches:

Chat format. Emit ChatML structure (<|im_start|>assistant\n... <|im_end|>) and stop at the right token.
The task. Read a document, produce a third-person summary instead of continuing the document.
Length discipline. Match the reference summary's scope rather than running to the generation budget.
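
To make the chat-format point concrete, here is a minimal sketch of what a ChatML prompt looks like as plain text. It is a simplification: the real template shipped with this repo's tokenizer does more (it handles the enable_thinking flag, for one), and the helper names to_chatml and first_turn are hypothetical, not part of any library.

```python
# Simplified ChatML rendering for illustration only. The tokenizer's real chat
# template does more than this; to_chatml and first_turn are hypothetical names.
IM_START = "<|im_start|>"
IM_END = "<|im_end|>"


def to_chatml(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as ChatML text."""
    parts = [f"{IM_START}{m['role']}\n{m['content']}{IM_END}\n" for m in messages]
    if add_generation_prompt:
        # Open the assistant turn; a fine-tuned model writes the summary here
        # and then emits IM_END to close the turn.
        parts.append(f"{IM_START}assistant\n")
    return "".join(parts)


def first_turn(generated):
    """Keep only the text before the first end-of-turn marker."""
    return generated.split(IM_END, 1)[0].strip()


prompt = to_chatml([{"role": "user", "content": "Summarize the following:\n\n<document>"}])
print(prompt)
print(first_turn("Isabella is proposing a virtual coffee chat.<|im_end|>"))
```

The base model has no reason to produce the closing marker, which is why its output runs to the generation budget; the adapter is what makes the marker appear after one sentence.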

LoRA only touches ~0.97% of the model's parameters (30M of 3.08B). The rest is frozen.
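
The 30M figure can be checked by hand from the rank and target modules listed under Hyperparameters. The sketch below assumes SmolLM3-3B's layer shapes (hidden size 2048, MLP width 11008, 36 layers, key/value projections of width 512); those dimensions are not stated in this card, so treat them as assumptions.

```python
# Assumed SmolLM3-3B dimensions (not stated in this card).
HIDDEN = 2048          # model width
INTERMEDIATE = 11008   # MLP width
KV_DIM = 512           # key/value projection width under grouped-query attention
LAYERS = 36
RANK = 16              # LoRA rank from the hyperparameters

# (in_features, out_features) of each targeted projection in one layer.
PROJECTIONS = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(rank, shapes, layers):
    """Count LoRA weights: each adapted matrix gets A (rank x in) and B (out x rank)."""
    per_layer = sum(rank * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * layers


total = lora_params(RANK, PROJECTIONS, LAYERS)
print(f"{total:,}")
```

With these assumed shapes the count comes out to 30,228,480, which matches the trainable-parameter figure reported under Hyperparameters.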

Before / after on a held-out email

The input was a ~250-word email from Isabella to Alejandro proposing a virtual coffee chat. I ran it through both the base model and the fine-tuned adapter with identical prompts and greedy decoding.

Base model:

Dear Alejandro, I'm thrilled to hear that you're interested in exploring the connections between our work... I would be delighted to meet for a virtual coffee chat before the conference. How about next Wednesday at 10 am your time? ...

The base model treats the email as text to continue. It writes more email back, in Isabella's voice. No stop token; runs until the generation budget cuts it off.

Fine-tuned adapter:

Isabella is proposing a virtual coffee chat on Wednesday at 10 am to discuss collaboration on a joint project.<|im_end|>

One sentence. Third-person. Stops itself. Same model, same prompt. The LoRA is doing all of it.

All 5 held-out demos are in generations_before.json and generations_after.json in this repo, with the reference summaries for comparison.

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model in bf16, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base", dtype="bfloat16")
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora")
# The tokenizer in the adapter repo carries the chat template used in training.
tokenizer = AutoTokenizer.from_pretrained("tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora")

messages = [{"role": "user", "content": "Summarize the following:\n\n<document>"}]
# Render the ChatML prompt as text and open the assistant turn, with thinking off.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; the adapter should end the turn on its own.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
# Decode only the newly generated tokens, not the prompt.
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

Pass enable_thinking=False. I trained this on SmolLM3's /no_think data. The tokenizer in this repo ships with the right chat template; just don't forget the flag.

Training data

HuggingFaceTB/smoltalk2, config SFT, split smoltalk_smollm3_smol_summarize_no_think. That's 96,061 (document, summary) pairs from SmolLM3's own SFT corpus, in /no_think mode.

Subsample: 12,000 rows (sized to fit a single one-epoch cloud run in a sane budget; I'd train on more next time).
Split: 95% train (11,400) / 5% eval (600), seed 42, built before training.
Max sequence length: 2,304 tokens, set just above the p99 of the token-length distribution (p95=1,544, p99=2,063, max=2,517) so that truncation touches fewer than 1% of examples and almost never clips the assistant summary.
Loss masking: assistant-only via the chat template's {% generation %} tags. The model is graded only on the summary tokens, not on the input document.
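
The data-preparation steps above can be sketched in plain Python. This is not the training script: the actual run relies on TRL and the chat template's generation tags, and the rounding rule in length_cap (p99 rounded up to the next multiple of 256) is an assumption that happens to reproduce 2,304 from 2,063. All three helper names are hypothetical.

```python
import math
import random

IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy skips by default


def split_indices(n_rows, eval_fraction=0.05, seed=42):
    """Shuffle row indices with a fixed seed and carve off an eval slice."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_eval = round(n_rows * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


def length_cap(p99_tokens, multiple=256):
    """Round the p99 length up to the next multiple (assumed rounding rule)."""
    return math.ceil(p99_tokens / multiple) * multiple


def assistant_only_labels(token_ids, assistant_mask):
    """Copy token ids into labels, hiding every non-assistant position."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(token_ids, assistant_mask)]


train_idx, eval_idx = split_indices(12_000)
print(len(train_idx), len(eval_idx))
print(length_cap(2_063))
# Toy example: two document tokens followed by two summary tokens.
print(assistant_only_labels([11, 12, 13, 14], [0, 0, 1, 1]))
```

Positions labeled -100 contribute nothing to the loss, so the gradient comes only from the summary tokens.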

Hyperparameters

LoRA on the attention + MLP projections, base frozen:


| Hyperparameter | Value |
|---|---|
| LoRA rank `r` | 16 |
| LoRA `alpha` | 32 |
| LoRA dropout | 0.05 |
| Target modules | `q_proj`, `k_proj`, `v_proj`, `o_proj`, `gate_proj`, `up_proj`, `down_proj` |
| Trainable params | 30,228,480 (0.97% of 3.08B) |
| Effective batch | 8 (per_device=1 × grad_accum=8) |
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Schedule | Cosine, 3% warmup |
| Epochs | 1 |
| Mixed precision | bf16 |
| Seed | 42 |

Training used TRL's SFTTrainer from a custom script. Source: github.com/tuggspeedman-ai/hf-smol-course, see notebooks/unit1/exercise3_sft_lora.py.
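
For a feel of what "Cosine, 3% warmup" means over this run, here is a minimal sketch of a linear-warmup, cosine-decay schedule using the numbers above. It is a generic formula rather than the scheduler code the run actually used, and rounding the warmup up to 43 steps is an assumption.

```python
import math

TOTAL_STEPS = 1425                            # optimizer steps reported for this run
PEAK_LR = 2e-4                                # learning rate from the hyperparameters
WARMUP_STEPS = math.ceil(0.03 * TOTAL_STEPS)  # 3% warmup, rounded up (assumed)


def lr_at(step, total=TOTAL_STEPS, warmup=WARMUP_STEPS, peak=PEAK_LR):
    """Linear warmup from 0 to `peak`, then cosine decay back to 0."""
    if step < warmup:
        return peak * step / warmup
    progress = (step - warmup) / max(1, total - warmup)
    return peak * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, WARMUP_STEPS, TOTAL_STEPS // 2, TOTAL_STEPS):
    print(step, f"{lr_at(step):.2e}")
```

The learning rate peaks at 2e-4 once warmup ends and then falls smoothly, so most of the large updates happen early in the run.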

Hardware


| Resource | Value |
|---|---|
| GPU | 1× NVIDIA A100 80GB (HF Jobs flavor `a100-large`) |
| Wall time | 96.9 min for 1,425 optimizer steps (~4.08 s/step) |
| Cost | ~$4 of A100 time |

I smoke-tested the same code locally on a 48GB Apple M4 Pro (Metal/MPS) first. 23 s/step there vs ~4 s/step on the A100. The cloud run was a single submission via hf jobs uv run against my own script.
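
The step count and per-step time follow directly from the 11,400-row training split and the effective batch of 8. The short check below reproduces them. The local-hardware estimate is an extrapolation that assumes the 23 s/step from the smoke test would hold for a full epoch, which was not measured.

```python
TRAIN_ROWS = 11_400
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 8
WALL_MINUTES = 96.9
LOCAL_SECONDS_PER_STEP = 23  # M4 Pro smoke test

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM
steps = TRAIN_ROWS // effective_batch          # one epoch; divides evenly here
seconds_per_step = WALL_MINUTES * 60 / steps   # A100 average over the whole run
local_hours = steps * LOCAL_SECONDS_PER_STEP / 3600  # extrapolated, not measured

print(steps)
print(round(seconds_per_step, 2))
print(round(local_hours, 1))
```

Under that assumption the same epoch would take roughly nine hours on the laptop, against about an hour and a half on the A100.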

Results


| Metric | Value |
|---|---|
| Train loss (first → last logged) | 1.0313 → 0.5630 |
| Train loss (averaged) | 0.5495 |
| Eval loss | 0.44 |

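Eval loss (0.44) is lower than train loss (0.5630 at the last logged step, 0.5495 averaged), so there is no sign of overfitting on the 12k-row run. Part of the gap probably comes from LoRA dropout, which is active during training and switched off during evaluation. The adapter would likely improve with a longer run or more data. The loss drops fastest in the first ~300 steps as the model picks up the chat format, then settles into a slower decline as it refines summary style.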

Full per-step history is in metrics.json. The resolved training config is in config.json. Token-length distribution analysis is in length_stats.json.

What's still missing

Narrow domain. Training data leans on personal/professional emails, news articles, and short reports. Anything outside that (legal text, code, dense technical writing) likely degrades.
No /think mode. I only trained on _no_think data. Forcing enable_thinking=True at inference is out of distribution.
English only.
No preference alignment yet. That's Unit 2 of the arc (DPO on summary preferences). This SFT stage just teaches the format and the task.
No safety tuning. The adapter inherits the base model's behavior on harmlessness, and the base model has had no safety training.
Sample efficiency unexplored. I haven't ablated a longer run, a higher LoRA rank, or a full fine-tune on a larger cloud box. Plenty of headroom.

Related artifacts

tuggspeedman-ai/SmolLM3-3B-trl-cli-demo is the same SFT recipe reproduced via TRL's stock sft.py CLI on a smaller dataset. That's Exercise 4 of the course (the production CLI workflow); this one is Exercise 3 (custom Python).
Course: HF Smol Fine-Tuning Language Models, Unit 1.
Portfolio: jonathanavni.com/projects.

Citation

@misc{avni2026smollm3summarize,
  title  = {SmolLM3-3B-summarize-sft-lora},
  author = {Avni, Jonathan},
  year   = {2026},
  url    = {https://huggingface.co/tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora},
}
