SmolLM3-3B-summarize-sft-lora
I'm working through the Hugging Face Smol Fine-Tuning Language Models course. This is the artifact from Unit 1, Exercise 3: a LoRA adapter that turns SmolLM3-3B-Base into a model that reads a document and writes a one-sentence summary in chat format. The base model has never seen a conversation; the adapter teaches that on top. It's the SFT step of a longer arc (Base → SFT → DPO → publish), and the next unit layers preference alignment on this checkpoint.
What the adapter learns
Three things at once, in 1,425 optimizer steps. The base model is a pretrained language model in the pure sense. It knows English, it continues text, but it doesn't know the difference between a user prompt and an assistant response. The SFT here teaches:
- Chat format. Emit ChatML structure (<|im_start|>assistant\n... <|im_end|>) and stop at the right token.
- The task. Read a document, produce a third-person summary instead of continuing the document.
- Length discipline. Match the reference summary's scope rather than running to the generation budget.
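To make the chat-format point concrete, here is a minimal sketch of what a bare ChatML prompt looks like and how a completion can be cut at the stop marker. Both helpers (`render_chatml`, `trim_at_stop`) are hypothetical and simplified; the chat template shipped with the tokenizer may add more (for example a system block), so in practice `tokenizer.apply_chat_template` is the thing to use.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render role/content dicts into a bare ChatML string (simplified, hypothetical helper)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        # Open an assistant turn; the fine-tuned model is expected to close it with <|im_end|>.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


def trim_at_stop(generated, stop="<|im_end|>"):
    """Keep only the text before the first stop marker, if one was emitted."""
    return generated.split(stop, 1)[0].strip()


prompt = render_chatml([{"role": "user", "content": "Summarize the following:\n\n<document>"}])
print(prompt)
print(trim_at_stop("Isabella is proposing a virtual coffee chat.\n<|im_end|>"))
```

A base model that never emits `<|im_end|>` leaves `trim_at_stop` with nothing to cut, which is the "runs until the generation budget" behavior described for the base model.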
LoRA only touches ~0.97% of the model's parameters (30M of 3.08B). The rest is frozen.
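The trainable-parameter count can be sanity-checked by hand: a rank-r LoRA pair on a linear layer with `in` inputs and `out` outputs adds r × (in + out) weights. The sketch below assumes model dimensions that are not stated in this card (hidden size 2048, MLP size 11008, 36 layers, key/value projection width 512); treat them as assumptions to verify against the base model's config.json. Under those assumptions the total matches the 30,228,480 reported in the hyperparameter table.

```python
R = 16  # LoRA rank from this card

# ASSUMED dimensions, not stated in this card; check the base model's config.json.
HIDDEN, INTERMEDIATE, LAYERS = 2048, 11008, 36
KV_DIM = 512  # assumed width of the grouped key/value projections

# (in_features, out_features) for each targeted projection in one decoder layer.
shapes = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, KV_DIM),
    "v_proj": (HIDDEN, KV_DIM),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}


def lora_params(in_features, out_features, r):
    """A (r x in) plus B (out x r): the two low-rank matrices LoRA adds to one linear layer."""
    return r * (in_features + out_features)


per_layer = sum(lora_params(i, o, R) for i, o in shapes.values())
total = per_layer * LAYERS
print(f"{per_layer:,} per layer, {total:,} total")
```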
Before / after on a held-out email
The input was a ~250-word email from Isabella to Alejandro proposing a virtual coffee chat. I ran it through both the base model and the fine-tuned adapter with identical prompts and greedy decoding.
Base model:
Dear Alejandro, I'm thrilled to hear that you're interested in exploring the connections between our work... I would be delighted to meet for a virtual coffee chat before the conference. How about next Wednesday at 10 am your time? ...
The base model treats the email as text to continue. It writes more email back, in Isabella's voice. No stop token; runs until the generation budget cuts it off.
Fine-tuned adapter:
Isabella is proposing a virtual coffee chat on Wednesday at 10 am to discuss collaboration on a joint project.
<|im_end|>
One sentence. Third-person. Stops itself. Same model, same prompt. The LoRA is doing all of it.
All 5 held-out demos are in generations_before.json and generations_after.json in this repo, with the reference summaries for comparison.
How to use
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base model in bf16, then attach the LoRA adapter on top.
base = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM3-3B-Base", dtype="bfloat16")
model = PeftModel.from_pretrained(base, "tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora")
# The tokenizer in the adapter repo carries the chat template used in training.
tokenizer = AutoTokenizer.from_pretrained("tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora")

messages = [{"role": "user", "content": "Summarize the following:\n\n<document>"}]
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=False,
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; decode only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Pass enable_thinking=False. I trained this on SmolLM3's /no_think data. The tokenizer in this repo ships with the right chat template; just don't forget the flag.
Training data
HuggingFaceTB/smoltalk2, config SFT, split smoltalk_smollm3_smol_summarize_no_think. That's 96,061 (document, summary) pairs from SmolLM3's own SFT corpus, in /no_think mode.
- Subsample: 12,000 rows (sized to fit a single one-epoch cloud run in a sane budget; I'd train on more next time).
- Split: 95% train (11,400) / 5% eval (600), seed 42, built before training.
- Max sequence length: 2,304 tokens, picked from the p99 of the token-length distribution (p95=1,544, p99=2,063, max=2,517) so truncation almost never clips the assistant summary.
- Loss masking: assistant-only via the chat template's {% generation %} tags. The model is graded only on the summary tokens, not on the input document.
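Assistant-only loss masking comes down to replacing the label of every non-assistant token with an ignore value so it contributes nothing to the loss. The sketch below is a library-free illustration, not the training script's code: the token ids and mask are made up, and `assistant_only_labels` is a hypothetical helper. In the real pipeline the mask comes from the chat template's generation tags.

```python
IGNORE_INDEX = -100  # PyTorch's cross-entropy loss skips targets equal to this value by default


def assistant_only_labels(input_ids, assistant_mask):
    """Copy input_ids into labels, blanking every position the assistant did not write."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy example with made-up ids: four prompt/document tokens, then three summary tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]
labels = assistant_only_labels(input_ids, assistant_mask)
print(labels)
```

With this masking, a long input document costs compute but never pulls the gradient toward reproducing the document itself.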
Hyperparameters
LoRA on the attention + MLP projections, base frozen:
| LoRA rank r | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Trainable params | 30,228,480 (0.97% of 3.08B) |
| Effective batch | 8 (per_device=1 × grad_accum=8) |
| Optimizer | AdamW |
| Learning rate | 2e-4 |
| Schedule | Cosine, 3% warmup |
| Epochs | 1 |
| Mixed precision | bf16 |
| Seed | 42 |
Training used TRL's SFTTrainer from a custom script. Source: github.com/tuggspeedman-ai/hf-smol-course, see notebooks/unit1/exercise3_sft_lora.py.
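For readers who want to reuse the adapter settings, the table above maps onto keyword arguments along these lines. This is a sketch, not an excerpt from the training script: the dict name is hypothetical, `task_type` is an assumption not listed in the table, and the keys are meant to be passed to `peft.LoraConfig`.

```python
# Keyword arguments mirroring the hyperparameter table; intended for peft.LoraConfig(**lora_kwargs).
lora_kwargs = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
    "task_type": "CAUSAL_LM",  # ASSUMPTION: not stated in the table
}

# With standard LoRA scaling, the low-rank update is multiplied by alpha / r.
scaling = lora_kwargs["lora_alpha"] / lora_kwargs["r"]
print(f"LoRA scaling factor: {scaling}")
```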
Hardware
| GPU | 1× NVIDIA A100 80GB (HF Jobs flavor a100-large) |
| Wall time | 96.9 min for 1,425 optimizer steps (~4.08 s/step) |
| Cost | ~$4 of A100 time |
I smoke-tested the same code locally on a 48GB Apple M4 Pro (Metal/MPS) first. 23 s/step there vs ~4 s/step on the A100. The cloud run was a single submission via hf jobs uv run against my own script.
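The step count and throughput figures follow from the numbers already given; a quick arithmetic check, using only values from this card:

```python
import math

train_rows = 11_400          # 95% of the 12,000-row subsample
effective_batch = 1 * 8      # per_device=1 x grad_accum=8
wall_minutes = 96.9          # A100 wall time
local_sec_per_step = 23      # M4 Pro smoke test

steps = math.ceil(train_rows / effective_batch)   # one epoch of optimizer steps
a100_sec_per_step = wall_minutes * 60 / steps
local_hours = steps * local_sec_per_step / 3600   # projected full run on the laptop

print(f"{steps} steps, {a100_sec_per_step:.2f} s/step on the A100")
print(f"about {local_hours:.1f} hours for the same run at the local step time")
```

The projected local time is an extrapolation from the smoke-test step time, not a measured run.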
Results
| Train loss (first → last logged) | 1.0313 → 0.5630 |
| Train loss (averaged) | 0.5495 |
| Eval loss | 0.44 |
Eval loss (0.44) is lower than both the last logged train loss (0.5630) and the run-averaged train loss (0.5495), so there is no sign of overfitting on the 12k-row run; the adapter would likely improve with more data or a longer run. The loss drops fastest in the first ~300 steps as the model picks up the chat format, then plateaus into a slower decline as it refines summary style.
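One way to read these losses is as per-token perplexity, exp(loss). This assumes the reported values are mean per-token cross-entropy in nats over the unmasked summary tokens, which is the usual convention but is not stated explicitly here.

```python
import math


def perplexity(mean_cross_entropy):
    """Convert a mean per-token cross-entropy (in nats) to perplexity."""
    return math.exp(mean_cross_entropy)


# Loss values copied from the results table.
for name, loss in [("first train", 1.0313), ("last train", 0.5630), ("eval", 0.44)]:
    print(f"{name}: loss {loss} -> perplexity {perplexity(loss):.2f}")
```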
Full per-step history is in metrics.json. The resolved training config is in config.json. Token-length distribution analysis is in length_stats.json.
What's still missing
- Narrow domain. Training data leans on personal/professional emails, news articles, and short reports. Anything outside that (legal text, code, dense technical writing) likely degrades.
- No /think mode. I only trained on _no_think data. Forcing enable_thinking=True at inference is out of distribution.
- English only.
- No preference alignment yet. That's Unit 2 of the arc (DPO on summary preferences). This SFT stage just teaches the format and the task.
- No safety tuning. Inherits the base model's behavior on harmlessness, which is none.
- Sample efficiency unexplored. A longer run, a higher LoRA rank, a full fine-tune on a larger cloud box. I haven't ablated any of it. Plenty of headroom.
Related artifacts
- tuggspeedman-ai/SmolLM3-3B-trl-cli-demo is the same SFT recipe reproduced via TRL's stock sft.py CLI on a smaller dataset. That's Exercise 4 of the course (the production CLI workflow); this one is Exercise 3 (custom Python).
- Course: HF Smol Fine-Tuning Language Models, Unit 1.
- Portfolio: jonathanavni.com/projects.
Citation
@misc{avni2026smollm3summarize,
title = {SmolLM3-3B-summarize-sft-lora},
author = {Avni, Jonathan},
year = {2026},
url = {https://huggingface.co/tuggspeedman-ai/SmolLM3-3B-summarize-sft-lora},
}