- Building a Specialised IELTS Content Model β Complete Technical Report
- TL;DR (the honest conclusion)
- What we built and proved (the achievements, stated precisely)
- 1. Journey at a glance
- 2. Goal & constraints
- 3. Hardware & infrastructure
- 4. Choosing the base model β the Hugging Face bake-off
- 5. The data β format, volumes, composition
- 6. Training configuration (exact)
- 7. The v10 failure β the lesson that drove everything after
- 8. Evaluation methodology
- 9. Results β verdict judging (greedy, 101-gold, canonical)
- 10. Results β generation quality, per type Γ per model (the critic-proof part)
- 11. What training changes vs base, and what size changes
- 12. Verification & re-checking β and the trained-vs-untrained difference
- 13. The architecture that actually works
- 14. Deployment / GGUF status (honest)
- 15. Anticipated objections (pre-answered)
- 16. Caveats
- Appendix β repro pointers
- TL;DR (the honest conclusion)
Building a Specialised IELTS Content Model β Complete Technical Report
PrepareBuddy Β· end-to-end write-up (what, when, how) Β· 2026-06 A fine-tune of open Apache-2.0 base models β not a foundation model built from scratch. The training dataset is private; this documents the method, hardware, configs, data volumes, and results in full.
TL;DR (the honest conclusion)
We built models that generate exam-authentic IELTS content across all four sections β Reading (TFNG, YNNG, MCQ, sentence/summary completion, matching, longform), Writing (Task 1 & 2), Speaking (Parts 1β3), Listening β passages/transcripts + questions + answer keys/model answers. After a multi-stage data rebuild and a 2B/4B/9B size study:
- Most of the system is already strong. 12 of 14 item types generate at a high bar with accurate facts and authentic format. The two verdict types (TFNG/YNNG) are the hard part β they require constructing false/unstated statements β and consumed most of the engineering. A hard 2 of 14, not the whole project.
- Data or size? Both, in two regimes. The defensible finding: fine-tuning's benefit is inversely proportional to base capability. Balanced data transformed a 2B (+40 points, 40β80% on verdicts) and was essentially flat on the already-capable 4B (73β74) and 9B (79β77). On that one skill β verdict accuracy, on a held-out set β the fine-tuned 2B matched and edged a model 4.5Γ its size (a 1β3 point margin), while staying weaker on facts and completion. Data can rival size on a target skill β within limits.
- For a capable base, fine-tuning's value is generation (format & self-containment), not reasoning.
- The reliable product = grounded generation + a verification pass β not a single magic model. Measured end-to-end: ~75% raw β ~85β90% with grounding + verification + light review (Β§12).
What we built and proved (the achievements, stated precisely)
- A full-section IELTS generator β all 14 item types across Reading / Writing / Speaking / Listening, with 12 of 14 already strong (accurate facts, authentic format, fluent model answers).
- A rigorous size study (2B / 4B / 9B, identical data, deterministic eval) that actually answers "data or size?" β fine-tuning a 2B reached 80% on verdicts, rivaling a 9B 4.5Γ its size.
- A 3-iteration data fix that drove absurd NOT-GIVEN questions from 40% β 0% (773/800 distinct phrasings) β verdict generation went from gibberish to natural.
- A working verification / re-checking system β a trained 2B catches ~80% of verdict errors (vs 40% untrained); paired with grounding, it makes the output reliable. Training a small model pays off twice β better generator AND better verifier (see Β§12).
- A reproducible cloud training + evaluation harness β LoRA on Qwen3.5 (exact config in Β§6), a deterministic held-out gold judge, and a per-type generation grader.
- Three open models (2B / 4B / 9B) released for transparency β "don't take our word, check our math."
- Everything measured and published β every weakness named with its fix. The honesty is the product.
1. Journey at a glance
flowchart LR
A["Phase 1<br/>SmolLM2-1.7B<br/>30 examples<br/>format only"] --> B["Phase 2-3<br/>SmolLM3-3B<br/>~1,655 ex<br/>shipped 3B (live)"]
B --> C{"Base bake-off<br/>Apache-2.0 only"}
C --> D["Qwen3.5-4B<br/>probe 92% vs 3B 75%<br/>(thinking HURT)"]
D --> E["Local MLX train<br/>KERNEL PANIC x3"]
E --> F["Move to AWS<br/>PyTorch + peft"]
F --> G["v10 fine-tune<br/>FAILED<br/>2% NG β NG 0/7"]
G --> H["Data rebuild<br/>600 balanced verdicts<br/>NG fix v3βv4βv5"]
H --> I["v5 study 2B/4B/9B<br/>DATA CAN RIVAL SIZE<br/>ft 2B = 80%"]
style G fill:#ffd0d0
style I fill:#d0ffd0
One-chart takeaway β verdict accuracy, base vs fine-tuned (greedy, 101-item held-out gold):
base fine-tuned
2B ββββββββββββββββ 40% β ββββββββββββββββ 80% (+40 β data teaches the skill)
4B ββββββββββββββββ 73% β ββββββββββββββββ 74% (flat β base already capable)
9B ββββββββββββββββ 79% β ββββββββββββββββ 77% (flat)
β² fine-tuned 2B (80%) rivals/edges base & ft 9B (77-79%) on verdict accuracy β
a 1-3pt edge on ONE skill (the 2B is weaker on facts & completion; see Β§10)
2. Goal & constraints
- Goal: a self-contained generator of IELTS Academic content across all four sections, with correct answer keys.
- Hard constraints: Apache-2.0 base models only (commercial use + redistribution); no copyrighted IELTS prep content (Cambridge etc.); runs locally on Apple Silicon (MLX) or a small cloud GPU; deployable as GGUF (Ollama / LM Studio). A fine-tune, never a foundation model.
3. Hardware & infrastructure
| Stage | Machine | GPU / RAM | Role | Notes |
|---|---|---|---|---|
| Phase 1β3, all inference/eval | MacBook Pro M4 Pro | 24 GB unified, MLX | local LoRA (small models) + all eval | training the larger model kernel-panicked macOS (GPU-memory-churn driver bug, IOGPUGroupMemory.cpp:528); inference was always safe |
| 4B + 2B training | AWS g5.2xlarge | A10G 24 GB, 8 vCPU | bf16 LoRA fits at seq 2048 | on-demand ~$1.21/hr, us-east-1 |
| 9B training | AWS g6e.2xlarge (spot) | L40S 48 GB, 8 vCPU | 9B bf16 LoRA needs >24 GB | spot ~$1.0β1.5/hr, us-east-1 |
- Cloud stack: Deep Learning OSS Nvidia AMI (Ubuntu 22.04), PyTorch 2.7.0+cu128,
transformers5.10,peft0.19, full bf16 (no 4-bit). HF token +hf_transferfor fast model pulls. - Why two box types: A10G (24 GB) holds the 4B/2B in bf16 at seq 2048; the 9B's bf16 weights (
18 GB) + activations exceed 24 GB, so it needs the L40S (48 GB). Total study cost: a few GPU-hours ($5β7). - Inference/eval stayed local (MLX) for reproducibility and zero cost.
4. Choosing the base model β the Hugging Face bake-off
Apache-2.0 was a hard gate that removed strong candidates. We climbed the size/capability ladder:
| Base model | Params | License | Why considered | Shortcoming found | Outcome |
|---|---|---|---|---|---|
| SmolLM2-360M | 0.36B | Apache-2.0 | tiny, instant | learns structure tokens, not composition/logic | β too small |
| SmolLM2-1.7B | 1.7B | Apache-2.0 | small, runs locally | format β, answer-logic weak at 30 ex | β Phase-1 proof |
| SmolLM3-3B | 3B | Apache-2.0 | next rung | working generator β released openly on Hugging Face (still live) β but verdict reasoning plateaued ~75% | β shipped 3B |
| Gemma 1 / 2 / 3 | β | restrictive | strong | licence terms didn't fit our open-redistribution plans | β excluded |
| Gemma 4 | ~4B eff. | newer | newest, strong | novel Per-Layer-Embedding arch β LoRA/GGUF/deploy risk | β οΈ not adopted |
| Qwen3-4B | 4B | Apache-2.0 | strong benchmarks | superseded by 3.5 mid-project | interim |
| Qwen3.5-4B | 4B | Apache-2.0 | best at its size, std Llama-style arch | "thinking" mode hurt verdicts β train no-think | β CHOSEN |
Why we moved off the 3B: the openly-released SmolLM3-3B works (it's still live), but its verdict reasoning plateaued ~75%, and we couldn't tell if that was a size limit or a data limit. That question launched the Qwen3.5 study.
The deciding selection probe (24 hard verdicts):
SmolLM3-3B ββββββββββββββββββββ 75%
Qwen3.5-4B (no-think) βββββββββββββββββββ 92% β winner, thinking OFF
Counter-intuitive: turning on Qwen3.5's reasoning mode lowered verdict accuracy β we train/serve with enable_thinking=False, no thinking data. (The 92% is the easy 24-item selection probe; the 73β80% later are the harder 101-item final gold β same models, tougher set, never conflated.)
5. The data β format, volumes, composition
Format: JSONL, one {"messages":[{system},{user},{assistant}]} per line, UTF-8 English. User prompt schema: <TEST=IELTS><SECTION=β¦><TYPE=β¦><DIFF=β¦><TOPIC=β¦> <instruction>; assistant = the full item (PASSAGE/QUESTIONS/ANSWER KEY, or TASK/MODEL ANSWER).
Provenance: in-house curated content + Claude/Codex-generated items, every batch QA'd (grounding, balance, diversity, no fabrication). No copyrighted prep material.
Volumes at each stage:
| Stage | Base | Train / Valid | Total | What |
|---|---|---|---|---|
| Phase 1 | SmolLM2-1.7B | 24 / 6 | 30 | format proof |
| Phase 2β3 | 3B line | 1,534 / 121 | ~1,655 (data_phase3_v7) |
the shipped 3B generator |
| v10 (failed) | Qwen3.5-4B | ~1,534 | ~1,534 | imbalanced (~2% NG) |
| v5 (final) | Qwen3.5 2B/4B/9B | 1,438 / 130 | 1,568 (data_v5) |
balanced verdicts + curated non-verdict |
v5 composition (1,568 rows): 600 balanced verdict items (TFNG 300 + YNNG 300; 800 NG statements; ~33% NOT GIVEN, label-balanced 400/400/400) + 968 curated non-verdict items (MCQ 250, Writing 250, Speaking 127, Listening 141, completion ~74, matching ~51, longform 45, capped β€250/type to prevent any type dominating).
The 3-iteration NOT GIVEN fix (a lesson in how generators default to templating):
| Version | NG-statement quality | "exact-number" pattern | Verdict |
|---|---|---|---|
| v3 | absurd ("β¦tourists on the first Tuesday") | 40% | rejected |
| v4 | topic-connected but meta-templated (reused 14β31Γ) | 8% | rejected |
| v5 | natural, varied, on-topic, 773/800 distinct skeletons | 0% | accepted |
Absurd "exact-number" NG statements in training data (lower = better):
v3 ββββββββββββββββββββ 40% v4 ββββ 8% v5 β 0% β SHIPPED
Lesson: a generation spec must explicitly ban skeleton families and give natural examples β "vary the form" alone yields one template per category.
6. Training configuration (exact)
Method: LoRA (PyTorch + transformers + peft), full bf16, base loaded via AutoModelForCausalLM (Qwen3.5 text path), enable_thinking=False, loss masked to the assistant completion only (prompt tokens = β100).
| Hyperparameter | Value |
|---|---|
| LoRA rank / alpha / dropout | r=16 / Ξ±=32 / 0.05, bias=none |
| Target modules | q,k,v,o,gate,up,down_proj (all-linear) |
| Trainable params | ~21 M (β0.5% of the 4B) |
| Batch / grad-accum | per-device 1 Γ accum 8 = effective 8 |
| Eval batch | 1 (the 152k-vocab logits OOM at the default 8) |
| Precision / checkpointing | bf16 Β· gradient checkpointing (use_reentrant=False) |
| Warmup | ratio 0.03 |
| Eval/save | every 100 steps Β· load best (val-min) checkpoint (eval_loss) |
| v10 recipe (FAILED) | 3 epochs, lr 2e-4, seq 3072 β overfit by step 300 |
| v5 recipe (the fix) | 2 epochs, lr 1e-4, seq 2048 (lighter β the key change) |
Per-model runs (v5, 1,438 train, effective batch 8 β ~360 optimiser steps): 2B β 45 min (A10G), 4B β 1.5 h (A10G), 9B β 2.5 h (L40S). Pick the validation-minimum checkpoint; no extra epochs.
7. The v10 failure β the lesson that drove everything after
Fine-tuned Qwen3.5-4B on ~1,534 imbalanced examples (3 epochs, lr 2e-4). It got worse: verdict accuracy dropped and its NOT GIVEN recall collapsed to near-zero β it learned to "always commit" β even though the base had no such weakness. Root cause = data, not the model: NOT GIVEN was only ~2% of the verdict set, and high-LR over-training over-wrote the base's reasoning. Two lessons: (a) balance every verdict label to ~β ; (b) train light (1β2 epochs, low LR) to avoid catastrophic forgetting.
8. Evaluation methodology
- Verdict judging: 101 held-out real curated items (
wide_gold_big.json; plus a 62-item subsetwide_gold.jsonas cross-check), greedy/deterministic (do_sample=False) so there's no sampling noise, identical examiner prompt + parser for every model. NG-balanced (14% NG). The 62 and 101 sets agree within ~2 pts. (An earlier temp-0.3 sampled pass was noisier β we report greedy as canonical.) - Generation:
gen_samplesat k=3 across all 14 types = 42 items/model (126 total). Graded on objective checks (structure present, completion answers verbatim-in-passage, MCQ option completeness + answer-letter spread, label distribution, Chinese-leak, truncation, passage length) plus hand-checked facts on a read of every verdict item.
9. Results β verdict judging (greedy, 101-gold, canonical)
| base | fine-tuned | Ξ | |
|---|---|---|---|
| 2B | 40% (NG 11/14) | 80% (NG 12/14) | +40 |
| 4B | 73% (NG 12/14) | 74% (NG 8/14) | flat |
| 9B | 79% (NG 11/14) | 77% (NG 11/14) | flat |
The two-regime finding (the answer to "data or size?"):
- Fine-tuning's reasoning benefit is inversely proportional to base capability: it teaches the skill to a model that lacks it (2B, +40) and adds nothing to one that already has it (4B/9B) β and can mildly miscalibrate a well-calibrated base (the 4B loses NG because training pushes "commit more").
- On verdict accuracy (held-out), the fine-tuned 2B (80%) matched and edged the 9B (77β79%) β a 1β3 point edge on one skill. Data can rival size β we keep the claim exactly that size.
10. Results β generation quality, per type Γ per model (the critic-proof part)
Structure, Chinese-leak (0 / 126), and verdict-label spread are clean across all 14 types and all 3 models. The measurable differentiators a critic will probe:
| Metric | 2B | 4B | 9B |
|---|---|---|---|
| Completion answers verbatim-in-passage | β οΈ 37% | β 100% | β 100% |
| MCQ answer-letter spread | A1/B4/C1 | A1/B4/C1 | β B 7/7 (all-B) |
| Facts in from-scratch passages | fabricates (worst) | mixed | β best (accurate) |
| Verdict-logic errors (generated keys) | ~8% | ~8% | ~8% |
Per-type quality (manual read, facts checked):
| Section | Type | Quality | Pain point (stage) β fix |
|---|---|---|---|
| Reading | MCQ | β Strong | v3 fabricated a name/date β v5 accurate (Ballard 1977/GalΓ‘pagos); fix answer-position |
| Reading | Sentence / Summary completion | β Strong (4B/9B) | v3 invented a fake organ β v5 accurate (von Frisch); answers verbatim β 2B weaker (37%) |
| Reading | Matching (headings / features) | β Strong | distinct paragraphs; accurate entities (Wright 1903, BlΓ©riot 1909) |
| Reading | Longform | β Good | coherent multi-para passage + mixed Qs |
| Reading | TFNG / YNNG (verdicts) | β οΈ Hardest | NG collapse (v10) β absurd NG (v3) β fixed v5; residual: absurd contrast statements + ~8% logic errors β grounding + verification |
| Writing | Task 1 / Task 2 | β Strong | authentic tasks (+ auto chart data for T1), word limit + timing |
| Speaking | Part 1 / 2 / 3 | β Strong | proper cue cards, fluent band-appropriate model answers |
| Listening | Transcript + Qs | β Strong | v3 numeric garble β v5 clean dialogue + key |
| All | β | β | v10 Chinese leak ~7% + verbosity β v5: 0 leak, controlled length |
What manual reading caught that metrics missed (why we read every item): absurd contrast statements β to manufacture FALSE/NG the model writes a bizarre line into the passage and tests it (4B "β¦rather than built to generate electricity"; 9B "the horse is not a plant species", "β¦rather than the Amazon rainforest"). Plus verdict-logic errors and the fabricated facts above. Bigger model β better facts, NOT better verdict logic. (Error rate is difficulty-dependent: ~8% on an easy single-topic sample, but ~25% on varied real passages β the realistic figure; see Β§12's measured pipeline, ~75% raw β ~85β90% with verification.)
Per-model one-liners (honest):
- 2B β best verdict judge (80%), fluent Writing/Speaking, but weak completion (37%) and fabricates facts. Strong where answers are constructed, weak where they must be pulled verbatim or be true.
- 4B β most balanced: 100% completion, decent facts, no extreme defect.
- 9B β best facts, 100% completion, but worst MCQ answer-position (all-B).
Implication: the fine-tune is a strong drafter, not a finished product β which is exactly the case for Β§13's architecture.
11. What training changes vs base, and what size changes
- Training's universal win is FORMAT β consistent IELTS structure, self-containment (no scaffolding), natural NG questions β at every size.
- Training's reasoning win is regime-dependent β large for a weak base (2B), nil for a capable one (4B/9B).
- Training does NOT fix content correctness β absurd contrasts, fabrication, ~8% logic errors survive training at every size.
- Size buys facts/world-knowledge (9B best) but not better verdict logic and not better judging once fine-tuned (ft 2B 80% > ft 9B 77%).
- Therefore the efficient correct product = small model + good data (format & verdict skill) + grounding (facts) + verification (residual logic errors). Size is the least important lever for this task.
12. Verification & re-checking β and the trained-vs-untrained difference
A verification pass = an independent judge re-checks each generated answer key and flags disagreements for regeneration or human review. Three findings, and the first is a genuine win:
(a) The verifier's capability is the lever β and training a small model pays off twice. A verifier's accuracy is that model's judging score, so it's exactly our greedy matrix β and the trained-vs-untrained gap is stark:
| Verifier model | Untrained (base) | Trained (fine-tuned) |
|---|---|---|
| 2B | 40% (poor checker) | 80% (good checker) β +40 |
| 4B | 73% | 74% |
| 9B | 79% | 77% |
| A trained 2B β or any 4B+ β catches ~75β80% of verdict errors, a strong automated filter. Fine-tuning a small model doesn't only make a better generator; it makes a far better verifier (the 2B goes 40 β 80 as a checker). The 4B/9B were already good verifiers untrained. |
(b) Voting on one model is a small lever (+1%). Majority-vote (ask 3Γ): 72% β 73%. Errors are systematic, not random, so re-asking the same model barely helps β the capability of the verifier matters far more than the number of votes.
(c) On grounded content the combined system is strongest. When generation runs against a real passage, facts are correct by construction; our grounding check (fuzzy quote-matching) flagged fabrication at ~0% on grounded items, and the re-judge then catches residual logic errors. Grounding + a capable (ideally trained-small) verifier + light review is the reliable stack β a measured ~75β80% catch that, on grounded input, gets you to dependable output.
(d) Measured end-to-end β the honest headline number. We ran the full pipeline on a held-out set: generate grounded verdict items against real passages, then re-check each with an independent verifier. Raw grounded generation β 72β75% correct (a graded sample; errors are mostly NOT GIVENβFALSE confusion on unfamiliar passages). Flagging the 28% gen-vs-verifier disagreements and regenerating/reviewing them lifts the accepted output to β 85β90%. So the product figure is **75% raw β ~85β90% with grounding + verification + light review** β measured, not asserted, and never 100%.
Bottom line: verification works β it's one of the things we got right β provided you (i) use a capable verifier (a trained 2B suffices, which is cheap), and (ii) feed it grounded content. The honest ceiling is "strong filter," not "magic fixer."
13. The architecture that actually works
Grounded generation + verification + light review + orchestration:
- Generate against real passages β eliminates fabrication and absurd contrasts; fixes length/authenticity (facts come from the source, not the parameters).
- Verification pass re-judges each answer key β flags disagreements.
- Regenerate / human-review the flagged minority.
- Orchestrate full sections β one passage, then each question type against it, assembled β for real-exam-style output.
The fine-tuned model earns its keep as the self-contained generator; reliability comes from the system, not a single model.
14. Deployment / GGUF status (honest)
GGUF (for LM Studio / Ollama) was attempted on the server and hit a real toolchain wall β Qwen3.5 is a vision-language arch: the LoRA merge trips a torchvision::nms op, and convert_hf_to_gguf.py (transformers 5.x) errored before architecture detection. It IS achievable (lmstudio-community/Qwen3.5-*-GGUF exists), but needs a verified toolchain β we will not ship a subtly-broken GGUF into a public demo. Adapters are safe (checkpoints/ielts-v5-{2b,4b,9b}-cuda). Until verified, the deployable path is HF/transformers; the LM Studio CTA stays dark.
15. Anticipated objections (pre-answered)
- "You only fixed/tested NOT GIVEN." Β§10 grades all 14 types; NG is a sub-issue of the hardest 2 of 14. The other 12 are strong.
- "Eval is tiny/cherry-picked." 101 + 62 held-out real items, greedy, both agree; it's the harder old-curated set, not one tuned to flatter.
- "Facts are made up." True for from-scratch passages, and size-dependent β which is why production grounds on real passages.
- "Answer keys are wrong." Completion 100% verbatim (4B/9B); verdict ~8% logic-error rate (disclosed, caught by verification); MCQ defect is position, not correctness.
- "All your MCQ answers are B." Correct (9B 7/7) β an answer-position bug, fixed at serving or shown only pre-checked.
- "2B-beats-9B is noise." Greedy/deterministic, holds on both gold sets; and we don't overclaim β it's a 1β3 pt edge on one skill, and the 2B is weaker elsewhere.
- "Fine-tuning made the 4B worse." Overall flat (73β74); it trades a little NG calibration. On a capable base the value is format/self-containment, not reasoning.
- "It's not a foundation model." Agreed β and we say so everywhere; these are fine-tunes of open base models.
16. Caveats
- The headline metric is a judging proxy on a held-out gold with mild label noise; the product generates, so judging is comparable but imperfect.
- "Base already does NG well" is on Qwen3.5; it may not hold for other base families.
- Generation grades are from k=3 (modest n); the defects reported are consistent across the three samples, but the exact percentages would tighten at higher k.
- The dataset is private; this documents volumes and method only.
Appendix β repro pointers
- Data:
data_v5/{train(1438),valid(130)}.jsonl(600 Codex verdicts + curated non-verdict, capped β€250/type) Β· assembled byscripts/assemble_v5.py. - Train:
cloud/train_lora.py --model Qwen/Qwen3.5-{2B,4B,9B} --data ../data_v5 --epochs 2 --lr 1e-4 --max-seq 2048(LoRA r16/Ξ±32, completion-only, thinking-off, bf16, val-min checkpoint). - Eval:
cloud/eval_greedy.py --gold wide_gold_big.json(greedy verdict judging) Β·cloud/gen_samples.py --k 3(generation). - Adapters:
checkpoints/ielts-v5-{2b,4b,9b}-cuda/best-adapter. - Hardware: g5.2xlarge (A10G 24 GB) for 2B/4B; g6e.2xlarge (L40S 48 GB) for 9B; M4 Pro (24 GB, MLX) for local eval.


