Building a Specialised IELTS Content Model — Complete Technical Report

PrepareBuddy · end-to-end write-up (what, when, how) · 2026-06 A fine-tune of open Apache-2.0 base models — not a foundation model built from scratch. The training dataset is private; this documents the method, hardware, configs, data volumes, and results in full.

TL;DR (the honest conclusion)

We built models that generate exam-authentic IELTS content across all four sections — Reading (TFNG, YNNG, MCQ, sentence/summary completion, matching, longform), Writing (Task 1 & 2), Speaking (Parts 1–3), Listening — passages/transcripts + questions + answer keys/model answers. After a multi-stage data rebuild and a 2B/4B/9B size study:

Most of the system is already strong. 12 of 14 item types generate at a high bar with accurate facts and authentic format. The two verdict types (TFNG/YNNG) are the hard part — they require constructing false/unstated statements — and consumed most of the engineering. A hard 2 of 14, not the whole project.
Data or size? Both, in two regimes. The defensible finding: fine-tuning's benefit is inversely proportional to base capability. Balanced data transformed a 2B (+40 points, 40→80% on verdicts) and was essentially flat on the already-capable 4B (73→74) and 9B (79→77). On that one skill — verdict accuracy, on a held-out set — the fine-tuned 2B matched and edged a model 4.5× its size (a 1–3 point margin), while staying weaker on facts and completion. Data can rival size on a target skill — within limits.
For a capable base, fine-tuning's value is generation (format & self-containment), not reasoning.
The reliable product = grounded generation + a verification pass — not a single magic model. Measured end-to-end: ~75% raw → ~85–90% with grounding + verification + light review (§12).

What we built and proved (the achievements, stated precisely)

A full-section IELTS generator — all 14 item types across Reading / Writing / Speaking / Listening, with 12 of 14 already strong (accurate facts, authentic format, fluent model answers).
A rigorous size study (2B / 4B / 9B, identical data, deterministic eval) that actually answers "data or size?" — fine-tuning a 2B reached 80% on verdicts, rivaling a 9B 4.5× its size.
A 3-iteration data fix that drove absurd NOT-GIVEN questions from 40% → 0% (773/800 distinct phrasings) — verdict generation went from gibberish to natural.
A working verification / re-checking system — a trained 2B catches ~80% of verdict errors (vs 40% untrained); paired with grounding, it makes the output reliable. Training a small model pays off twice — better generator AND better verifier (see §12).
A reproducible cloud training + evaluation harness — LoRA on Qwen3.5 (exact config in §6), a deterministic held-out gold judge, and a per-type generation grader.
Three open models (2B / 4B / 9B) released for transparency — "don't take our word, check our math."
Everything measured and published — every weakness named with its fix. The honesty is the product.

1. Journey at a glance

flowchart LR
    A["Phase 1<br/>SmolLM2-1.7B<br/>30 examples<br/>format only"] --> B["Phase 2-3<br/>SmolLM3-3B<br/>~1,655 ex<br/>shipped 3B (live)"]
    B --> C{"Base bake-off<br/>Apache-2.0 only"}
    C --> D["Qwen3.5-4B<br/>probe 92% vs 3B 75%<br/>(thinking HURT)"]
    D --> E["Local MLX train<br/>KERNEL PANIC x3"]
    E --> F["Move to AWS<br/>PyTorch + peft"]
    F --> G["v10 fine-tune<br/>FAILED<br/>2% NG → NG 0/7"]
    G --> H["Data rebuild<br/>600 balanced verdicts<br/>NG fix v3→v4→v5"]
    H --> I["v5 study 2B/4B/9B<br/>DATA CAN RIVAL SIZE<br/>ft 2B = 80%"]
    style G fill:#ffd0d0
    style I fill:#d0ffd0

One-chart takeaway — verdict accuracy, base vs fine-tuned (greedy, 101-item held-out gold):

            base                    fine-tuned
  2B   ████████░░░░░░░░ 40%   →   ████████████████ 80%   (+40  ← data teaches the skill)
  4B   ██████████████░░ 73%   →   ███████████████░ 74%   (flat ← base already capable)
  9B   ████████████████ 79%   →   ███████████████░ 77%   (flat)
       ▲ fine-tuned 2B (80%) rivals/edges base & ft 9B (77-79%) on verdict accuracy —
         a 1-3pt edge on ONE skill (the 2B is weaker on facts & completion; see §10)

2. Goal & constraints

Goal: a self-contained generator of IELTS Academic content across all four sections, with correct answer keys.
Hard constraints: Apache-2.0 base models only (commercial use + redistribution); no copyrighted IELTS prep content (Cambridge etc.); runs locally on Apple Silicon (MLX) or a small cloud GPU; deployable as GGUF (Ollama / LM Studio). A fine-tune, never a foundation model.

3. Hardware & infrastructure

Stage	Machine	GPU / RAM	Role	Notes
Phase 1–3, all inference/eval	MacBook Pro M4 Pro	24 GB unified, MLX	local LoRA (small models) + all eval	training the larger model kernel-panicked macOS (GPU-memory-churn driver bug, `IOGPUGroupMemory.cpp:528`); inference was always safe
4B + 2B training	AWS g5.2xlarge	A10G 24 GB, 8 vCPU	bf16 LoRA fits at seq 2048	on-demand ~$1.21/hr, us-east-1
9B training	AWS g6e.2xlarge (spot)	L40S 48 GB, 8 vCPU	9B bf16 LoRA needs >24 GB	spot ~$1.0–1.5/hr, us-east-1

Cloud stack: Deep Learning OSS Nvidia AMI (Ubuntu 22.04), PyTorch 2.7.0+cu128, transformers 5.10, peft 0.19, full bf16 (no 4-bit). HF token + hf_transfer for fast model pulls.
Why two box types: A10G (24 GB) holds the 4B/2B in bf16 at seq 2048; the 9B's bf16 weights (~~18 GB) + activations exceed 24 GB, so it needs the L40S (48 GB). Total study cost: a few GPU-hours (~~$5–7).
Inference/eval stayed local (MLX) for reproducibility and zero cost.

4. Choosing the base model — the Hugging Face bake-off

Apache-2.0 was a hard gate that removed strong candidates. We climbed the size/capability ladder:

Base model	Params	License	Why considered	Shortcoming found	Outcome
SmolLM2-360M	0.36B	Apache-2.0	tiny, instant	learns structure tokens, not composition/logic	❌ too small
SmolLM2-1.7B	1.7B	Apache-2.0	small, runs locally	format ✓, answer-logic weak at 30 ex	✅ Phase-1 proof
SmolLM3-3B	3B	Apache-2.0	next rung	working generator — released openly on Hugging Face (still live) — but verdict reasoning plateaued ~75%	✅ shipped 3B
Gemma 1 / 2 / 3	—	restrictive	strong	licence terms didn't fit our open-redistribution plans	❌ excluded
Gemma 4	~4B eff.	newer	newest, strong	novel Per-Layer-Embedding arch → LoRA/GGUF/deploy risk	⚠️ not adopted
Qwen3-4B	4B	Apache-2.0	strong benchmarks	superseded by 3.5 mid-project	interim
Qwen3.5-4B	4B	Apache-2.0	best at its size, std Llama-style arch	"thinking" mode hurt verdicts → train no-think	✅ CHOSEN

Why we moved off the 3B: the openly-released SmolLM3-3B works (it's still live), but its verdict reasoning plateaued ~75%, and we couldn't tell if that was a size limit or a data limit. That question launched the Qwen3.5 study.

The deciding selection probe (24 hard verdicts):

  SmolLM3-3B            ███████████████░░░░░ 75%
  Qwen3.5-4B (no-think) ██████████████████▌  92%   ← winner, thinking OFF

Counter-intuitive: turning on Qwen3.5's reasoning mode lowered verdict accuracy → we train/serve with enable_thinking=False, no thinking data. (The 92% is the easy 24-item selection probe; the 73–80% later are the harder 101-item final gold — same models, tougher set, never conflated.)

5. The data — format, volumes, composition

Format: JSONL, one {"messages":[{system},{user},{assistant}]} per line, UTF-8 English. User prompt schema: <TEST=IELTS><SECTION=…><TYPE=…><DIFF=…><TOPIC=…> <instruction>; assistant = the full item (PASSAGE/QUESTIONS/ANSWER KEY, or TASK/MODEL ANSWER).

Provenance: in-house curated content + Claude/Codex-generated items, every batch QA'd (grounding, balance, diversity, no fabrication). No copyrighted prep material.

Volumes at each stage:

Stage	Base	Train / Valid	Total	What
Phase 1	SmolLM2-1.7B	24 / 6	30	format proof
Phase 2–3	3B line	1,534 / 121	~1,655 (`data_phase3_v7`)	the shipped 3B generator
v10 (failed)	Qwen3.5-4B	~1,534	~1,534	imbalanced (~2% NG)
v5 (final)	Qwen3.5 2B/4B/9B	1,438 / 130	1,568 (`data_v5`)	balanced verdicts + curated non-verdict

v5 composition (1,568 rows): 600 balanced verdict items (TFNG 300 + YNNG 300; 800 NG statements; ~33% NOT GIVEN, label-balanced 400/400/400) + 968 curated non-verdict items (MCQ 250, Writing 250, Speaking 127, Listening 141, completion ~74, matching ~51, longform 45, capped ≤250/type to prevent any type dominating).

The 3-iteration NOT GIVEN fix (a lesson in how generators default to templating):

Version	NG-statement quality	"exact-number" pattern	Verdict
v3	absurd ("…tourists on the first Tuesday")	40%	rejected
v4	topic-connected but meta-templated (reused 14–31×)	8%	rejected
v5	natural, varied, on-topic, 773/800 distinct skeletons	0%	accepted

Absurd "exact-number" NG statements in training data (lower = better):
  v3  ████████████████████ 40%   v4  ████ 8%   v5  ▏ 0%  ✓ SHIPPED

Lesson: a generation spec must explicitly ban skeleton families and give natural examples — "vary the form" alone yields one template per category.

6. Training configuration (exact)

Method: LoRA (PyTorch + transformers + peft), full bf16, base loaded via AutoModelForCausalLM (Qwen3.5 text path), enable_thinking=False, loss masked to the assistant completion only (prompt tokens = −100).

Hyperparameter	Value
LoRA rank / alpha / dropout	r=16 / α=32 / 0.05, bias=none
Target modules	`q,k,v,o,gate,up,down_proj` (all-linear)
Trainable params	~21 M (≈0.5% of the 4B)
Batch / grad-accum	per-device 1 × accum 8 = effective 8
Eval batch	1 (the 152k-vocab logits OOM at the default 8)
Precision / checkpointing	bf16 · gradient checkpointing (`use_reentrant=False`)
Warmup	ratio 0.03
Eval/save	every 100 steps · load best (val-min) checkpoint (`eval_loss`)
v10 recipe (FAILED)	3 epochs, lr 2e-4, seq 3072 → overfit by step 300
v5 recipe (the fix)	2 epochs, lr 1e-4, seq 2048 (lighter — the key change)

Per-model runs (v5, 1,438 train, effective batch 8 → ~360 optimiser steps): 2B ≈ 45 min (A10G), 4B ≈ 1.5 h (A10G), 9B ≈ 2.5 h (L40S). Pick the validation-minimum checkpoint; no extra epochs.

7. The v10 failure — the lesson that drove everything after

Fine-tuned Qwen3.5-4B on ~1,534 imbalanced examples (3 epochs, lr 2e-4). It got worse: verdict accuracy dropped and its NOT GIVEN recall collapsed to near-zero — it learned to "always commit" — even though the base had no such weakness. Root cause = data, not the model: NOT GIVEN was only ~2% of the verdict set, and high-LR over-training over-wrote the base's reasoning. Two lessons: (a) balance every verdict label to ~⅓; (b) train light (1–2 epochs, low LR) to avoid catastrophic forgetting.

8. Evaluation methodology

Verdict judging: 101 held-out real curated items (wide_gold_big.json; plus a 62-item subset wide_gold.json as cross-check), greedy/deterministic (do_sample=False) so there's no sampling noise, identical examiner prompt + parser for every model. NG-balanced (14% NG). The 62 and 101 sets agree within ~2 pts. (An earlier temp-0.3 sampled pass was noisier — we report greedy as canonical.)
Generation: gen_samples at k=3 across all 14 types = 42 items/model (126 total). Graded on objective checks (structure present, completion answers verbatim-in-passage, MCQ option completeness + answer-letter spread, label distribution, Chinese-leak, truncation, passage length) plus hand-checked facts on a read of every verdict item.

9. Results — verdict judging (greedy, 101-gold, canonical)

	base	fine-tuned	Δ
2B	40% (NG 11/14)	80% (NG 12/14)	+40
4B	73% (NG 12/14)	74% (NG 8/14)	flat
9B	79% (NG 11/14)	77% (NG 11/14)	flat

The two-regime finding (the answer to "data or size?"):

Fine-tuning's reasoning benefit is inversely proportional to base capability: it teaches the skill to a model that lacks it (2B, +40) and adds nothing to one that already has it (4B/9B) — and can mildly miscalibrate a well-calibrated base (the 4B loses NG because training pushes "commit more").
On verdict accuracy (held-out), the fine-tuned 2B (80%) matched and edged the 9B (77–79%) — a 1–3 point edge on one skill. Data can rival size — we keep the claim exactly that size.

10. Results — generation quality, per type × per model (the critic-proof part)

Structure, Chinese-leak (0 / 126), and verdict-label spread are clean across all 14 types and all 3 models. The measurable differentiators a critic will probe:

Metric	2B	4B	9B
Completion answers verbatim-in-passage	⚠️ 37%	✅ 100%	✅ 100%
MCQ answer-letter spread	A1/B4/C1	A1/B4/C1	❌ B 7/7 (all-B)
Facts in from-scratch passages	fabricates (worst)	mixed	✅ best (accurate)
Verdict-logic errors (generated keys)	~8%	~8%	~8%

Per-type quality (manual read, facts checked):

Section	Type	Quality	Pain point (stage) → fix
Reading	MCQ	✅ Strong	v3 fabricated a name/date → v5 accurate (Ballard 1977/Galápagos); fix answer-position
Reading	Sentence / Summary completion	✅ Strong (4B/9B)	v3 invented a fake organ → v5 accurate (von Frisch); answers verbatim — 2B weaker (37%)
Reading	Matching (headings / features)	✅ Strong	distinct paragraphs; accurate entities (Wright 1903, Blériot 1909)
Reading	Longform	✅ Good	coherent multi-para passage + mixed Qs
Reading	TFNG / YNNG (verdicts)	⚠️ Hardest	NG collapse (v10) → absurd NG (v3) → fixed v5; residual: absurd contrast statements + ~8% logic errors → grounding + verification
Writing	Task 1 / Task 2	✅ Strong	authentic tasks (+ auto chart data for T1), word limit + timing
Speaking	Part 1 / 2 / 3	✅ Strong	proper cue cards, fluent band-appropriate model answers
Listening	Transcript + Qs	✅ Strong	v3 numeric garble → v5 clean dialogue + key
All	—	—	v10 Chinese leak ~7% + verbosity → v5: 0 leak, controlled length

What manual reading caught that metrics missed (why we read every item): absurd contrast statements — to manufacture FALSE/NG the model writes a bizarre line into the passage and tests it (4B "…rather than built to generate electricity"; 9B "the horse is not a plant species", "…rather than the Amazon rainforest"). Plus verdict-logic errors and the fabricated facts above. Bigger model → better facts, NOT better verdict logic. (Error rate is difficulty-dependent: ~8% on an easy single-topic sample, but ~25% on varied real passages — the realistic figure; see §12's measured pipeline, ~75% raw → ~85–90% with verification.)

Per-model one-liners (honest):

2B — best verdict judge (80%), fluent Writing/Speaking, but weak completion (37%) and fabricates facts. Strong where answers are constructed, weak where they must be pulled verbatim or be true.
4B — most balanced: 100% completion, decent facts, no extreme defect.
9B — best facts, 100% completion, but worst MCQ answer-position (all-B).

Implication: the fine-tune is a strong drafter, not a finished product — which is exactly the case for §13's architecture.

11. What training changes vs base, and what size changes

Training's universal win is FORMAT — consistent IELTS structure, self-containment (no scaffolding), natural NG questions — at every size.
Training's reasoning win is regime-dependent — large for a weak base (2B), nil for a capable one (4B/9B).
Training does NOT fix content correctness — absurd contrasts, fabrication, ~8% logic errors survive training at every size.
Size buys facts/world-knowledge (9B best) but not better verdict logic and not better judging once fine-tuned (ft 2B 80% > ft 9B 77%).
Therefore the efficient correct product = small model + good data (format & verdict skill) + grounding (facts) + verification (residual logic errors). Size is the least important lever for this task.

12. Verification & re-checking — and the trained-vs-untrained difference

A verification pass = an independent judge re-checks each generated answer key and flags disagreements for regeneration or human review. Three findings, and the first is a genuine win:

(a) The verifier's capability is the lever — and training a small model pays off twice. A verifier's accuracy is that model's judging score, so it's exactly our greedy matrix — and the trained-vs-untrained gap is stark:

Verifier model	Untrained (base)	Trained (fine-tuned)
2B	40% (poor checker)	80% (good checker) — +40
4B	73%	74%
9B	79%	77%
A trained 2B — or any 4B+ — catches ~75–80% of verdict errors, a strong automated filter. Fine-tuning a small model doesn't only make a better generator; it makes a far better verifier (the 2B goes 40 → 80 as a checker). The 4B/9B were already good verifiers untrained.

(b) Voting on one model is a small lever (+1%). Majority-vote (ask 3×): 72% → 73%. Errors are systematic, not random, so re-asking the same model barely helps — the capability of the verifier matters far more than the number of votes.

(c) On grounded content the combined system is strongest. When generation runs against a real passage, facts are correct by construction; our grounding check (fuzzy quote-matching) flagged fabrication at ~0% on grounded items, and the re-judge then catches residual logic errors. Grounding + a capable (ideally trained-small) verifier + light review is the reliable stack — a measured ~75–80% catch that, on grounded input, gets you to dependable output.

(d) Measured end-to-end — the honest headline number. We ran the full pipeline on a held-out set: generate grounded verdict items against real passages, then re-check each with an independent verifier. Raw grounded generation ≈ 72–75% correct (a graded sample; errors are mostly NOT GIVEN↔FALSE confusion on unfamiliar passages). Flagging the 28% gen-vs-verifier disagreements and regenerating/reviewing them lifts the accepted output to ≈ 85–90%. So the product figure is **75% raw → ~85–90% with grounding + verification + light review** — measured, not asserted, and never 100%.

Bottom line: verification works — it's one of the things we got right — provided you (i) use a capable verifier (a trained 2B suffices, which is cheap), and (ii) feed it grounded content. The honest ceiling is "strong filter," not "magic fixer."

13. The architecture that actually works

Grounded generation + verification + light review + orchestration:

Generate against real passages → eliminates fabrication and absurd contrasts; fixes length/authenticity (facts come from the source, not the parameters).
Verification pass re-judges each answer key → flags disagreements.
Regenerate / human-review the flagged minority.
Orchestrate full sections — one passage, then each question type against it, assembled — for real-exam-style output.

The fine-tuned model earns its keep as the self-contained generator; reliability comes from the system, not a single model.

14. Deployment / GGUF status (honest)

GGUF (for LM Studio / Ollama) was attempted on the server and hit a real toolchain wall — Qwen3.5 is a vision-language arch: the LoRA merge trips a torchvision::nms op, and convert_hf_to_gguf.py (transformers 5.x) errored before architecture detection. It IS achievable (lmstudio-community/Qwen3.5-*-GGUF exists), but needs a verified toolchain — we will not ship a subtly-broken GGUF into a public demo. Adapters are safe (checkpoints/ielts-v5-{2b,4b,9b}-cuda). Until verified, the deployable path is HF/transformers; the LM Studio CTA stays dark.

15. Anticipated objections (pre-answered)

"You only fixed/tested NOT GIVEN." §10 grades all 14 types; NG is a sub-issue of the hardest 2 of 14. The other 12 are strong.
"Eval is tiny/cherry-picked." 101 + 62 held-out real items, greedy, both agree; it's the harder old-curated set, not one tuned to flatter.
"Facts are made up." True for from-scratch passages, and size-dependent — which is why production grounds on real passages.
"Answer keys are wrong." Completion 100% verbatim (4B/9B); verdict ~8% logic-error rate (disclosed, caught by verification); MCQ defect is position, not correctness.
"All your MCQ answers are B." Correct (9B 7/7) — an answer-position bug, fixed at serving or shown only pre-checked.
"2B-beats-9B is noise." Greedy/deterministic, holds on both gold sets; and we don't overclaim — it's a 1–3 pt edge on one skill, and the 2B is weaker elsewhere.
"Fine-tuning made the 4B worse." Overall flat (73→74); it trades a little NG calibration. On a capable base the value is format/self-containment, not reasoning.
"It's not a foundation model." Agreed — and we say so everywhere; these are fine-tunes of open base models.

16. Caveats

The headline metric is a judging proxy on a held-out gold with mild label noise; the product generates, so judging is comparable but imperfect.
"Base already does NG well" is on Qwen3.5; it may not hold for other base families.
Generation grades are from k=3 (modest n); the defects reported are consistent across the three samples, but the exact percentages would tighten at higher k.
The dataset is private; this documents volumes and method only.

Appendix — repro pointers

Data: data_v5/{train(1438),valid(130)}.jsonl (600 Codex verdicts + curated non-verdict, capped ≤250/type) · assembled by scripts/assemble_v5.py.
Train: cloud/train_lora.py --model Qwen/Qwen3.5-{2B,4B,9B} --data ../data_v5 --epochs 2 --lr 1e-4 --max-seq 2048 (LoRA r16/α32, completion-only, thinking-off, bf16, val-min checkpoint).
Eval: cloud/eval_greedy.py --gold wide_gold_big.json (greedy verdict judging) · cloud/gen_samples.py --k 3 (generation).
Adapters: checkpoints/ielts-v5-{2b,4b,9b}-cuda/best-adapter.
Hardware: g5.2xlarge (A10G 24 GB) for 2B/4B; g6e.2xlarge (L40S 48 GB) for 9B; M4 Pro (24 GB, MLX) for local eval.

Downloads last month: -; Downloads are not tracked for this model. How to track

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support