cloze-reader-qwen3.5-0.8b-lora
A rank-16 LoRA adapter over Qwen/Qwen3.5-0.8B that replaces a ~33× larger teacher (Gemma-3-27B-IT) as the language-model backend for the Cloze Reader reading-comprehension app (source).
On a 200-example held-out test set, the fine-tuned 0.8B student matches or exceeds the 27B teacher on 4 of 5 metric categories, with perfect format-constraint compliance on two tasks. The full model at bf16 is ~1.7 GB; at Q4_K_M it runs in ~2.5–3 GB of VRAM.
TL;DR
| Metric | This LoRA over Qwen3.5-0.8B | Baseline Gemma-3-27B-IT |
|---|---|---|
| Word Selection — JSON valid | 100.0% | 94.0% |
| Batch Selection — all metrics | 100.0% | 96.0% |
| Hints — word safety | 98.0% | 100.0% |
| Contextualization — all metrics | 100.0% | 96.0–98.0% |
| Params (total) | 0.8B + 25 MB adapter | 27B |
| VRAM at Q4_K_M | ~2.5–3 GB | ~16 GB |
What it does — the four tasks
The Cloze Reader app uses a single LM for four tightly-constrained text-generation tasks. The adapter was trained on examples of all four and produces the exact JSON / plaintext shapes the front-end expects.
- Word Selection. Given a passage, pick 1–3 vocabulary words to blank. Output is a JSON array such as ["laboratory", "synthesis"]. Constraints: lowercase only, 4–14 letters, must appear verbatim in the passage, no proper nouns.
- Batch Word Selection. Same task across two passages at once. Output is a JSON object with per-passage word lists plus supporting context.
- Contextual Hints. Given a blanked word and its sentence, return a 15–25-word Socratic hint that points at part of speech, sentence role, or semantic category without revealing the target word.
- Literary Contextualization. One-sentence insight about a passage (≤25 words, no em-dashes, no verbose preamble like "This passage is about…").
Exact prompt templates match aiService.js / conversationManager.js in the cloze-reader repo.
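For illustration, the Word Selection contract can be checked with a small validator like the one below (a minimal sketch; the function name and regex are illustrative, not the app's actual filter):
import json
import re
def validate_word_selection(raw_output: str, passage: str) -> bool:
    # Parse the model output and enforce the constraints listed above:
    # JSON array of 1-3 words, lowercase only, 4-14 letters, present in the passage.
    try:
        words = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(words, list) or not 1 <= len(words) <= 3:
        return False
    passage_lower = passage.lower()
    return all(
        isinstance(w, str)
        and re.fullmatch(r"[a-z]{4,14}", w)
        and w in passage_lower
        for w in words
    )
print(validate_word_selection('["laboratory", "synthesis"]',
                              'The laboratory synthesis of indigo...'))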
How to use
With peft + transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "milwright/cloze-reader-qwen3.5-0.8b-lora"
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
messages = [
{"role": "system", "content": "Select words for a cloze exercise. Return ONLY a JSON array of words, nothing else."},
{"role": "user", "content": "Select 1 challenging words (4-14 letters) from this passage.\n\nPassage: \"...\""},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
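If you prefer to serve plain transformers weights, the adapter can be folded into the base model with PEFT's standard merge call (the output directory name here is arbitrary):
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("cloze-reader-merged")
tokenizer.save_pretrained("cloze-reader-merged")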
With Unsloth (4-bit, fast)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"milwright/cloze-reader-qwen3.5-0.8b-lora",
max_seq_length=2048,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
With Ollama (via GGUF)
A merged GGUF build of this adapter fits in ~2.5–3 GB of VRAM at Q4_K_M and runs with:
ollama create cloze-reader -f Modelfile
ollama run cloze-reader
(The GGUF + Modelfile are not bundled here — see the project repo for the export pipeline.)
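For reference, a minimal Modelfile for the commands above might look like this (the GGUF file name and parameter values are illustrative; the repo's export pipeline defines the actual one):
FROM ./cloze-reader-q4_k_m.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 2048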
Training
Data
- Source: 15,981 filtered conversation examples (from 19,341 raw → 82.6% pass rate), distilled from google/gemma-3-27b-it serving the production cloze-reader endpoint.
- Passages: randomly sampled text windows from 40 classic public-domain books via the manu/project_gutenberg corpus (61k books on the Hub).
- Per-task composition (post-filter):
  - Word Selection — 7,433 examples
  - Batch Selection — 6,822 examples
  - Contextual Hints — 902 examples
  - Literary Contextualization — 824 examples
- Format: ShareGPT-style {"conversations": [{"role": ..., "content": ...}, …]}, rendered through Qwen-3 ChatML (<|im_start|>/<|im_end|>).
- Filter gates: JSON parsability, lowercase-only word selection, word-in-passage check, hint safety (no leakage of the answer word), length bounds, em-dash / preamble interdiction.
Procedure
Supervised fine-tuning with TRL + Unsloth's FastLanguageModel, loss masked to assistant turns via train_on_responses_only().
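Under those choices, the training loop looks roughly like the sketch below (hyperparameters are taken from the table that follows; dataset loading and chat-template rendering are omitted, and train_dataset is assumed to be already prepared):
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTConfig, SFTTrainer
# Load the 4-bit base model and attach a rank-16 LoRA on the listed projections
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3.5-0.8B", max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,  # assumed: ShareGPT records rendered through the qwen-3 chat template
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        optim="adamw_8bit",
        bf16=True,
        seed=42,
    ),
)
# Mask the loss to assistant turns only, using the Qwen ChatML role markers
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()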
Hyperparameters
| Setting | Value |
|---|---|
| LoRA rank / alpha / dropout | 16 / 16 / 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Base load precision | 4-bit (bnb, nf4) via load_in_4bit=True |
| Training precision | bfloat16 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Weight decay | 0.01 |
| Per-device batch | 4 |
| Gradient accumulation | 4 |
| Max sequence length | 2048 |
| Epochs | 3 |
| Seed | 42 |
| Final training loss | 0.625 |
| Chat template | qwen-3 |
Compute
- Hardware: Single NVIDIA RTX 5090 Laptop GPU (24 GB VRAM).
- Framework: Unsloth 2026.3.8 · TRL 0.24 · Transformers 5.3 · PEFT 0.18.1 · PyTorch 2.10 / CUDA 12.8.
- Checkpoints saved: every epoch (checkpoint-1311, -2622, -3933).
Evaluation
200 held-out examples (50 per task, seed 42, removed from training data pre-split) scored against the live Gemma-3-27B endpoint as baseline.
Word Selection (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| JSON valid | 100.0% | 94.0% | +6.0 |
| Format OK | 100.0% | 94.0% | +6.0 |
| Words valid | 88.0% | 90.0% | −2.0 |
| Valid ratio | 96.85% | 98.72% | −1.87 |
Batch Selection (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| JSON valid | 100.0% | 96.0% | +4.0 |
| Structure OK | 100.0% | 96.0% | +4.0 |
| Words present | 100.0% | 96.0% | +4.0 |
Contextual Hints (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| Non-empty | 100.0% | 100.0% | — |
| Word safe (no leak) | 98.0% | 100.0% | −2.0 |
| Length OK (15–25w) | 90.0% | 94.0% | −4.0 |
| Mean word count | 23.2 | 20.8 | +2.4 |
Literary Contextualization (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| Non-empty | 100.0% | 98.0% | +2.0 |
| Length OK (≤25w) | 100.0% | 98.0% | +2.0 |
| No em-dashes | 100.0% | 98.0% | +2.0 |
| No preamble | 100.0% | 96.0% | +4.0 |
| Mean word count | 19.0 | 16.9 | +2.1 |
Summary
- Format / JSON compliance: the 0.8B student beats the 27B teacher on every structural metric. This is the usual distillation win — shape constraints fit inside a small parameter budget.
- Content quality: near-parity. Word-selection validity lags by 2 points; hint word-safety lags by 2 points. Neither gap blocks production use in the cloze app.
- Throughput / cost: ~33× fewer parameters, runs locally on a laptop GPU, removes the 27B API dependency.
Full per-metric JSON is at evaluation_results.json in the training project repo.
Intended use
In-scope. Serving the four cloze-reader tasks in the Cloze Reader app or a comparable vocabulary-practice / guided-reading UI, where inputs are short English prose passages (classical or modern) and outputs are JSON arrays / objects or tightly-length-bounded sentences.
Out of scope.
- Open-ended generation, chat, or reasoning — this adapter has only seen 4 narrow instruction templates and will generalize poorly outside them.
- Languages other than English — training data is English-only.
- Safety-critical or factual-lookup tasks — no alignment or factuality work was performed beyond format-filtering.
- Multimodal inputs — although Qwen3.5-0.8B is a vision-language model, this adapter was trained on text conversations only.
Limitations and risks
- Distilled from a single teacher. Failure modes of gemma-3-27b-it on the 4 task prompts are inherited. If the teacher has a blind spot on certain passages (e.g., archaic or dialect text from Gutenberg), the student has the same one.
- Gutenberg domain skew. Passages are drawn from ~40 classic public-domain books. Modern prose, social media, and non-narrative text are under-represented.
- Format compliance is not correctness. "100% JSON valid" means the output parses; it does not guarantee the selected words are the pedagogically best choice. Human review is advised for educational deployment.
- Hint-leakage floor of 2%. 1 in 50 hints referenced or strongly implied the target word in testing. Downstream code should keep a heuristic safety filter in place.
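A minimal example of such a filter (purely illustrative; the app's actual heuristic may differ):
def hint_leaks(hint: str, target: str) -> bool:
    # Reject the hint if it contains the target word or an obvious stem of it.
    hint_l, target_l = hint.lower(), target.lower()
    stem = target_l[: max(4, len(target_l) - 2)]  # crude stemming
    return target_l in hint_l or stem in hint_l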
License
This adapter is released under Apache 2.0, inheriting from the base model Qwen/Qwen3.5-0.8B. Training data includes public-domain passages from Project Gutenberg and AI-generated outputs from google/gemma-3-27b-it; redistribution of the adapter weights themselves carries no Gutenberg restriction, but downstream users should honor Gemma's terms if they redistribute teacher generations separately.
Project context
This is the production LM for Cloze Reader, a reading-comprehension web app for practicing vocabulary through contextual word-blanking. Originally the app called a hosted Gemma-3-27B endpoint; this adapter was trained to bring inference on-device and retire the API dependency.
Developed as part of milwright/quimbot, a broader fine-tuning and evaluation project for small English-language models. See the repo's CLAUDE.md and fine-tuning/ for the larger pipeline.
Citations
Base model — Qwen Team, Qwen3.5-0.8B (2026), huggingface.co/Qwen/Qwen3.5-0.8B.
Teacher (training data) — Google DeepMind, Gemma 3 27B Instruct (2025), huggingface.co/google/gemma-3-27b-it.
Passage corpus — manu/project_gutenberg.
TRL
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
year = 2020,
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/trl}}
}
Unsloth — github.com/unslothai/unsloth.
PEFT — Mangrulkar et al., PEFT: State-of-the-art Parameter-Efficient Fine-Tuning, github.com/huggingface/peft.
Framework versions
- PEFT 0.18.1
- TRL 0.24.0
- Transformers 5.3.0
- PyTorch 2.10.0 + CUDA 12.8
- Unsloth 2026.3.8