cloze-reader-qwen3.5-0.8b-lora

A rank-16 LoRA adapter over Qwen/Qwen3.5-0.8B that replaces a ~33× larger teacher (Gemma-3-27B-IT) as the language-model backend for the Cloze Reader reading-comprehension app.

On a 200-example held-out test set, the fine-tuned 0.8B student matches or exceeds the 27B teacher on 4 of 5 metric categories, with perfect format-constraint compliance on two tasks. The full model at bf16 is ~1.7 GB; at Q4_K_M it runs in ~2.5–3 GB of VRAM.

TL;DR

| | This LoRA over Qwen3.5-0.8B | Baseline Gemma-3-27B-IT |
| --- | --- | --- |
| Word Selection — JSON valid | 100.0% | 94.0% |
| Batch Selection — all metrics | 100.0% | 96.0% |
| Hints — word safety | 98.0% | 100.0% |
| Contextualization — all metrics | 100.0% | 96.0–98.0% |
| Params (total) | 0.8B + 25 MB adapter | 27B |
| VRAM at Q4_K_M | ~2.5–3 GB | ~16 GB |

What it does — the four tasks

The Cloze Reader app uses a single LM for four tightly-constrained text-generation tasks. The adapter was trained on examples of all four and produces the exact JSON / plaintext shapes the front-end expects.

  1. Word Selection. Given a passage, pick 1–3 vocabulary words to blank. Output is a JSON array such as ["laboratory", "synthesis"]. Constraints: lowercase only, 4–14 letters, must appear verbatim in the passage, no proper nouns.
  2. Batch Word Selection. Same task across two passages at once. Output is a JSON object with per-passage word lists plus supporting context.
  3. Contextual Hints. Given a blanked word and its sentence, return a 15–25-word Socratic hint that points at part of speech, sentence role, or semantic category without revealing the target word.
  4. Literary Contextualization. One-sentence insight about a passage (≤25 words, no em-dashes, no verbose preamble like "This passage is about…").

Exact prompt templates match aiService.js / conversationManager.js in the cloze-reader repo.
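The Word Selection constraints above are mechanical enough to re-check client-side. A minimal validator sketch (function name and structure are illustrative, not taken from the app; the proper-noun check is omitted since it needs NLP tooling):

```python
import json

def validate_word_selection(raw: str, passage: str) -> list[str]:
    """Check a model response against the Word Selection constraints:
    JSON array of 1-3 words, lowercase only, 4-14 letters,
    present verbatim in the passage. Raises ValueError on violation."""
    words = json.loads(raw)  # json.JSONDecodeError subclasses ValueError
    if not isinstance(words, list) or not 1 <= len(words) <= 3:
        raise ValueError("expected a JSON array of 1-3 words")
    for w in words:
        if not (isinstance(w, str) and w.isalpha() and w.islower()):
            raise ValueError(f"not a lowercase word: {w!r}")
        if not 4 <= len(w) <= 14:
            raise ValueError(f"length out of bounds: {w!r}")
        if w not in passage.lower():
            raise ValueError(f"not in passage: {w!r}")
    return words
```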


How to use

With peft + transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "milwright/cloze-reader-qwen3.5-0.8b-lora"

# The adapter repo ships the tokenizer; base weights come from Qwen.
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)

messages = [
    {"role": "system", "content": "Select words for a cloze exercise. Return ONLY a JSON array of words, nothing else."},
    {"role": "user", "content": "Select 1 challenging words (4-14 letters) from this passage.\n\nPassage: \"...\""},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
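With skip_special_tokens=True the decoded string still contains the prompt text, so the JSON array has to be pulled out of the tail of the output. A minimal extraction sketch (helper name is illustrative):

```python
import json
import re

def extract_json_array(decoded: str):
    """Pull the last JSON array out of a decoded generation; the decoded
    string includes the prompt, so we take the final bracketed span.
    Returns None when no parseable array is found."""
    matches = re.findall(r"\[[^\[\]]*\]", decoded)
    if not matches:
        return None
    try:
        return json.loads(matches[-1])
    except json.JSONDecodeError:
        return None
```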

With Unsloth (4-bit, fast)

```python
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    "milwright/cloze-reader-qwen3.5-0.8b-lora",
    max_seq_length=2048,
    load_in_4bit=True,
)
FastLanguageModel.for_inference(model)  # enable fast inference kernels
```

With Ollama (via GGUF)

A merged GGUF build of this adapter fits ~2.5–3 GB at Q4_K_M and runs with:

```shell
ollama create cloze-reader -f Modelfile
ollama run cloze-reader
```
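The Modelfile itself is not bundled here, but a generic sketch of what one could look like, assuming a locally exported merged GGUF (the filename and parameter values are illustrative only):

```
# Generic sketch -- the actual Modelfile lives in the project repo.
FROM ./cloze-reader-q4_k_m.gguf
PARAMETER temperature 0.3
PARAMETER num_ctx 2048
```

Recent GGUF exports usually embed the Qwen ChatML template in their metadata, so an explicit TEMPLATE directive is often unnecessary.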

(The GGUF + Modelfile are not bundled here — see the project repo for the export pipeline.)


Training

Data

  • Source: 15,981 filtered conversation examples (from 19,341 raw → 82.6% pass rate), distilled from google/gemma-3-27b-it serving the production cloze-reader endpoint.
  • Passages: Randomly-sampled text windows from 40 classic public-domain books via the manu/project_gutenberg corpus (61k books on the Hub).
  • Per-task composition (post-filter):
    • Word Selection — 7,433 examples
    • Batch Selection — 6,822 examples
    • Contextual Hints — 902 examples
    • Literary Contextualization — 824 examples
  • Format: ShareGPT-style {"conversations": [{"role": ..., "content": ...}, …]}, rendered through Qwen-3 ChatML (<|im_start|> / <|im_end|>).
  • Filter gates: JSON parsability, lowercase-only word selection, word-in-passage check, hint-safety (no leakage of the answer word), length bounds, em-dash / preamble interdiction.
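The length, em-dash, and preamble gates for Literary Contextualization outputs can be sketched as a single predicate (function name and the preamble list are illustrative, not the project's actual filter):

```python
# Illustrative preamble prefixes; the real filter list may differ.
PREAMBLES = ("this passage", "the passage", "this sentence")

def passes_contextualization_gates(text: str, max_words: int = 25) -> bool:
    """Filter gates for Literary Contextualization outputs:
    non-empty, at most 25 words, no em-dashes, no verbose preamble."""
    text = text.strip()
    if not text:
        return False
    if "\u2014" in text:               # em-dash interdiction
        return False
    if len(text.split()) > max_words:  # length bound
        return False
    return not text.lower().startswith(PREAMBLES)
```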

Procedure

Supervised fine-tuning with TRL + Unsloth's FastLanguageModel, loss masked to assistant turns via train_on_responses_only().

Hyperparameters

| Setting | Value |
| --- | --- |
| LoRA rank / alpha / dropout | 16 / 16 / 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Base load precision | 4-bit (bnb, nf4) via load_in_4bit=True |
| Training precision | bfloat16 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Weight decay | 0.01 |
| Per-device batch | 4 |
| Gradient accumulation | 4 |
| Max sequence length | 2048 |
| Epochs | 3 |
| Seed | 42 |
| Final training loss | 0.625 |
| Chat template | qwen-3 |
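The ~25 MB adapter size is consistent with rank-16 LoRA arithmetic: each targeted (d_in, d_out) projection adds an A (d_in × r) and B (r × d_out) matrix, i.e. r·(d_in + d_out) parameters. A sketch with purely illustrative layer shapes (the real Qwen3.5-0.8B dimensions may differ, so this shows the order of magnitude, not the exact figure):

```python
def lora_params(rank: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    """Total LoRA parameters across n_layers, with one adapter pair
    (A: d_in x r, B: r x d_out) per targeted projection per layer."""
    return n_layers * sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Illustrative shapes only -- not the real Qwen3.5-0.8B config:
# four attention projections plus a gated MLP.
shapes = [(1024, 1024)] * 4 + [(1024, 3072)] * 2 + [(3072, 1024)]
n = lora_params(16, shapes, n_layers=24)
print(n, "params ->", round(n * 2 / 2**20, 1), "MB at bf16")
```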

Compute

  • Hardware: Single NVIDIA RTX 5090 Laptop GPU (24 GB VRAM).
  • Framework: Unsloth 2026.3.8 · TRL 0.24 · Transformers 5.3 · PEFT 0.18.1 · PyTorch 2.10 / CUDA 12.8.
  • Checkpoints saved: every epoch (checkpoint-1311, -2622, -3933).

Evaluation

200 held-out examples (50 per task, seed 42, removed from training data pre-split) scored against the live Gemma-3-27B endpoint as baseline.

Word Selection (n=50)

| Metric | This LoRA | Gemma-3-27B | Δ |
| --- | --- | --- | --- |
| JSON valid | 100.0% | 94.0% | +6.0 |
| Format OK | 100.0% | 94.0% | +6.0 |
| Words valid | 88.0% | 90.0% | −2.0 |
| Valid ratio | 96.85% | 98.72% | −1.87 |

Batch Selection (n=50)

| Metric | This LoRA | Gemma-3-27B | Δ |
| --- | --- | --- | --- |
| JSON valid | 100.0% | 96.0% | +4.0 |
| Structure OK | 100.0% | 96.0% | +4.0 |
| Words present | 100.0% | 96.0% | +4.0 |

Contextual Hints (n=50)

| Metric | This LoRA | Gemma-3-27B | Δ |
| --- | --- | --- | --- |
| Non-empty | 100.0% | 100.0% | 0.0 |
| Word safe (no leak) | 98.0% | 100.0% | −2.0 |
| Length OK (15–25w) | 90.0% | 94.0% | −4.0 |
| Mean word count | 23.2 | 20.8 | +2.4 |

Literary Contextualization (n=50)

| Metric | This LoRA | Gemma-3-27B | Δ |
| --- | --- | --- | --- |
| Non-empty | 100.0% | 98.0% | +2.0 |
| Length OK (≤25w) | 100.0% | 98.0% | +2.0 |
| No em-dashes | 100.0% | 98.0% | +2.0 |
| No preamble | 100.0% | 96.0% | +4.0 |
| Mean word count | 19.0 | 16.9 | +2.1 |

Summary

  • Format / JSON compliance: the 0.8B student beats the 27B teacher on every structural metric. This is the usual distillation win — shape constraints fit inside a small parameter budget.
  • Content quality: near-parity. Word-selection validity lags by 2 points; hint word-safety lags by 2 points. Neither gap blocks production use in the cloze app.
  • Throughput / cost: ~33× fewer parameters, runs locally on a laptop GPU, removes the 27B API dependency.

Full per-metric JSON is at evaluation_results.json in the training project repo.


Intended use

In-scope. Serving the four cloze-reader tasks in the Cloze Reader app or a comparable vocabulary-practice / guided-reading UI, where inputs are short English prose passages (classical or modern) and outputs are JSON arrays / objects or tightly-length-bounded sentences.

Out of scope.

  • Open-ended generation, chat, or reasoning — this adapter has only seen 4 narrow instruction templates and will generalize poorly outside them.
  • Languages other than English — training data is English-only.
  • Safety-critical or factual-lookup tasks — no alignment or factuality work was performed beyond format-filtering.
  • Multimodal inputs — although Qwen3.5-0.8B is a vision-language model, this adapter was trained on text conversations only.

Limitations and risks

  • Distilled from a single teacher. Failure modes of gemma-3-27b-it on the 4 task prompts are inherited. If the teacher has a blind spot on certain passages (e.g., archaic or dialect text from Gutenberg), the student has the same one.
  • Gutenberg domain skew. Passages are drawn from ~40 classic public-domain books. Modern prose, social media, and non-narrative text are under-represented.
  • Format compliance is not correctness. "100% JSON valid" means the output parses; it does not guarantee the selected words are the pedagogically best choice. Human review is advised for educational deployment.
  • Hint-leakage floor of 2%. 1 in 50 hints referenced or strongly implied the target word in testing. Downstream code should keep a heuristic safety filter in place.
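A downstream safety filter of the kind recommended above can be a simple lexical heuristic. A sketch (function name and the crude prefix-stemming rule are illustrative; production code might use a real stemmer):

```python
import re

def hint_is_safe(hint: str, target: str) -> bool:
    """Heuristic leak check: reject hints that contain the target word
    or an obvious inflection of it (crude prefix match on the stem)."""
    hint_words = re.findall(r"[a-z]+", hint.lower())
    target = target.lower()
    stem = target[: max(4, len(target) - 3)]  # crude stemmer, illustrative
    return not any(w == target or w.startswith(stem) for w in hint_words)
```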

License

This adapter is released under Apache 2.0, inheriting from the base model Qwen/Qwen3.5-0.8B. Training data includes public-domain passages from Project Gutenberg and AI-generated outputs from google/gemma-3-27b-it; redistribution of the adapter weights themselves carries no Gutenberg restriction, but downstream users should honor Gemma's terms if they redistribute teacher generations separately.

Project context

This is the production LM for Cloze Reader, a reading-comprehension web app for practicing vocabulary through contextual word-blanking. Originally the app called a hosted Gemma-3-27B endpoint; this adapter was trained to bring inference on-device and retire the API dependency.

Developed as part of milwright/quimbot, a broader fine-tuning and evaluation project for small English-language models. See the repo's CLAUDE.md and fine-tuning/ for the larger pipeline.

Citations

Base model — Qwen Team, Qwen3.5-0.8B (2026), huggingface.co/Qwen/Qwen3.5-0.8B.

Teacher (training data) — Google DeepMind, Gemma 3 27B Instruct (2025), huggingface.co/google/gemma-3-27b-it.

Passage corpus — manu/project_gutenberg.

TRL

@misc{vonwerra2022trl,
  title  = {{TRL: Transformer Reinforcement Learning}},
  author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
  year   = 2020,
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/huggingface/trl}}
}

Unsloth — github.com/unslothai/unsloth.

PEFT — Mangrulkar et al., PEFT: State-of-the-art Parameter-Efficient Fine-Tuning, github.com/huggingface/peft.

Framework versions

  • PEFT 0.18.1
  • TRL 0.24.0
  • Transformers 5.3.0
  • PyTorch 2.10.0 + CUDA 12.8
  • Unsloth 2026.3.8