cloze-reader-qwen3.5-0.8b-lora
A rank-16 LoRA adapter over Qwen/Qwen3.5-0.8B that replaces a ~33× larger teacher (Gemma-3-27B-IT) as the language-model backend for the Cloze Reader reading-comprehension app (source).
On a 200-example held-out test set, the fine-tuned 0.8B student matches or exceeds the 27B teacher on 4 of 5 metric categories, with perfect format-constraint compliance on two tasks. The full model at bf16 is ~1.7 GB; at Q4_K_M it runs in ~2.5–3 GB of VRAM.
TL;DR
| Metric | This LoRA over Qwen3.5-0.8B | Baseline Gemma-3-27B-IT |
|---|---|---|
| Word Selection — JSON valid | 100.0% | 94.0% |
| Batch Selection — all metrics | 100.0% | 96.0% |
| Hints — word safety | 98.0% | 100.0% |
| Contextualization — all metrics | 100.0% | 96.0–98.0% |
| Params (total) | 0.8B + 25 MB adapter | 27B |
| VRAM at Q4_K_M | ~2.5–3 GB | ~16 GB |
What it does — the four tasks
The Cloze Reader app uses a single LM for four tightly-constrained text-generation tasks. The adapter was trained on examples of all four and produces the exact JSON / plaintext shapes the front-end expects.
- Word Selection. Given a passage, pick 1–3 vocabulary words to blank. Output is a JSON array such as ["laboratory", "synthesis"]. Constraints: lowercase only, 4–14 letters, must appear verbatim in the passage, no proper nouns.
- Batch Word Selection. Same task across two passages at once. Output is a JSON object with per-passage word lists plus supporting context.
- Contextual Hints. Given a blanked word and its sentence, return a 15–25-word Socratic hint that points at part of speech, sentence role, or semantic category without revealing the target word.
- Literary Contextualization. One-sentence insight about a passage (≤25 words, no em-dashes, no verbose preamble like "This passage is about…").
Exact prompt templates match aiService.js / conversationManager.js in the cloze-reader repo.
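For illustration, the Word Selection contract can be checked with a small validator like the one below (a minimal sketch; the function name and regex are illustrative, not the app's actual filter):
import json
import re
def validate_word_selection(raw_output: str, passage: str) -> bool:
    # Parse the model output and enforce the constraints listed above:
    # JSON array of 1-3 words, lowercase only, 4-14 letters, present in the passage.
    try:
        words = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(words, list) or not 1 <= len(words) <= 3:
        return False
    passage_lower = passage.lower()
    return all(
        isinstance(w, str)
        and re.fullmatch(r"[a-z]{4,14}", w)
        and w in passage_lower
        for w in words
    )
print(validate_word_selection('["laboratory", "synthesis"]',
                              'The laboratory synthesis of indigo...'))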
How to use
With peft + transformers
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
BASE = "Qwen/Qwen3.5-0.8B"
ADAPTER = "milwright/cloze-reader-qwen3.5-0.8b-lora"
tokenizer = AutoTokenizer.from_pretrained(ADAPTER)
base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
messages = [
{"role": "system", "content": "Select words for a cloze exercise. Return ONLY a JSON array of words, nothing else."},
{"role": "user", "content": "Select 1 challenging words (4-14 letters) from this passage.\n\nPassage: \"...\""},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
out = model.generate(**tokenizer(prompt, return_tensors="pt").to(model.device), max_new_tokens=64)
print(tokenizer.decode(out[0], skip_special_tokens=True))
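If you prefer to serve plain transformers weights, the adapter can be folded into the base model with PEFT's standard merge call (the output directory name here is arbitrary):
merged = model.merge_and_unload()  # fold the LoRA deltas into the base weights
merged.save_pretrained("cloze-reader-merged")
tokenizer.save_pretrained("cloze-reader-merged")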
With Unsloth (4-bit, fast)
from unsloth import FastLanguageModel
model, tokenizer = FastLanguageModel.from_pretrained(
"milwright/cloze-reader-qwen3.5-0.8b-lora",
max_seq_length=2048,
load_in_4bit=True,
)
FastLanguageModel.for_inference(model)
With Ollama (via GGUF)
A merged GGUF build of this adapter fits in ~2.5–3 GB of VRAM at Q4_K_M and runs with:
ollama create cloze-reader -f Modelfile
ollama run cloze-reader
(The GGUF + Modelfile are not bundled here — see the project repo for the export pipeline.)
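For reference, a minimal Modelfile for the commands above might look like this (the GGUF file name and parameter values are illustrative; the repo's export pipeline defines the actual one):
FROM ./cloze-reader-q4_k_m.gguf
PARAMETER temperature 0.2
PARAMETER num_ctx 2048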
Training
Data
- Source: 15,981 filtered conversation examples (from 19,341 raw → 82.6% pass rate), distilled from google/gemma-3-27b-it serving the production cloze-reader endpoint.
- Passages: randomly sampled text windows from 40 classic public-domain books via the manu/project_gutenberg corpus (61k books on the Hub).
- Per-task composition (post-filter):
  - Word Selection — 7,433 examples
  - Batch Selection — 6,822 examples
  - Contextual Hints — 902 examples
  - Literary Contextualization — 824 examples
- Format: ShareGPT-style {"conversations": [{"role": ..., "content": ...}, …]}, rendered through Qwen-3 ChatML (<|im_start|>/<|im_end|>).
- Filter gates: JSON parsability, lowercase-only word selection, word-in-passage check, hint safety (no leakage of the answer word), length bounds, em-dash / preamble interdiction.
Procedure
Supervised fine-tuning with TRL + Unsloth's FastLanguageModel, loss masked to assistant turns via train_on_responses_only().
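Under those choices, the training loop looks roughly like the sketch below (hyperparameters are taken from the table that follows; dataset loading and chat-template rendering are omitted, and train_dataset is assumed to be already prepared):
from unsloth import FastLanguageModel
from unsloth.chat_templates import train_on_responses_only
from trl import SFTConfig, SFTTrainer
# Load the 4-bit base model and attach a rank-16 LoRA on the listed projections
model, tokenizer = FastLanguageModel.from_pretrained(
    "Qwen/Qwen3.5-0.8B", max_seq_length=2048, load_in_4bit=True,
)
model = FastLanguageModel.get_peft_model(
    model, r=16, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)
trainer = SFTTrainer(
    model=model,
    processing_class=tokenizer,
    train_dataset=train_dataset,  # assumed: ShareGPT records rendered through the qwen-3 chat template
    args=SFTConfig(
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,
        num_train_epochs=3,
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        weight_decay=0.01,
        optim="adamw_8bit",
        bf16=True,
        seed=42,
    ),
)
# Mask the loss to assistant turns only, using the Qwen ChatML role markers
trainer = train_on_responses_only(
    trainer,
    instruction_part="<|im_start|>user\n",
    response_part="<|im_start|>assistant\n",
)
trainer.train()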
Hyperparameters
| Setting | Value |
|---|---|
| LoRA rank / alpha / dropout | 16 / 16 / 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Base load precision | 4-bit (bnb, nf4) via load_in_4bit=True |
| Training precision | bfloat16 |
| Optimizer | adamw_8bit |
| Learning rate | 2e-4, cosine schedule, 5% warmup |
| Weight decay | 0.01 |
| Per-device batch | 4 |
| Gradient accumulation | 4 |
| Max sequence length | 2048 |
| Epochs | 3 |
| Seed | 42 |
| Final training loss | 0.625 |
| Chat template | qwen-3 |
Compute
- Hardware: Single NVIDIA RTX 5090 Laptop GPU (24 GB VRAM).
- Framework: Unsloth 2026.3.8 · TRL 0.24 · Transformers 5.3 · PEFT 0.18.1 · PyTorch 2.10 / CUDA 12.8.
- Checkpoints saved: every epoch (checkpoint-1311, -2622, -3933).
Evaluation
200 held-out examples (50 per task, seed 42, removed from training data pre-split) scored against the live Gemma-3-27B endpoint as baseline.
Word Selection (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| JSON valid | 100.0% | 94.0% | +6.0 |
| Format OK | 100.0% | 94.0% | +6.0 |
| Words valid | 88.0% | 90.0% | −2.0 |
| Valid ratio | 96.85% | 98.72% | −1.87 |
Batch Selection (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| JSON valid | 100.0% | 96.0% | +4.0 |
| Structure OK | 100.0% | 96.0% | +4.0 |
| Words present | 100.0% | 96.0% | +4.0 |
Contextual Hints (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| Non-empty | 100.0% | 100.0% | — |
| Word safe (no leak) | 98.0% | 100.0% | −2.0 |
| Length OK (15–25w) | 90.0% | 94.0% | −4.0 |
| Mean word count | 23.2 | 20.8 | +2.4 |
Literary Contextualization (n=50)
| Metric | This LoRA | Gemma-3-27B | Δ |
|---|---|---|---|
| Non-empty | 100.0% | 98.0% | +2.0 |
| Length OK (≤25w) | 100.0% | 98.0% | +2.0 |
| No em-dashes | 100.0% | 98.0% | +2.0 |
| No preamble | 100.0% | 96.0% | +4.0 |
| Mean word count | 19.0 | 16.9 | +2.1 |
Summary
- Format / JSON compliance: the 0.8B student beats the 27B teacher on every structural metric. This is the usual distillation win — shape constraints fit inside a small parameter budget.
- Content quality: near-parity. Word-selection validity lags by 2 points; hint word-safety lags by 2 points. Neither gap blocks production use in the cloze app.
- Throughput / cost: ~33× fewer parameters, runs locally on a laptop GPU, removes the 27B API dependency.
Full per-metric JSON is at evaluation_results.json in the training project repo.
Intended use
In-scope. Serving the four cloze-reader tasks in the Cloze Reader app or a comparable vocabulary-practice / guided-reading UI, where inputs are short English prose passages (classical or modern) and outputs are JSON arrays / objects or tightly-length-bounded sentences.
Out of scope.
- Open-ended generation, chat, or reasoning — this adapter has only seen 4 narrow instruction templates and will generalize poorly outside them.
- Languages other than English — training data is English-only.
- Safety-critical or factual-lookup tasks — no alignment or factuality work was performed beyond format-filtering.
- Multimodal inputs — although Qwen3.5-0.8B is a vision-language model, this adapter was trained on text conversations only.
Limitations and risks
- Distilled from a single teacher. Failure modes of gemma-3-27b-it on the 4 task prompts are inherited. If the teacher has a blind spot on certain passages (e.g., archaic or dialect text from Gutenberg), the student has the same one.
- Gutenberg domain skew. Passages are drawn from ~40 classic public-domain books. Modern prose, social media, and non-narrative text are under-represented.
- Format compliance is not correctness. "100% JSON valid" means the output parses; it does not guarantee the selected words are the pedagogically best choice. Human review is advised for educational deployment.
- Hint-leakage floor of 2%. 1 in 50 hints referenced or strongly implied the target word in testing. Downstream code should keep a heuristic safety filter in place.
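A minimal example of such a filter (purely illustrative; the app's actual heuristic may differ):
def hint_leaks(hint: str, target: str) -> bool:
    # Reject the hint if it contains the target word or an obvious stem of it.
    hint_l, target_l = hint.lower(), target.lower()
    stem = target_l[: max(4, len(target_l) - 2)]  # crude stemming
    return target_l in hint_l or stem in hint_l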
License
This adapter is released under Apache 2.0, inheriting from the base model Qwen/Qwen3.5-0.8B. Training data includes public-domain passages from Project Gutenberg and AI-generated outputs from google/gemma-3-27b-it; redistribution of the adapter weights themselves carries no Gutenberg restriction, but downstream users should honor Gemma's terms if they redistribute teacher generations separately.
Project context
This is the production LM for Cloze Reader, a reading-comprehension web app for practicing vocabulary through contextual word-blanking. Originally the app called a hosted Gemma-3-27B endpoint; this adapter was trained to bring inference on-device and retire the API dependency.
Developed as part of milwright/quimbot, a broader fine-tuning and evaluation project for small English-language models. See the repo's CLAUDE.md and fine-tuning/ for the larger pipeline.
Citations
Base model — Qwen Team, Qwen3.5-0.8B (2026), huggingface.co/Qwen/Qwen3.5-0.8B.
Teacher (training data) — Google DeepMind, Gemma 3 27B Instruct (2025), huggingface.co/google/gemma-3-27b-it.
Passage corpus — manu/project_gutenberg.
TRL
@misc{vonwerra2022trl,
title = {{TRL: Transformer Reinforcement Learning}},
author = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallouédec},
year = 2020,
journal = {GitHub repository},
howpublished = {\url{https://github.com/huggingface/trl}}
}
Unsloth — github.com/unslothai/unsloth.
PEFT — Mangrulkar et al., PEFT: State-of-the-art Parameter-Efficient Fine-Tuning, github.com/huggingface/peft.
Framework versions
- PEFT 0.18.1
- TRL 0.24.0
- Transformers 5.3.0
- PyTorch 2.10.0 + CUDA 12.8
- Unsloth 2026.3.8