Qwen3.5-9B-OSS-Distilled

A reasoning-style distillation of Qwen/Qwen3.5-9B. The goal was a change in behavior, not in capability: stock Qwen3.5-9B frequently spirals on hard prompts, wandering inside its <think> block without ever terminating with an answer. This model was fine-tuned to adopt the tight, terminating reasoning style of openai/gpt-oss-20b so that it reliably finishes reasoning and produces an answer.

TL;DR

  • No-answer ("spiral-out") rate on a 400-prompt hard holdout: 36.2% → 0.5%.
  • On the 219 scored pairs (drawn from the 251 prompts where both models produced a usable answer), a blind A/B judge preferred this model in 60.3% of decided pairs (ties excluded).
  • This is a style fix. It does not add knowledge or raise the raw problem-solving ceiling.

Model details

  • Base model: Qwen/Qwen3.5-9B (Apache-2.0)
  • Teacher: openai/gpt-oss-20b (Apache-2.0)
  • Method: LoRA supervised fine-tuning (rank 16, alpha 16, bf16) with Unsloth, then merged into a standalone 16-bit model.
  • Training data: iAmBoosted/gpt-oss-20b-reasoning-traces — 3,333 filtered GPT-OSS-20B reasoning traces.
  • Language: English
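
For reference, merging a LoRA adapter folds the low-rank update back into each adapted weight matrix. The expression below is the standard LoRA formulation rather than anything stated on this card: it assumes the conventional α/r scaling (not the rank-stabilized variant), and the symbols W_0, A, B, d, and k are introduced here purely for illustration.

```latex
W_{\text{merged}} = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad A \in \mathbb{R}^{r \times k}
```

Under that assumption, rank r = 16 and alpha α = 16 give a scale factor of α/r = 1, so the adapter update is added to the base weights unscaled.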

Note on the base model. Qwen3.5-9B is a vision-language model. This distillation used text-only data and was evaluated on text-only prompts. Only the language/reasoning behavior was changed; any multimodal capability of the base is untested after fine-tuning and should not be relied on.

Intended use

Use it where you want Qwen3.5-9B-class reasoning that reliably terminates — math, science, code, and logic prompts that tend to make the stock model run away inside its reasoning. It is also a small, reproducible case study in reasoning-style distillation.

Out of scope: this is not a capability upgrade. It does not know more than the base model and should not be expected to beat it on tasks the base already handles well. Multimodal use is untested.

Evaluation

Evaluated on a 400-prompt held-out set drawn from the same sources as the training data. None of the held-out prompts were trained on.

Termination (the spiral fix)

Metric            | Stock Qwen3.5-9B | Distilled
Answered (ok)     | 251 / 400        | 397 / 400
No answer (empty) | 145 (36.2%)      | 2 (0.5%)
Truncated         | 4 (1.0%)         | 1 (0.2%)
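
A minimal sketch of how generations might be bucketed into the three rows above and turned into rates. The classification rules, the `classify_output` and `rates` helpers, and the `hit_token_limit` flag are assumptions made for illustration; the card does not spell out the exact criteria used. It also assumes reasoning is closed with a `</think>` tag, as in Qwen-style chat output.

```python
from collections import Counter


def classify_output(text: str, hit_token_limit: bool) -> str:
    """Bucket one generation as 'ok', 'empty', or 'truncated' (hypothetical rules)."""
    if "</think>" in text:
        # Everything after the last closing tag is treated as the answer.
        answer = text.rsplit("</think>", 1)[1].strip()
    elif "<think>" in text:
        # Reasoning was opened but never closed: a spiral with no answer.
        return "empty"
    else:
        answer = text.strip()

    if not answer:
        return "empty"
    if hit_token_limit:
        # An answer was started, but generation stopped at the length cap.
        return "truncated"
    return "ok"


def rates(counts: Counter, total: int) -> dict:
    """Convert raw counts into percentages of the evaluated prompts."""
    return {label: 100 * n / total for label, n in counts.items()}


# Counts copied from the termination table.
stock = Counter(ok=251, empty=145, truncated=4)
distilled = Counter(ok=397, empty=2, truncated=1)

for name, counts in (("stock", stock), ("distilled", distilled)):
    total = sum(counts.values())
    formatted = {k: f"{v:.2f}%" for k, v in rates(counts, total).items()}
    print(name, total, formatted)
```

Both columns sum to 400, which is a quick consistency check on the reported counts.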

Blind quality judgment

A blind, randomized A/B judge (a Gemma-class model, with no knowledge of which answer came from which model) compared the two models on the 251 prompts where both produced a usable answer; 219 pairs were scored.

Outcome             | Count | Share
Distilled preferred | 105   | 47.9%
Tie                 | 45    | 20.5%
Baseline preferred  | 69    | 31.5%

Ties excluded, the distilled model was preferred in 60.3% of decided pairs.
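
The arithmetic behind these shares can be sketched as follows. The `preference_summary` helper is a hypothetical name introduced here; only the three counts come from the card.

```python
def preference_summary(distilled: int, tie: int, baseline: int) -> dict:
    """Shares over all scored pairs, plus the distilled share of decided pairs."""
    total = distilled + tie + baseline
    decided = distilled + baseline  # ties are excluded from the head-to-head rate
    return {
        "scored_pairs": total,
        "decided_pairs": decided,
        "distilled_share": 100 * distilled / total,
        "tie_share": 100 * tie / total,
        "baseline_share": 100 * baseline / total,
        "distilled_share_of_decided": 100 * distilled / decided,
    }


summary = preference_summary(distilled=105, tie=45, baseline=69)
for key, value in summary.items():
    if isinstance(value, float):
        print(f"{key}: {value:.1f}%")
    else:
        print(f"{key}: {value}")
```

Note that the two headline numbers use different denominators: 47.9% is out of all 219 scored pairs, while 60.3% is out of the 174 pairs where the judge expressed a preference.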

Domain    | Distilled | Tie | Baseline
physics   | 29        | 0   | 12
biology   | 26        | 0   | 14
chemistry | 25        | 3   | 18
code      | 16        | 4   | 12
math      | 7         | 23  | 7
puzzle    | 2         | 15  | 6
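
As a cross-check, the per-domain rows can be rolled up and compared with the overall outcome counts. This is an illustrative sketch: `DOMAIN_COUNTS` and `decided_win_rate` are names introduced here, and the numbers are copied from the table above.

```python
# (distilled, tie, baseline) per domain, copied from the domain table.
DOMAIN_COUNTS = {
    "physics": (29, 0, 12),
    "biology": (26, 0, 14),
    "chemistry": (25, 3, 18),
    "code": (16, 4, 12),
    "math": (7, 23, 7),
    "puzzle": (2, 15, 6),
}


def decided_win_rate(distilled: int, baseline: int):
    """Distilled share of decided pairs in percent, or None if nothing was decided."""
    decided = distilled + baseline
    return 100 * distilled / decided if decided else None


# Column sums should reproduce the overall outcome counts (105 / 45 / 69).
totals = tuple(sum(column) for column in zip(*DOMAIN_COUNTS.values()))
print("totals (distilled, tie, baseline):", totals)

for domain, (d, t, b) in DOMAIN_COUNTS.items():
    rate = decided_win_rate(d, b)
    rate_text = "n/a" if rate is None else f"{rate:.1f}%"
    print(f"{domain:<10} decided={d + b:>3} ties={t:>3} distilled win rate={rate_text}")
```

The decided-pair counts are small in some domains (puzzle has only 8), so the per-domain rates are much noisier than the overall figure.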

Limitations

  • Style, not capability. The win is reliable termination and a cleaner reasoning style — not new knowledge or higher raw accuracy.
  • Puzzle domain. On puzzle prompts the baseline was preferred in 6 of the 8 decided pairs (15 ties). The tighter reasoning style appears to trim the exploratory wandering that some puzzles benefit from.
  • Math is roughly even (7 / 7, with 23 ties) — distillation neither clearly helped nor hurt math quality.
  • The judge was an LLM and was not human-validated. Treat the 60.3% as indicative, not definitive.
  • Coverage. Evaluation is a single 400-prompt holdout; 32 of the 251 comparable pairs were dropped due to API/parse failures during judging, leaving 219 scored.
  • Multimodal behavior is untested (see the note above).
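
One way to gauge how "indicative" the 60.3% figure is: put a confidence interval around it. The sketch below computes a Wilson score interval for 105 wins out of 174 decided pairs. It is an illustration added here, not an analysis the card's author reports, and it assumes the pairs are independent; it reflects sampling noise only and says nothing about judge bias.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half_width, center + half_width


# Decided pairs only: 105 distilled wins, 69 baseline wins (ties excluded).
wins, losses = 105, 69
low, high = wilson_interval(wins, wins + losses)
print(f"distilled preferred: {100 * wins / (wins + losses):.1f}% "
      f"(approx. 95% interval: {100 * low:.1f}% to {100 * high:.1f}%)")
```

The same function can be applied to a single domain's decided pairs, where the smaller counts produce a correspondingly wider interval.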

How to use

Load and run it exactly as you would the Qwen3.5-9B base model — this is a standard merged fine-tune. Qwen3.5 requires a recent transformers (and a recent vLLM if you serve it that way); see the base model card for the current version requirements and the canonical loading snippet.

License & attribution

Released under Apache-2.0, inherited from the Qwen3.5-9B base. Teacher outputs come from GPT-OSS-20B (Apache-2.0). Built with Unsloth. Training prompts derive from several open datasets with mixed licenses — see the dataset card for full source attribution and licensing.
