Use from the llama-cpp-python library
# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="CRAAAAAAAAAA/Qwable3.5-9B",
	filename="qwable3.5-9b-Q4_K_M.gguf",
)
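# create_chat_completion returns an OpenAI-style response dict
response = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "What is the capital of France?"}
    ]
)
print(response["choices"][0]["message"]["content"])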

Qwable3.5-9B

A post-trained derivative of Qwen/Qwen3.5-9B, distilled from a strong commercial teacher model and aligned through a two-stage SFT (STaR) → GRPO pipeline.

Qwable3.5-9B is a 9B-parameter causal language model built on the Qwen3.5-9B foundation. It keeps the base model's hybrid Gated DeltaNet + Gated Attention architecture and native Multi-Token Prediction (MTP) head, and adds task-specialized behavior via supervised fine-tuning followed by reinforcement learning. It is released under the Apache 2.0 license.

  • Developed by: CRAAAAAAAAAA
  • Model type: Causal language model (decoder-only, hybrid linear + full attention)
  • Base model: Qwen/Qwen3.5-9B
  • Parameters: ~9B
  • Context length: 262,144 tokens (native), extensible to ~1M
  • Languages: English, French
  • License: Apache 2.0
  • Finetuned from: Qwen3.5-9B via distillation + SFT + GRPO

Model Description

Qwable3.5-9B was produced in three stages on top of the Qwen3.5-9B base:

  1. Knowledge distillation from a strong commercial teacher model (not disclosed) into the Qwen3.5-9B student via chain-of-thought trace generation.
  2. Supervised fine-tuning using a STaR (Self-Taught Reasoner) style loop to bootstrap and filter reasoning traces.
  3. GRPO (Group Relative Policy Optimization) reinforcement learning with an execution-based correctness reward on code and math.

The base model's MTP head is preserved through the adapter-merge process, so the self-speculative decoding path remains available to downstream inference stacks that support it.

Intended use

  • Primary: Code generation (Python, algorithms), mathematical reasoning, instruction following, assistant chat in English and French.
  • Out of scope: High-stakes medical, legal, or financial decisions without human oversight; safety-critical systems; non-EN/FR languages (not characterized).

How to use

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CRAAAAAAAAAA/Qwable3.5-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function that checks if a number is prime."},
]

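# return_dict=True yields input_ids and attention_mask together
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # temperature/top_p only take effect when sampling is enabled
    temperature=0.7,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the prompt
prompt_len = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_len:], skip_special_tokens=True))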

llama.cpp (GGUF)

GGUF quantizations are available directly in this repo.

# Download Q4_K_M (recommended)
huggingface-cli download CRAAAAAAAAAA/Qwable3.5-9B qwable3.5-9b-Q4_K_M.gguf --local-dir .

# Run with optimized flags (52 tok/s on RTX 2060 6GB)
llama-server \
    -m qwable3.5-9b-Q4_K_M.gguf \
    -ngl 99 -c 2048 -np 1 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --spec-type ngram-map-k \
    --host 127.0.0.1 --port 8080
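
Once the server is up, it can be queried from Python. The following is a minimal sketch, assuming the server was started with the flags above and serves llama-server's OpenAI-compatible chat endpoint at /v1/chat/completions; the helper names `build_chat_request` and `ask` are illustrative and not part of any library.

```python
import json
import urllib.request

# Assumption: llama-server is listening on 127.0.0.1:8080 as in the command above.
SERVER_URL = "http://127.0.0.1:8080/v1/chat/completions"


def build_chat_request(prompt: str, max_tokens: int = 512) -> dict:
    """Build an OpenAI-style chat payload; temperature=0 requests greedy decoding."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0,
    }


def ask(prompt: str, url: str = SERVER_URL, timeout: float = 120.0) -> str:
    """POST the payload to the server and return the assistant's reply text."""
    body = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    return reply["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# print(ask("Write a Python function that checks if a number is prime."))
```

With -c 2048 the prompt and the reply share a 2048-token window, so `max_tokens` should stay well below that.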

Available quants

| File | Quant | Size | Notes |
|---|---|---|---|
| qwable3.5-9b-Q4_K_M.gguf | Q4_K_M | ~5.6 GB | recommended; fits 6 GB VRAM |

Multi-Token Prediction / speculative decoding: the base architecture ships a trained MTP head that can act as a built-in self-drafter on inference stacks that support it. For single-stream llama.cpp on a 6 GB GPU, --spec-type ngram-map-k (prompt-lookahead, zero extra VRAM) adds ~1 tok/s at no memory cost; an external draft model reduces throughput because of bandwidth contention.
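
To illustrate why n-gram drafting needs no extra VRAM, here is a toy sketch of the general prompt-lookup idea: match the most recent tokens against earlier context and propose whatever followed them last time. This is not llama.cpp's implementation; the function name and the `ngram_size` and `max_draft` parameters are assumptions made for illustration. In real speculative decoding the target model then verifies the proposed tokens and keeps only the accepted prefix.

```python
from typing import List, Sequence


def propose_draft(
    tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 4
) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context."""
    if len(tokens) <= ngram_size:
        return []
    suffix = list(tokens[-ngram_size:])
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == suffix:
            # Copy the tokens that followed the earlier occurrence.
            end = start + ngram_size + max_draft
            return list(tokens[start + ngram_size:end])
    return []  # no earlier occurrence: nothing to draft


history = [1, 2, 3, 4, 5, 1, 2, 3]
print(propose_draft(history))  # → [4, 5, 1, 2]
```

The drafter only reads the token history that is already in memory, which is why it adds no model weights.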


Training pipeline

Stage 0 — Distillation

  • Teacher: strong commercial teacher model (not disclosed)
  • Distillation type: sequence-level — SFT on teacher-generated chain-of-thought traces
  • Distillation data: private synthetic dataset
  • Objective: SFT on teacher CoT outputs (code + math reasoning traces)

Stage 1 — Supervised fine-tuning (STaR)

  • Method: STaR (Self-Taught Reasoner) — bootstrap rationales, keep only traces that reach the correct answer, retrain.
  • Adapter: final_sft (LoRA, ~465 MiB)
  • Dataset: private reasoning dataset (code + math)
  • Frameworks: TRL, PEFT
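
The filtering step of a STaR-style loop can be sketched as follows. This is a minimal illustration under stated assumptions, not the training code used for this model: the `Trace` type, the `generate` callback, and exact string matching as the correctness check are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass
class Trace:
    question: str
    rationale: str  # chain-of-thought text
    answer: str     # final answer extracted from the trace


def star_filter(
    problems: Iterable[Tuple[str, str]],
    generate: Callable[[str], List[Trace]],
) -> List[Trace]:
    """Keep only sampled traces whose final answer matches the reference."""
    kept: List[Trace] = []
    for question, reference in problems:
        for trace in generate(question):
            if trace.answer.strip() == reference.strip():
                kept.append(trace)
    return kept
```

The kept traces form the SFT set for the next round; repeating sample, filter, and retrain is what makes the loop self-bootstrapping.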

Stage 2 — GRPO (reinforcement learning)

  • Method: Group Relative Policy Optimization (GRPO)
  • Adapter: final_grpo (LoRA, ~232 MiB)
  • Reward signal: execution-based correctness reward (code: unit tests; math: symbolic grader)
  • Prompt data: private code + math prompt set
  • Frameworks: TRL
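
A rough sketch of how an execution-based reward feeds GRPO's group-relative advantage, assuming a binary pass/fail reward and the standard normalization (r - mean) / std over one prompt's sample group. The function names are hypothetical, and the actual reward shaping and TRL configuration used for this model are not specified in this card.

```python
import statistics
from typing import List


def execution_reward(candidate_src: str, test_src: str) -> float:
    """Return 1.0 if the candidate passes the unit tests, else 0.0.

    Illustration only: exec() is NOT sandboxed and has no timeout;
    a real pipeline would isolate and time-limit execution.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate's functions
        exec(test_src, namespace)       # assertions raise on failure
    except Exception:
        return 0.0
    return 1.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its own group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# One prompt, four sampled completions (two correct, two wrong).
tests = "assert add(2, 3) == 5"
group = [
    "def add(a, b): return a + b",
    "def add(a, b): return a - b",
    "def add(a, b): return b + a",
    "def add(a, b) return a + b",  # syntax error
]
rewards = [execution_reward(src, tests) for src in group]
advantages = group_relative_advantages(rewards)
```

Because advantages are computed relative to the group, a prompt where every sample passes (or every sample fails) contributes no learning signal.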

Merge & export

Adapters were merged into the base in training order (base → SFT → GRPO), then exported to safetensors and converted to GGUF from the merged checkpoint to keep all formats consistent.

  • Frameworks: TRL, PEFT, llama.cpp (conversion + quantization)
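
The sequential merge can be sketched with PEFT as below. This is an assumption-laden outline rather than the exact script used here: the local adapter paths and output directory are hypothetical, and the sketch does not address how the MTP head weights are carried through the merge.

```python
from typing import List


def merge_adapters_in_order(
    base_id: str, adapter_paths: List[str], output_dir: str
) -> None:
    """Merge LoRA adapters into the base one at a time and save as safetensors."""
    # Imported lazily so the module loads without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
    for path in adapter_paths:
        # Attach the adapter, fold its weights into the model, drop the wrapper.
        model = PeftModel.from_pretrained(model, path).merge_and_unload()

    model.save_pretrained(output_dir, safe_serialization=True)
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)


# Usage (hypothetical local paths; downloads the full base model):
# merge_adapters_in_order(
#     "Qwen/Qwen3.5-9B", ["./final_sft", "./final_grpo"], "./qwable-merged"
# )
```

The merged directory is then the single source for GGUF conversion, which keeps the safetensors and GGUF releases consistent.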

Evaluation

Scores measured locally with greedy decoding (temperature=0) on the full test sets unless noted. Qwable3.5-9B was never fine-tuned on any benchmark test set — all post-training data was collected independently of HumanEval, MBPP, GSM8K, MATH, MGSM, AIME, or LiveCodeBench evaluation splits.

| Benchmark | Metric | Qwable3.5-9B (GRPO) | Qwen3.5-9B (base) | Delta |
|---|---|---|---|---|
| HumanEval | pass@1 | 90.2% (148/164) | 87.2% | +3.0 pp |
| MBPP | pass@1 | 84.4% (217/257) | 82.5% | +1.9 pp |
| LiveCodeBench (global) | pass@1 | 32.0% (32/100) | 29.0% | +3.0 pp |
| LiveCodeBench (easy) | pass@1 | 100% (14/14) | | |
| LiveCodeBench (medium) | pass@1 | 46.2% (12/26) | | |
| LiveCodeBench (hard) | pass@1 | 10.0% (6/60) | | |
| GSM8K | acc | 96% | 96% | = |
| MGSM (fr) | acc | 84% | 82% | +2 pp |
| MATH Level 5 | acc | 70% | 77.5% | −7.5 pp |
| AIME | pass@1 | SFT: 53.3% | 43.3% | +10 pp |

  • Eval harness: custom scripts (llama.cpp llama-server v9637 + Python eval loop)
  • Decoding: temperature=0, greedy, max 512 tokens, thinking mode OFF
  • MATH Level 5 regression note: the GRPO and SFT checkpoints both regress vs. base on competition math (the GRPO model by 7.5 pp, 77.5% → 70%); GSM8K is unchanged and MGSM (fr) improves by 2 pp. Likely a capacity trade-off from code specialization.
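
The percentages and deltas in the table follow directly from the reported counts. As a quick check, using only numbers from the table (the helper names are illustrative):

```python
def pass_at_1(passed: int, total: int) -> float:
    """pass@1 with one greedy sample per problem, as a percentage."""
    return 100.0 * passed / total


def delta_pp(model_pct: float, base_pct: float) -> float:
    """Difference in percentage points, rounded to one decimal."""
    return round(model_pct - base_pct, 1)


# Counts copied from the results table.
humaneval = pass_at_1(148, 164)
mbpp = pass_at_1(217, 257)

# The LiveCodeBench global score is the pooled result over the three tiers.
lcb_tiers = {"easy": (14, 14), "medium": (12, 26), "hard": (6, 60)}
lcb_passed = sum(passed for passed, _ in lcb_tiers.values())
lcb_total = sum(total for _, total in lcb_tiers.values())
lcb_global = pass_at_1(lcb_passed, lcb_total)

print(f"HumanEval     {humaneval:.1f}%  ({delta_pp(90.2, 87.2):+.1f} pp vs. base)")
print(f"MBPP          {mbpp:.1f}%  ({delta_pp(84.4, 82.5):+.1f} pp vs. base)")
print(f"LiveCodeBench {lcb_global:.1f}%  ({lcb_passed}/{lcb_total})")
```

Note that the global LiveCodeBench score is dominated by the hard tier, which contributes 60 of the 100 problems.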

Limitations

  • MATH Level 5 regressed by 7.5 pp (77.5% → 70%). Code specialization appears to have shifted capacity away from formal multi-step proofs; this is treated as a real trade-off, not noise.
  • Inherited behavior: Qwable3.5-9B inherits the biases, knowledge cutoff, and failure modes of both Qwen/Qwen3.5-9B and the commercial teacher model.
  • Hallucination: like all LLMs, it can produce fluent but incorrect or fabricated content. Do not use outputs as authoritative without verification.
  • Domain scope: optimized for code generation and mathematical reasoning; performance on creative writing, general factual Q&A, or non-EN/FR languages is not characterized.
  • Safety: no dedicated safety fine-tuning or red-teaming has been performed beyond the base Qwen3.5-9B alignment.
  • Not for: high-stakes medical, legal, or financial decisions without human oversight.

License & attribution

This model is released under the Apache License 2.0.

It is a derivative of Qwen/Qwen3.5-9B, which is itself licensed under Apache 2.0. The original copyright and the base model's NOTICE (if any) are retained. You must preserve attribution and the license text when redistributing.

Distillation note: The teacher model used for distillation is not disclosed. Redistribution of this model assumes the teacher's license and terms permit using its outputs to train and openly release a derivative. If you reuse or further distill this model, verify that assumption for your use case.

Citation

@misc{qwable35_9b,
  title  = {Qwable3.5-9B},
  author = {CRAAAAAAAAAA},
  year   = {2026},
  url    = {https://huggingface.co/CRAAAAAAAAAA/Qwable3.5-9B}
}

Base model citation:

@misc{qwen3.5,
  title  = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  month  = {February},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.5}
}