Instructions for using CRAAAAAAAAAA/Qwable3.5-9B with libraries, notebooks, and local apps.

Libraries

How to use CRAAAAAAAAAA/Qwable3.5-9B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="CRAAAAAAAAAA/Qwable3.5-9B")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

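# Load model directly (AutoModelForCausalLM keeps the language-modeling head needed for generation)
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("CRAAAAAAAAAA/Qwable3.5-9B")
model = AutoModelForCausalLM.from_pretrained("CRAAAAAAAAAA/Qwable3.5-9B", dtype="auto")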

llama-cpp-python

How to use CRAAAAAAAAAA/Qwable3.5-9B with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="CRAAAAAAAAAA/Qwable3.5-9B",
	filename="qwable3.5-9b-Q4_K_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": "What is the capital of France?"
		}
	]
)
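
`create_chat_completion` returns an OpenAI-style response dictionary. The sketch below shows one way to pull the reply text out of it. The helper name `extract_reply` and the hand-written `sample_response` are illustrative assumptions, not real model output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from an OpenAI-style chat completion dict."""
    choices = response.get("choices") or []
    if not choices:
        raise ValueError("response contains no choices")
    return choices[0]["message"]["content"]


# Hand-written stand-in for the dict returned by llm.create_chat_completion(...)
sample_response = {
    "choices": [
        {
            "index": 0,
            "message": {"role": "assistant", "content": "Paris."},
            "finish_reason": "stop",
        }
    ]
}
print(extract_reply(sample_response))
```

With a loaded model, the same helper would be called as `extract_reply(llm.create_chat_completion(messages=[...]))`.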

llama.cpp

How to use CRAAAAAAAAAA/Qwable3.5-9B with llama.cpp:

Install from brew

brew install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
# Run inference directly in the terminal:
llama-cli -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
# Run inference directly in the terminal:
./llama-cli -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Use Docker

docker model run hf.co/CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

vLLM

How to use CRAAAAAAAAAA/Qwable3.5-9B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "CRAAAAAAAAAA/Qwable3.5-9B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CRAAAAAAAAAA/Qwable3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
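
The same request can be sent from Python using only the standard library. This is a minimal sketch, assuming the server exposes the OpenAI-compatible `/v1/chat/completions` route shown in the curl call above; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request


def build_chat_request(base_url: str, model: str, prompt: str) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url: str, model: str, prompt: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant reply (needs a running server)."""
    request = build_chat_request(base_url, model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Once the server is listening, `chat("http://localhost:8000", "CRAAAAAAAAAA/Qwable3.5-9B", "What is the capital of France?")` should return the reply text. The base URL is a parameter, so the helper also works against other OpenAI-compatible servers on a different port.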

SGLang

How to use CRAAAAAAAAAA/Qwable3.5-9B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "CRAAAAAAAAAA/Qwable3.5-9B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CRAAAAAAAAAA/Qwable3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "CRAAAAAAAAAA/Qwable3.5-9B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "CRAAAAAAAAAA/Qwable3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Ollama
How to use CRAAAAAAAAAA/Qwable3.5-9B with Ollama:
```
ollama run hf.co/CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
```

Unsloth Studio

How to use CRAAAAAAAAAA/Qwable3.5-9B with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CRAAAAAAAAAA/Qwable3.5-9B to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for CRAAAAAAAAAA/Qwable3.5-9B to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for CRAAAAAAAAAA/Qwable3.5-9B to start chatting

Pi

How to use CRAAAAAAAAAA/Qwable3.5-9B with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M"
        }
      ]
    }
  }
}
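
If you prefer to generate this file rather than edit it by hand, the sketch below builds the same structure in Python. The function names are hypothetical, and `write_pi_models_config` overwrites any existing file at the target path, so merge by hand if you already have other providers configured.

```python
import json
from pathlib import Path


def build_pi_models_config(base_url: str, model_id: str) -> dict:
    """Return the provider block shown above as a Python dict."""
    return {
        "providers": {
            "llama-cpp": {
                "baseUrl": base_url,
                "api": "openai-completions",
                "apiKey": "none",
                "models": [{"id": model_id}],
            }
        }
    }


def write_pi_models_config(path: Path, config: dict) -> None:
    """Write the config as indented JSON, creating parent directories if needed."""
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")


config = build_pi_models_config(
    "http://localhost:8080/v1", "CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M"
)
print(json.dumps(config, indent=2))
# To install it:
# write_pi_models_config(Path.home() / ".pi" / "agent" / "models.json", config)
```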

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use CRAAAAAAAAAA/Qwable3.5-9B with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama-server -hf CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Run Hermes

hermes

Docker Model Runner
How to use CRAAAAAAAAAA/Qwable3.5-9B with Docker Model Runner:
```
docker model run hf.co/CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M
```

Lemonade

How to use CRAAAAAAAAAA/Qwable3.5-9B with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull CRAAAAAAAAAA/Qwable3.5-9B:Q4_K_M

Run and chat with the model

lemonade run user.Qwable3.5-9B-Q4_K_M

List all available models

lemonade list

Qwable3.5-9B

A post-trained derivative of Qwen/Qwen3.5-9B, distilled from a strong commercial teacher model and aligned through a two-stage SFT (STaR) → GRPO pipeline.

Qwable3.5-9B is a 9B-parameter causal language model built on the Qwen3.5-9B foundation. It keeps the base model's hybrid Gated DeltaNet + Gated Attention architecture and native Multi-Token Prediction (MTP) head, and adds task-specialized behavior via supervised fine-tuning followed by reinforcement learning. It is released under the Apache 2.0 license.

Developed by: CRAAAAAAAAAA
Model type: Causal language model (decoder-only, hybrid linear + full attention)
Base model: Qwen/Qwen3.5-9B
Parameters: ~9B
Context length: 262,144 tokens (native), extensible to ~1M
Languages: English, French
License: Apache 2.0
Finetuned from: Qwen3.5-9B via distillation + SFT + GRPO

Model Description

Qwable3.5-9B was produced in three stages on top of the Qwen3.5-9B base:

1. Knowledge distillation from a strong commercial teacher model (not disclosed) into the Qwen3.5-9B student via chain-of-thought trace generation.
2. Supervised fine-tuning using a STaR (Self-Taught Reasoner) style loop to bootstrap and filter reasoning traces.
3. GRPO (Group Relative Policy Optimization) reinforcement learning with an execution-based correctness reward on code and math.

The base model's MTP head is preserved through the adapter-merge process, so the self-speculative decoding path remains available to downstream inference stacks that support it.

Intended use

Primary: Code generation (Python, algorithms), mathematical reasoning, instruction following, assistant chat in English and French.
Out of scope: High-stakes medical, legal, or financial decisions without human oversight; safety-critical systems; non-EN/FR languages (not characterized).

How to use

Transformers

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CRAAAAAAAAAA/Qwable3.5-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function that checks if a number is prime."},
]

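# return_dict=True yields input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

# temperature and top_p only take effect when do_sample=True.
outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, skipping the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))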

llama.cpp (GGUF)

GGUF quantizations are available directly in this repo.

# Download Q4_K_M (recommended)
huggingface-cli download CRAAAAAAAAAA/Qwable3.5-9B qwable3.5-9b-Q4_K_M.gguf --local-dir .

# Run with optimized flags (52 tok/s on RTX 2060 6GB)
llama-server \
    -m qwable3.5-9b-Q4_K_M.gguf \
    -ngl 99 -c 2048 -np 1 \
    --flash-attn on \
    --cache-type-k q8_0 --cache-type-v q8_0 \
    --spec-type ngram-map-k \
    --host 127.0.0.1 --port 8080

Available quants

File	Quant	Size	Notes
`qwable3.5-9b-Q4_K_M.gguf`	Q4_K_M	~5.6 GB	recommended — fits 6 GB VRAM

Multi-Token Prediction / speculative decoding: the base architecture ships a trained MTP head usable as a built-in self-drafter on inference stacks that support it. For llama.cpp single-stream on 6 GB, --spec-type ngram-map-k (prompt-lookahead, zero extra VRAM) adds ~1 tok/s for free; an external draft model degrades throughput due to bandwidth contention.
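
To illustrate why prompt-lookahead drafting needs no extra VRAM, here is a generic sketch of n-gram prompt lookup: the drafter searches the existing context for the most recent earlier occurrence of the trailing n-gram and proposes the tokens that followed it. This is not llama.cpp's `ngram-map-k` implementation; the function names and default sizes are assumptions chosen for illustration.

```python
from typing import List, Sequence


def ngram_draft(tokens: Sequence[int], ngram_size: int = 3, max_draft: int = 4) -> List[int]:
    """Propose draft tokens by matching the trailing n-gram against earlier context.

    Returns up to `max_draft` tokens that followed the most recent earlier
    occurrence of the last `ngram_size` tokens, or [] when there is no match.
    """
    if ngram_size <= 0 or len(tokens) <= ngram_size:
        return []
    tail = list(tokens[-ngram_size:])
    # Scan backwards, skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if list(tokens[start:start + ngram_size]) == tail:
            follow = start + ngram_size
            return list(tokens[follow:follow + max_draft])
    return []


def accepted_prefix(draft: Sequence[int], verified: Sequence[int]) -> List[int]:
    """Keep draft tokens up to the first disagreement with the model's own tokens."""
    accepted = []
    for proposed, actual in zip(draft, verified):
        if proposed != actual:
            break
        accepted.append(proposed)
    return accepted


context = [1, 2, 3, 4, 5, 1, 2, 3]
draft = ngram_draft(context)
print(draft)  # [4, 5, 1, 2]
print(accepted_prefix(draft, [4, 5, 9, 2]))  # [4, 5]
```

The drafter only reads tokens that are already in the context, which is why it costs no additional model weights; the speed-up depends on how often the output repeats earlier text.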

Training pipeline

Stage 0 — Distillation

Teacher: strong commercial teacher model (not disclosed)
Distillation type: sequence-level — SFT on teacher-generated chain-of-thought traces
Distillation data: private synthetic dataset
Objective: SFT on teacher CoT outputs (code + math reasoning traces)

Stage 1 — Supervised fine-tuning (STaR)

Method: STaR (Self-Taught Reasoner) — bootstrap rationales, keep only traces that reach the correct answer, retrain.
Adapter: final_sft (LoRA, ~465 MiB)
Dataset: private reasoning dataset (code + math)
Frameworks: TRL, PEFT
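
The filtering step of a STaR-style round can be sketched as follows. This is a toy illustration under stated assumptions, not the training code used for this model: `toy_sampler` stands in for sampling from the model, the `attempts` default is arbitrary, and the retraining step is omitted.

```python
from typing import Callable, Dict, List, Tuple

Problem = Dict[str, str]                      # {"question": ..., "answer": ...}
Sampler = Callable[[str], Tuple[str, str]]    # question -> (rationale, final_answer)


def star_filter(problems: List[Problem], sample: Sampler, attempts: int = 4) -> List[Dict[str, str]]:
    """Keep one rationale per problem whose final answer matches the reference."""
    kept = []
    for problem in problems:
        for _ in range(attempts):
            rationale, answer = sample(problem["question"])
            if answer.strip() == problem["answer"].strip():
                kept.append(
                    {"question": problem["question"], "rationale": rationale, "answer": answer}
                )
                break  # one correct trace per problem is enough for this round
    return kept


def toy_sampler(question: str) -> Tuple[str, str]:
    """Hypothetical stand-in for model sampling: it only solves one problem."""
    if question == "2 + 2":
        return ("Add the two numbers: 2 + 2 = 4.", "4")
    return ("I guess.", "0")


problems = [{"question": "2 + 2", "answer": "4"}, {"question": "3 * 3", "answer": "9"}]
sft_data = star_filter(problems, toy_sampler)
print(len(sft_data))  # 1
```

The kept traces become the supervised fine-tuning set for the next round; problems with no correct trace are simply dropped in this sketch.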

Stage 2 — GRPO (reinforcement learning)

Method: Group Relative Policy Optimization (GRPO)
Adapter: final_grpo (LoRA, ~232 MiB)
Reward signal: execution-based correctness reward (code: unit tests; math: symbolic grader)
Prompt data: private code + math prompt set
Frameworks: TRL
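
As a rough illustration of how an execution-based reward feeds GRPO, the sketch below scores a group of completions for one prompt with unit tests and then normalizes the rewards within the group. It is a simplified sketch, not the pipeline used here: it runs code with `exec()` in-process (a real setup would sandbox untrusted code and enforce timeouts), uses the population standard deviation (implementations differ on this), and all names are hypothetical.

```python
import statistics
from typing import List


def unit_test_reward(candidate_source: str, test_source: str) -> float:
    """Return 1.0 if the candidate passes every assert in test_source, else 0.0."""
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # illustration only: not sandboxed
        exec(test_source, namespace)
    except Exception:
        return 0.0
    return 1.0


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


completions = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",   # wrong result
    "def add(a, b):\n    return a + b\n",
    "def add(a, b)\n    return a + b\n",    # syntax error
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

rewards = [unit_test_reward(c, tests) for c in completions]
advantages = group_relative_advantages(rewards)
print(rewards)  # [1.0, 0.0, 1.0, 0.0]
print([round(a, 3) for a in advantages])
```

When every completion in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal.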

Merge & export

Adapters were merged into the base in training order (base → SFT → GRPO), then exported to safetensors and converted to GGUF from the merged checkpoint to keep all formats consistent.

Frameworks: TRL, PEFT, llama.cpp (conversion + quantization)
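
Merging a LoRA adapter folds its low-rank update into the base weight: W' = W + (alpha / r) · B · A. The toy sketch below applies two adapters in sequence on a 2×2 matrix to mirror the base → SFT → GRPO order. The matrices, `alpha`, and `rank` values are invented for illustration; the card does not state the adapters' hyperparameters. In practice this is done per layer by library code such as PEFT's `merge_and_unload()`.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix, alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * (B @ A) without modifying the input weight."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [
        [w + scale * d for w, d in zip(w_row, d_row)]
        for w_row, d_row in zip(weight, delta)
    ]


base = [[1.0, 0.0], [0.0, 1.0]]

# Hypothetical rank-1 adapters standing in for final_sft and final_grpo.
after_sft = merge_lora(base, lora_b=[[1.0], [0.0]], lora_a=[[0.0, 2.0]], alpha=1.0, rank=1)
after_grpo = merge_lora(after_sft, lora_b=[[0.0], [1.0]], lora_a=[[1.0, 0.0]], alpha=2.0, rank=1)

print(after_sft)   # [[1.0, 2.0], [0.0, 1.0]]
print(after_grpo)  # [[1.0, 2.0], [2.0, 1.0]]
```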

Evaluation

Scores measured locally with greedy decoding (temperature=0) on the full test sets unless noted. Qwable3.5-9B was never fine-tuned on any benchmark test set — all post-training data was collected independently of HumanEval, MBPP, GSM8K, MATH, MGSM, AIME, or LiveCodeBench evaluation splits.

Benchmark	Metric	Qwable3.5-9B (GRPO)	Qwen3.5-9B (base)	Delta
HumanEval	pass@1	90.2% (148/164)	87.2%	+3.0 pp
MBPP	pass@1	84.4% (217/257)	82.5%	+1.9 pp
LiveCodeBench (global)	pass@1	32.0% (32/100)	29.0%	+3.0 pp
LiveCodeBench — easy	pass@1	100% (14/14)	—	—
LiveCodeBench — medium	pass@1	46.2% (12/26)	—	—
LiveCodeBench — hard	pass@1	10.0% (6/60)	—	—
GSM8K	acc	96%	96%	=
MGSM (fr)	acc	84%	82%	+2 pp
MATH Level 5	acc	70%	77.5%	−7.5 pp
AIME	pass@1	SFT: 53.3%	43.3%	+10 pp

Eval harness: custom scripts (llama.cpp llama-server v9637 + Python eval loop)
Decoding: temperature=0, greedy, max 512 tokens, thinking mode OFF
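MATH Level 5 regression note: both the SFT and GRPO checkpoints regress against the base model on competition math (77.5% → 70% for the GRPO checkpoint, a 7.5 pp drop). GSM8K is unchanged and MGSM (fr) improves by 2 pp. The likely cause is a capacity trade-off from code specialization.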
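
The percentages in the table follow directly from the pass counts. A short check, using only the counts reported above (the helper name `pass_rate` is illustrative):

```python
def pass_rate(passed: int, total: int) -> float:
    """Percentage of problems solved, rounded to one decimal place."""
    return round(100.0 * passed / total, 1)


humaneval = pass_rate(148, 164)
mbpp = pass_rate(217, 257)

# LiveCodeBench global score is the pooled result over the three difficulty buckets.
lcb_buckets = {"easy": (14, 14), "medium": (12, 26), "hard": (6, 60)}
lcb_passed = sum(passed for passed, _ in lcb_buckets.values())
lcb_total = sum(total for _, total in lcb_buckets.values())
lcb_global = pass_rate(lcb_passed, lcb_total)

print(humaneval, mbpp, lcb_global)  # 90.2 84.4 32.0
for name, (passed, total) in lcb_buckets.items():
    print(name, pass_rate(passed, total))
```

Because the global LiveCodeBench score pools all 100 problems, it is dominated by the 60 hard problems rather than being a simple average of the three bucket rates.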

Limitations

MATH Level 5 regressed by 7.5 pp (77.5% → 70%). Code specialization appears to have shifted capacity away from formal multi-step proofs; this is treated as a real trade-off, not noise.
Inherited behavior: Qwable3.5-9B inherits the biases, knowledge cutoff, and failure modes of both Qwen/Qwen3.5-9B and the commercial teacher model.
Hallucination: like all LLMs, it can produce fluent but incorrect or fabricated content. Do not use outputs as authoritative without verification.
Domain scope: optimized for code generation and mathematical reasoning; performance on creative writing, general factual Q&A, or non-EN/FR languages is not characterized.
Safety: no dedicated safety fine-tuning or red-teaming has been performed beyond the base Qwen3.5-9B alignment.
Not for: high-stakes medical, legal, or financial decisions without human oversight.

License & attribution

This model is released under the Apache License 2.0.

It is a derivative of Qwen/Qwen3.5-9B, which is itself licensed under Apache 2.0. The original copyright and the base model's NOTICE (if any) are retained. You must preserve attribution and the license text when redistributing.

Distillation note: The teacher model used for distillation is not disclosed. Redistribution of this model assumes the teacher's license and terms permit using its outputs to train and openly release a derivative. If you reuse or further distill this model, verify that assumption for your use case.

Citation

@misc{qwable35_9b,
  title  = {Qwable3.5-9B},
  author = {CRAAAAAAAAAA},
  year   = {2026},
  url    = {https://huggingface.co/CRAAAAAAAAAA/Qwable3.5-9B}
}

Base model citation:

@misc{qwen3.5,
  title  = {{Qwen3.5}: Towards Native Multimodal Agents},
  author = {{Qwen Team}},
  month  = {February},
  year   = {2026},
  url    = {https://qwen.ai/blog?id=qwen3.5}
}

GGUF

Model size: 9B params
Architecture: qwen35
Available quantization: 4-bit

Model tree for CRAAAAAAAAAA/Qwable3.5-9B

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Finetuned: this model
