Qwable3.5-9B
A post-trained derivative of Qwen/Qwen3.5-9B, distilled from a strong commercial teacher model and aligned through a two-stage SFT (STaR) → GRPO pipeline.
Qwable3.5-9B is a 9B-parameter causal language model built on the Qwen3.5-9B foundation. It keeps the base model's hybrid Gated DeltaNet + Gated Attention architecture and native Multi-Token Prediction (MTP) head, and adds task-specialized behavior via supervised fine-tuning followed by reinforcement learning. It is released under the Apache 2.0 license.
- Developed by: CRAAAAAAAAAA
- Model type: Causal language model (decoder-only, hybrid linear + full attention)
- Base model: Qwen/Qwen3.5-9B
- Parameters: ~9B
- Context length: 262,144 tokens (native), extensible to ~1M
- Languages: English, French
- License: Apache 2.0
- Finetuned from: Qwen3.5-9B via distillation + SFT + GRPO
Model Description
Qwable3.5-9B was produced in three stages on top of the Qwen3.5-9B base:
- Knowledge distillation from a strong commercial teacher model (not disclosed) into the Qwen3.5-9B student via chain-of-thought trace generation.
- Supervised fine-tuning using a STaR (Self-Taught Reasoner) style loop to bootstrap and filter reasoning traces.
- GRPO (Group Relative Policy Optimization) reinforcement learning with an execution-based correctness reward on code and math.
The base model's MTP head is preserved through the adapter-merge process, so the self-speculative decoding path remains available to downstream inference stacks that support it.
Intended use
- Primary: Code generation (Python, algorithms), mathematical reasoning, instruction following, assistant chat in English and French.
- Out of scope: High-stakes medical, legal, or financial decisions without human oversight; safety-critical systems; non-EN/FR languages (not characterized).
How to use
Transformers
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "CRAAAAAAAAAA/Qwable3.5-9B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a Python function that checks if a number is prime."},
]

# return_dict=True returns input_ids and attention_mask together.
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=2048,
    do_sample=True,  # sample explicitly so temperature/top_p are not ignored under greedy decoding
    temperature=0.7,
    top_p=0.9,
)

# Decode only the newly generated tokens, not the prompt.
prompt_length = inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True))
llama.cpp (GGUF)
GGUF quantizations are available directly in this repo.
# Download Q4_K_M (recommended)
huggingface-cli download CRAAAAAAAAAA/Qwable3.5-9B qwable3.5-9b-Q4_K_M.gguf --local-dir .
# Run with optimized flags (52 tok/s on RTX 2060 6GB)
llama-server \
-m qwable3.5-9b-Q4_K_M.gguf \
-ngl 99 -c 2048 -np 1 \
--flash-attn on \
--cache-type-k q8_0 --cache-type-v q8_0 \
--spec-type ngram-map-k \
--host 127.0.0.1 --port 8080
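Once the server above is running, it exposes an OpenAI-compatible chat endpoint. The following is a minimal client sketch using only the Python standard library. It assumes the host and port from the command above; the helper names `build_chat_request` and `chat` are illustrative and not part of llama.cpp.

```python
import json
import urllib.request

# Assumption: matches --host/--port in the llama-server command above.
BASE_URL = "http://127.0.0.1:8080/v1"


def build_chat_request(prompt, max_tokens=512, temperature=0.0):
    """Build the JSON payload for POST /v1/chat/completions (hypothetical helper)."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, base_url=BASE_URL, timeout=120):
    """Send one user message and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # OpenAI-style responses carry the reply under choices[0].message.content.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Write a Python function that checks if a number is prime."))` prints the reply. Because the command above sets `-c 2048`, the prompt and `max_tokens` together need to fit in that context window.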
Available quants
| File | Quant | Size | Notes |
|---|---|---|---|
| qwable3.5-9b-Q4_K_M.gguf | Q4_K_M | ~5.6 GB | recommended; fits 6 GB VRAM |
Multi-Token Prediction / speculative decoding: the base architecture ships a trained MTP head usable as a built-in self-drafter on inference stacks that support it. For llama.cpp single-stream on 6 GB, --spec-type ngram-map-k (prompt-lookahead, zero extra VRAM) adds ~1 tok/s for free; an external draft model degrades throughput due to bandwidth contention.
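To illustrate why prompt-lookahead drafting needs no extra VRAM, here is a sketch of the general idea: reuse tokens already in the context as a draft, then let the main model verify them. This is not llama.cpp's `ngram-map-k` implementation; the function names and the `ngram_size` and `max_draft` defaults are assumptions made for illustration.

```python
def propose_draft(tokens, ngram_size=3, max_draft=4):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    Returns the tokens that followed the most recent earlier occurrence of the
    last `ngram_size` tokens, or an empty list when there is no match.
    """
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            follow = start + ngram_size
            return tokens[follow:follow + max_draft]
    return []


def accept_prefix(draft, verified):
    """Count how many leading draft tokens agree with the main model's own tokens."""
    accepted = 0
    for drafted, actual in zip(draft, verified):
        if drafted != actual:
            break
        accepted += 1
    return accepted
```

The draft comes from a lookup over token IDs already held in memory, so no second model is loaded. The speedup depends on how often the output repeats spans of the context, which is common when editing or quoting code.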
Training pipeline
Stage 0 — Distillation
- Teacher: strong commercial teacher model (not disclosed)
- Distillation type: sequence-level — SFT on teacher-generated chain-of-thought traces
- Distillation data: private synthetic dataset
- Objective: SFT on teacher CoT outputs (code + math reasoning traces)
Stage 1 — Supervised fine-tuning (STaR)
- Method: STaR (Self-Taught Reasoner) — bootstrap rationales, keep only traces that reach the correct answer, retrain.
- Adapter: final_sft (LoRA, ~465 MiB)
- Dataset: private reasoning dataset (code + math)
- Frameworks: TRL, PEFT
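The filtering step of a STaR-style loop can be sketched as follows. This is an illustration under stated assumptions, not the training code used for this model: `generate_rationale` is a hypothetical callable returning a `(rationale, answer)` pair, and the problem dictionaries with `question` and `answer` keys are an assumed data layout. The rationalization step of the original STaR method and the retraining itself are omitted.

```python
def star_filter(problems, generate_rationale, samples_per_problem=4):
    """Keep only (question, rationale) pairs whose final answer matches the reference."""
    kept = []
    for problem in problems:
        for _ in range(samples_per_problem):
            rationale, answer = generate_rationale(problem["question"])
            if answer == problem["answer"]:
                kept.append(
                    {"question": problem["question"], "rationale": rationale}
                )
                break  # one correct trace per problem is enough for this sketch
    return kept
```

The retained pairs would then form the SFT dataset for the next iteration, and the loop repeats with the retrained model generating new rationales.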
Stage 2 — GRPO (reinforcement learning)
- Method: Group Relative Policy Optimization (GRPO)
- Adapter: final_grpo (LoRA, ~232 MiB)
- Reward signal: execution-based correctness reward (code: unit tests; math: symbolic grader)
- Prompt data: private code + math prompt set
- Frameworks: TRL
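The two ingredients named above, an execution-based reward and group-relative advantages, can be sketched in plain Python. This is a simplified illustration, not the reward code used for this model: the binary 0/1 reward, the use of `exec`, the population standard deviation, and the `eps` value are all assumptions of the sketch.

```python
import statistics


def execution_reward(candidate_source, test_source):
    """Return 1.0 if the candidate passes every assertion in the tests, else 0.0.

    Caution: exec runs arbitrary code; a real pipeline would sandbox this step.
    """
    namespace = {}
    try:
        exec(candidate_source, namespace)
        exec(test_source, namespace)
    except Exception:
        return 0.0
    return 1.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of sampled completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(reward - mean) / (std + eps) for reward in rewards]
```

Each prompt is sampled several times, every completion is scored, and each completion's advantage is its reward relative to the group mean. When all completions in a group receive the same reward, every advantage is zero and that prompt contributes no learning signal.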
Merge & export
Adapters were merged into the base in training order (base → SFT → GRPO), then exported to safetensors and converted to GGUF from the merged checkpoint to keep all formats consistent.
- Frameworks: TRL, PEFT, llama.cpp (conversion + quantization)
Evaluation
Scores measured locally with greedy decoding (temperature=0) on the full test sets unless noted. Qwable3.5-9B was never fine-tuned on any benchmark test set — all post-training data was collected independently of HumanEval, MBPP, GSM8K, MATH, MGSM, AIME, or LiveCodeBench evaluation splits.
| Benchmark | Metric | Qwable3.5-9B (GRPO) | Qwen3.5-9B (base) | Delta |
|---|---|---|---|---|
| HumanEval | pass@1 | 90.2% (148/164) | 87.2% | +3.0 pp |
| MBPP | pass@1 | 84.4% (217/257) | 82.5% | +1.9 pp |
| LiveCodeBench (global) | pass@1 | 32.0% (32/100) | 29.0% | +3.0 pp |
| LiveCodeBench — easy | pass@1 | 100% (14/14) | — | — |
| LiveCodeBench — medium | pass@1 | 46.2% (12/26) | — | — |
| LiveCodeBench — hard | pass@1 | 10.0% (6/60) | — | — |
| GSM8K | acc | 96% | 96% | = |
| MGSM (fr) | acc | 84% | 82% | +2 pp |
| MATH Level 5 | acc | 70% | 77.5% | −7.5 pp |
| AIME | pass@1 | SFT: 53.3% | 43.3% | +10 pp |
- Eval harness: custom scripts (llama.cpp llama-server v9637 + Python eval loop)
- Decoding: temperature=0, greedy, max 512 tokens, thinking mode OFF
- MATH Level 5 regression note: GRPO/SFT show a regression vs. base on competition math (−7.5 pp in the table above); GSM8K is unchanged and MGSM (fr) improves by 2 pp. Likely a capacity trade-off from code specialization.
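The pass@1 percentages and deltas in the table can be recomputed from the reported counts. The sketch below copies the counts and base scores from the table; the names `counts`, `base_scores`, `pass_at_1`, and `delta_pp` are illustrative.

```python
# Counts copied from the results table above: (passed, total).
counts = {
    "HumanEval": (148, 164),
    "MBPP": (217, 257),
    "LiveCodeBench (global)": (32, 100),
}
# Base-model scores copied from the same table, in percent.
base_scores = {"HumanEval": 87.2, "MBPP": 82.5, "LiveCodeBench (global)": 29.0}


def pass_at_1(passed, total):
    """With one greedy sample per problem, pass@1 is the fraction of problems solved."""
    return 100.0 * passed / total


def delta_pp(name):
    """Difference vs. base in percentage points, rounded to one decimal as in the table."""
    passed, total = counts[name]
    return round(round(pass_at_1(passed, total), 1) - base_scores[name], 1)


for name in counts:
    print(f"{name}: {pass_at_1(*counts[name]):.1f}% ({delta_pp(name):+.1f} pp)")
```

The same check applies to the LiveCodeBench difficulty split: 14 + 12 + 6 solved out of 14 + 26 + 60 problems gives the global 32/100.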
Limitations
- MATH Level 5 regressed by 7.5 pp (77.5% → 70%). Code specialization shifted capacity away from formal multi-step proofs. This is a real trade-off, not noise.
- Inherited behavior: Qwable3.5-9B inherits the biases, knowledge cutoff, and failure modes of both Qwen/Qwen3.5-9B and the commercial teacher model.
- Hallucination: like all LLMs, it can produce fluent but incorrect or fabricated content. Do not use outputs as authoritative without verification.
- Domain scope: optimized for code generation and mathematical reasoning; performance on creative writing, general factual Q&A, or non-EN/FR languages is not characterized.
- Safety: no dedicated safety fine-tuning or red-teaming has been performed beyond the base Qwen3.5-9B alignment.
- Not for: high-stakes medical, legal, or financial decisions without human oversight.
License & attribution
This model is released under the Apache License 2.0.
It is a derivative of Qwen/Qwen3.5-9B,
which is itself licensed under Apache 2.0. The original copyright and the base
model's NOTICE (if any) are retained. You must preserve attribution and the
license text when redistributing.
Distillation note: The teacher model used for distillation is not disclosed. Redistribution of this model assumes the teacher's license and terms permit using its outputs to train and openly release a derivative. If you reuse or further distill this model, verify that assumption for your use case.
Citation
@misc{qwable35_9b,
title = {Qwable3.5-9B},
author = {CRAAAAAAAAAA},
year = {2026},
url = {https://huggingface.co/CRAAAAAAAAAA/Qwable3.5-9B}
}
Base model citation:
@misc{qwen3.5,
title = {{Qwen3.5}: Towards Native Multimodal Agents},
author = {{Qwen Team}},
month = {February},
year = {2026},
url = {https://qwen.ai/blog?id=qwen3.5}
}