Libraries

How to use ytgui/Qwen3.5-Sonnet-9B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ytgui/Qwen3.5-Sonnet-9B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so the output is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("ytgui/Qwen3.5-Sonnet-9B")
model = AutoModelForImageTextToText.from_pretrained("ytgui/Qwen3.5-Sonnet-9B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ytgui/Qwen3.5-Sonnet-9B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ytgui/Qwen3.5-Sonnet-9B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ytgui/Qwen3.5-Sonnet-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
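
The same request can be sent from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on `localhost:8000`; the helper names `build_payload` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL = "ytgui/Qwen3.5-Sonnet-9B"
BASE_URL = "http://localhost:8000/v1"  # assumption: server runs on the default port


def build_payload(prompt: str, image_url: str) -> dict:
    """Build the same chat-completions body as the curl example."""
    return {
        "model": MODEL,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def chat(payload: dict, base_url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server to be running):
# payload = build_payload("Describe this image in one sentence.", "<image URL>")
# print(chat(payload))
```

For the SGLang server, the same code should work with `base_url="http://localhost:30000/v1"`, since both servers expose an OpenAI-compatible endpoint.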

Use Docker

docker model run hf.co/ytgui/Qwen3.5-Sonnet-9B

SGLang

How to use ytgui/Qwen3.5-Sonnet-9B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ytgui/Qwen3.5-Sonnet-9B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ytgui/Qwen3.5-Sonnet-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ytgui/Qwen3.5-Sonnet-9B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ytgui/Qwen3.5-Sonnet-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use ytgui/Qwen3.5-Sonnet-9B with Docker Model Runner:
```
docker model run hf.co/ytgui/Qwen3.5-Sonnet-9B
```

Qwen3.5-Sonnet-9B

Qwen3.5-Sonnet-9B is a distilled, agent-oriented variant of Qwen3.5-9B, post-trained to deliver stronger performance inside coding agents such as OpenCode and Claude Code. The primary objective of this distillation run is to reduce tool-call failures and enable long, uninterrupted agent trajectories on a single consumer-grade GPU.

✨ Highlights

9B parameters, distilled from frontier teachers.
FP8 quantized weights — ~13 GB on disk, fits comfortably on a single 24 GB GPU.
~200K context with KV-cache on a 24 GB GPU (tested on vllm==0.20.2).
Optimized for agentic coding loops: long tool-call chains, file I/O, shell, and code-edit tools.
Recommended GPU: single 24 GB card (RTX 4090, RTX 4000 Blackwell, RTX 4500 Ada, etc.).
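
As a rough sanity check on these numbers, the arithmetic below estimates how much memory is left per token of context once the weights are loaded. It is a back-of-the-envelope sketch, not a measurement: it assumes the 0.95 GPU memory utilization used in the serving command, treats 1 GB as 10^9 bytes, and ignores activations and runtime overhead, so the real budget is smaller.

```python
GPU_MEMORY_GB = 24.0           # single 24 GB card
GPU_MEMORY_UTILIZATION = 0.95  # assumption: same value as in the vLLM command
WEIGHTS_GB = 13.0              # approximate FP8 checkpoint size
CONTEXT_TOKENS = 200_000       # approximate context length quoted above


def kv_budget_per_token_kb(gpu_gb: float, utilization: float,
                           weights_gb: float, tokens: int) -> float:
    """Upper bound on memory available per context token, in KB (1 KB = 1000 bytes)."""
    usable_gb = gpu_gb * utilization
    remaining_gb = usable_gb - weights_gb
    if remaining_gb <= 0:
        raise ValueError("weights do not fit in the usable GPU memory")
    return remaining_gb * 1e9 / tokens / 1e3


budget = kv_budget_per_token_kb(
    GPU_MEMORY_GB, GPU_MEMORY_UTILIZATION, WEIGHTS_GB, CONTEXT_TOKENS
)
print(f"{budget:.0f} KB per token")  # prints: 49 KB per token
```

In other words, roughly 9.8 GB remains after the weights, and a ~200K-token context only fits if the per-token cache footprint stays under about 49 KB.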

📟 Serving with vLLM

# install vllm >= 0.20.2, see: https://vllm.ai/

vllm serve "ytgui/Qwen3.5-Sonnet-9B" \
    --port=8000 \
    --host=localhost \
    --max-model-len='128K' \
    --reasoning-parser=qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser=qwen3_coder \
    --gpu-memory-utilization=0.95
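
With `--enable-auto-tool-choice` set, the server can accept tool definitions in the OpenAI function-calling format. The sketch below shows what such a request body and the response parsing might look like. The `read_file` tool, the helper names, and the sample response are all hypothetical: the response is hand-written to show the expected shape and is not actual model output.

```python
import json

MODEL = "ytgui/Qwen3.5-Sonnet-9B"

# Hypothetical tool definition in the OpenAI function-calling format.
READ_FILE_TOOL = {
    "type": "function",
    "function": {
        "name": "read_file",
        "description": "Read a UTF-8 text file and return its contents.",
        "parameters": {
            "type": "object",
            "properties": {"path": {"type": "string"}},
            "required": ["path"],
        },
    },
}


def build_tool_request(user_prompt: str) -> dict:
    """Build a chat-completions body that offers one tool to the model."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_prompt}],
        "tools": [READ_FILE_TOOL],
        "tool_choice": "auto",
    }


def extract_tool_calls(response: dict) -> list[tuple[str, dict]]:
    """Return (tool name, parsed arguments) pairs from a chat-completions response."""
    message = response["choices"][0]["message"]
    calls = []
    for call in message.get("tool_calls") or []:
        function = call["function"]
        calls.append((function["name"], json.loads(function["arguments"])))
    return calls


# Hand-written example of the response shape (not real model output).
sample_response = {
    "choices": [
        {
            "message": {
                "role": "assistant",
                "content": None,
                "tool_calls": [
                    {
                        "id": "call_0",
                        "type": "function",
                        "function": {
                            "name": "read_file",
                            "arguments": "{\"path\": \"README.md\"}",
                        },
                    }
                ],
            }
        }
    ]
}

print(extract_tool_calls(sample_response))
```

The body from `build_tool_request` would be POSTed to `http://localhost:8000/v1/chat/completions`; an agent loop then runs each extracted call and appends the result as a `tool` message before the next request.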

🗜️ GGUF Model

The GGUF model is available at: 👉 Qwen3.5-Sonnet-9B-GGUF

Multiple quantization levels are provided for use with llama.cpp and compatible runtimes.

🧪 Distillation Recipe

Teacher mixture

The post-training corpus is a curated mixture from multiple frontier teachers, each chosen for what it does best:

| Teacher | Role in the mixture |
| --- | --- |
| `claude-opus-4.6` | General chain-of-thought reasoning |
| `deepseek-v4` | Tool-call traces (tool calls, LLM-as-judge) |
| `minimax-m2.7` | Tool-call traces (multi-tool orchestration) |

Training method

Supervised Fine-Tuning (SFT) on the distilled trajectories.
Offline Reinforcement Learning on preference and outcome-labeled rollouts (successful vs. failed tool calls, completed vs. aborted sessions).

What is trained, what is frozen

To preserve the base model's pretrained knowledge and tokenizer alignment:

Frozen: vision encoder, lm_head, and token embeddings.
Trained: transformer backbone parameters only.
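
One common way to implement such a split in PyTorch is to toggle `requires_grad` by parameter name. The sketch below is an illustration only, not the author's training code: the name fragments `visual`, `lm_head`, and `embed_tokens` are assumptions about how the parameters are named, so inspect `model.named_parameters()` before relying on them.

```python
# Hypothetical name fragments; check model.named_parameters() for the real names.
FROZEN_NAME_FRAGMENTS = ("visual", "lm_head", "embed_tokens")


def is_frozen(param_name: str) -> bool:
    """True if the parameter belongs to a component that should stay frozen."""
    return any(fragment in param_name for fragment in FROZEN_NAME_FRAGMENTS)


def freeze_selected(model) -> tuple[int, int]:
    """Set requires_grad on every parameter; return (frozen, trainable) tensor counts.

    `model` is any object exposing `named_parameters()` as torch.nn.Module does.
    """
    frozen = trainable = 0
    for name, param in model.named_parameters():
        if is_frozen(name):
            param.requires_grad_(False)
            frozen += 1
        else:
            param.requires_grad_(True)
            trainable += 1
    return frozen, trainable
```

After calling `freeze_selected(model)`, only parameters with `requires_grad=True` need to be handed to the optimizer.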

Training framework

A custom training stack built on:

torch
lightning
transformers

The framework supports mixed SFT + offline-RL objectives, gradient checkpointing, and FP8 weight casting at the end of post-training.

🛠️ Agentic Coding — Goals & Behavior

The distillation objective explicitly targets agent reliability, not just benchmark scores:

Fewer malformed tool calls (schema, JSON, argument errors).
Better recovery after a failed tool invocation.
Longer stable trajectories without collapse, repetition, or premature termination.
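
To make the three malformed-call classes concrete, the sketch below classifies a single tool call against a tool definition. It is one possible interpretation, not the evaluation used for this model: the `run_shell` tool, the error labels, and the mapping of "schema error" to an unknown tool name are assumptions, and only `string` and `integer` argument types are handled.

```python
import json

# Hypothetical tool definition used only for this illustration.
SHELL_TOOL = {
    "name": "run_shell",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string"},
            "timeout": {"type": "integer"},
        },
        "required": ["command"],
    },
}


def _matches_type(value, json_type: str) -> bool:
    """Check a value against the two JSON types this sketch supports."""
    if json_type == "string":
        return isinstance(value, str)
    if json_type == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly
        return isinstance(value, int) and not isinstance(value, bool)
    return False


def classify_tool_call(name: str, raw_arguments: str, tool: dict = SHELL_TOOL) -> str:
    """Return "ok", "schema_error", "json_error", or "argument_error"."""
    if name != tool["name"]:
        return "schema_error"  # the call names a tool that was not offered
    try:
        arguments = json.loads(raw_arguments)
    except json.JSONDecodeError:
        return "json_error"  # the arguments string is not valid JSON
    if not isinstance(arguments, dict):
        return "json_error"  # valid JSON, but not an object
    params = tool["parameters"]
    properties = params["properties"]
    missing = [key for key in params.get("required", []) if key not in arguments]
    unknown = [key for key in arguments if key not in properties]
    if missing or unknown:
        return "argument_error"
    for key, value in arguments.items():
        if not _matches_type(value, properties[key]["type"]):
            return "argument_error"
    return "ok"
```

A check like this can be run on each tool call in a logged trajectory to count failures by class.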

Long-running session screenshots

The screenshots below show the model running continuously for up to 10 minutes inside OpenCode and Claude Code without interruptions or tool-call failures.

claude-code session: asked to locate the "multi-head attention implementation" in the PyTorch project

claude-code session: asked to "understand project layout" in the SQLite project

opencode session: asked to "explain terminologies" in the pgvector project

⚠️ Limitations

FP8 weights may show small quality deltas vs. BF16 on edge tasks.
Vision encoder is preserved but not the focus of this post-training; multimodal performance is inherited from the base model.
Distilled behavior reflects the teacher mixture and may exhibit teacher-specific stylistic patterns.

Safetensors

Model size: 9B params
Tensor type: BF16, F8_E4M3

Model tree for ytgui/Qwen3.5-Sonnet-9B

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Quantized (218): this model
Quantizations: 3 models