Instructions to use Qwen/Qwen3.5-9B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Qwen/Qwen3.5-9B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="Qwen/Qwen3.5-9B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-9B")
model = AutoModelForMultimodalLM.from_pretrained("Qwen/Qwen3.5-9B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use Qwen/Qwen3.5-9B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Qwen/Qwen3.5-9B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/Qwen/Qwen3.5-9B

SGLang

How to use Qwen/Qwen3.5-9B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Qwen/Qwen3.5-9B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Qwen/Qwen3.5-9B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Qwen/Qwen3.5-9B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use Qwen/Qwen3.5-9B with Docker Model Runner:
```
docker model run hf.co/Qwen/Qwen3.5-9B
```

[Benchmark] 1x RTX 5090 + Qwen3.5 9B BF16 — 1280 tok/s peak, then TTFT goes from 0.7s to 18s, ShareGPT, concurrency 16–128

#58

by hexgridcloud - opened 6 days ago

We benchmarked Qwen3.5-9B (BF16) with our custom harness on a single RTX 5090, using the real-world ShareGPT dataset.

TL;DR:

Found a clean ceiling: throughput climbs nicely up to concurrency 64 (~1280 tok/s output) and then just... stops. Concurrency 128 gives basically the same throughput but nearly doubles end-to-end latency (14.6s → 27.0s p95) and triples time-to-first-token (5.7s → 17.9s p95).

So past concurrency 64, the GPU is not getting more work done; it's just making requests wait longer in the queue.
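
One way to sanity-check the "requests are just waiting" reading is Little's law (in-flight requests = completion rate x time in system). This is a minimal sketch. It assumes tokens per request are roughly the same at both levels (same prompts, same 256-token cap), so tok/s is proportional to requests/s. It also uses the E2E p95 values from the Metrics table further down as a proxy for the mean, which the post does not report.

```python
# Little's law: N = X * W, so W = N / X.
# If throughput X stays flat while in-flight requests N double,
# the time in system W should roughly double.

def predicted_latency_ratio(n_low, n_high, tput_low, tput_high):
    """Ratio W_high / W_low implied by Little's law.

    Assumes throughput in tok/s is proportional to requests/s,
    i.e. tokens per request do not change between the two levels.
    """
    return (n_high / tput_high) / (n_low / tput_low)

# Numbers from the Metrics table (concurrency 64 vs 128).
predicted = predicted_latency_ratio(64, 128, 1279.2, 1253.3)
observed = 27.01 / 14.59  # E2E p95 ratio; p95 is only a proxy for the mean

print(f"predicted E2E ratio:     {predicted:.2f}")
print(f"observed E2E p95 ratio:  {observed:.2f}")
```

The predicted ratio comes out at about 2.04 and the observed p95 ratio at about 1.85. Both are close to 2, which fits a saturated server where extra concurrency turns into queueing time instead of throughput.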

Model

Model: Qwen/Qwen3.5-9B
HF Path: Qwen/Qwen3.5-9B
Quantization / dtype: BF16
Context length configured: 4098 tokens (max_model_len)

Serving

Engine: vLLM
CUDA: 13.0.1
Engine flags: {'enable_auto_tool_choice': True, 'exclude_tools_when_tool_choice_none': True, 'tool_call_parser': 'qwen3_coder', 'dtype': 'bfloat16', 'max_model_len': 4098, 'served_model_name': ['Qwen/Qwen3.5-9B'], 'generation_config': 'vllm', 'gpu_memory_utilization': 0.9, 'enable_prefix_caching': True, 'language_model_only': True, 'max_num_batched_tokens': 4096, 'enable_chunked_prefill': True}
Endpoint: /v1/chat/completions
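
For anyone trying to reproduce the setup, the flag dictionary above can be turned into a `vllm serve` command line. This is a minimal sketch, not the authors' launch script. It assumes each key maps to a CLI option by swapping underscores for hyphens, which is the usual vLLM convention but is not confirmed in the post for every key. `flags_to_cli` is a hypothetical helper, and the result should be checked against `vllm serve --help` for the installed version.

```python
import shlex

# Engine flags copied from the post.
ENGINE_FLAGS = {
    "enable_auto_tool_choice": True,
    "exclude_tools_when_tool_choice_none": True,
    "tool_call_parser": "qwen3_coder",
    "dtype": "bfloat16",
    "max_model_len": 4098,
    "served_model_name": ["Qwen/Qwen3.5-9B"],
    "generation_config": "vllm",
    "gpu_memory_utilization": 0.9,
    "enable_prefix_caching": True,
    "language_model_only": True,
    "max_num_batched_tokens": 4096,
    "enable_chunked_prefill": True,
}

def flags_to_cli(model, flags):
    """Build an argv list for `vllm serve` from a flag dictionary.

    Assumption: each key maps to `--key-with-hyphens`. True becomes a bare
    switch, False/None is skipped, and lists become space-separated values.
    """
    argv = ["vllm", "serve", model]
    for key, value in flags.items():
        option = "--" + key.replace("_", "-")
        if value is True:
            argv.append(option)
        elif value is False or value is None:
            continue  # simplification: negated switches are not emitted
        elif isinstance(value, (list, tuple)):
            argv.append(option)
            argv.extend(str(v) for v in value)
        else:
            argv.extend([option, str(value)])
    return argv

argv = flags_to_cli("Qwen/Qwen3.5-9B", ENGINE_FLAGS)
print(shlex.join(argv))
```

All boolean flags in the post are `True`, so the simplification of skipping `False` values does not matter here.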

Hardware

GPU: 1x RTX 5090
VRAM: 32GB
CPU: 48 vCPU | 177 GB RAM

Workload

Dataset: ShareGPT sample, 1080 unique prompts x 4 concurrency levels = 4320 requests in total
Conversation shape: Multi-turn response per request
Languages: Multilingual with en/zh/ru/th/ko/fr/pl/ja
max_model_len: 4098
max_tokens per completion: 256
temperature: 0.2
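
The workload settings above translate into a request body along these lines. This is a sketch of what one request to `/v1/chat/completions` might look like under the stated settings, not the harness's actual code. `build_request` is a hypothetical helper and the example conversation is made up.

```python
import json

def build_request(messages, model="Qwen/Qwen3.5-9B"):
    """Chat-completions body using the workload settings from the post."""
    return {
        "model": model,
        "messages": messages,   # multi-turn history from a ShareGPT sample
        "max_tokens": 256,      # per-completion cap
        "temperature": 0.2,
        "stream": True,         # the post reports streaming as ON
    }

# Made-up multi-turn conversation standing in for a ShareGPT prompt.
example = [
    {"role": "user", "content": "What is chunked prefill?"},
    {"role": "assistant", "content": "It splits long prompts into chunks."},
    {"role": "user", "content": "Why does that help latency?"},
]

body = build_request(example)
print(json.dumps(body, indent=2))
```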

Methodology

Load tool: Custom Harness (currently building but will be public soon)
Concurrency levels (in-flight requests): 16, 32, 64, 128
Streaming: ON
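
Since the harness is not public yet, here is a minimal sketch of how TTFT and E2E latency are commonly measured under a fixed concurrency cap. It is not the authors' tool. `fake_stream` is a made-up stand-in for a streaming `/v1/chat/completions` call, and the nearest-rank p95 is only one of several percentile definitions.

```python
import asyncio
import math
import time

async def fake_stream(n_tokens=5, first_delay=0.02, token_delay=0.005):
    """Stand-in for a streaming response: yields tokens with delays."""
    await asyncio.sleep(first_delay)          # queueing + prefill time
    for i in range(n_tokens):
        if i:
            await asyncio.sleep(token_delay)  # decode time per token
        yield f"tok{i}"

async def timed_request(make_stream, sem):
    """Return (ttft, e2e, n_tokens) for one request."""
    async with sem:                           # cap on in-flight requests
        start = time.perf_counter()           # clock starts when request is sent
        ttft = None
        count = 0
        async for _ in make_stream():
            if ttft is None:
                ttft = time.perf_counter() - start
            count += 1
        return ttft, time.perf_counter() - start, count

def p95(values):
    """Nearest-rank 95th percentile."""
    ordered = sorted(values)
    rank = math.ceil(0.95 * len(ordered))
    return ordered[rank - 1]

async def run_level(make_stream, n_requests, concurrency):
    """Run n_requests with at most `concurrency` in flight; summarize."""
    sem = asyncio.Semaphore(concurrency)
    wall_start = time.perf_counter()
    results = await asyncio.gather(
        *(timed_request(make_stream, sem) for _ in range(n_requests))
    )
    wall = time.perf_counter() - wall_start
    ttfts, e2es, counts = zip(*results)
    return {
        "output_tok_s": sum(counts) / wall,
        "ttft_p95": p95(ttfts),
        "e2e_p95": p95(e2es),
    }

stats = asyncio.run(run_level(fake_stream, n_requests=32, concurrency=8))
print(stats)
```

In this sketch the clock starts once a concurrency slot is free, so TTFT covers server-side queueing and prefill but not time spent waiting for a client-side slot. Whether the real harness does the same is not stated in the post.
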

Metrics

| Concurrency | Requests | Output tok/s | E2E p95 | TTFT p95 |
|---|---|---|---|---|
| 16 | 1080 | 444.4 | 7.48s | 0.70s |
| 32 | 1080 | 999.9 | 8.55s | 0.99s |
| 64 | 1080 | 1279.2 | 14.59s | 5.68s |
| 128 | 1080 | 1253.3 | 27.01s | 17.92s |

Some charts:

In the charts, the benchmark starts at 09:41 and stops at 10:01. It ran at concurrency 16 first, then 32, 64, and 128, and performance flattened out after 64.
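
The flattening can be read straight off the Metrics table by dividing output throughput by concurrency and by comparing consecutive levels. A small sketch using only the reported numbers (the helper names are made up for illustration):

```python
# (concurrency, output tok/s) pairs from the Metrics table.
LEVELS = [(16, 444.4), (32, 999.9), (64, 1279.2), (128, 1253.3)]

def per_stream(levels):
    """Output tok/s per in-flight request at each level."""
    return {c: tput / c for c, tput in levels}

def step_gain(levels):
    """Throughput ratio between consecutive concurrency levels."""
    return {hi[0]: hi[1] / lo[1] for lo, hi in zip(levels, levels[1:])}

for c, v in per_stream(LEVELS).items():
    print(f"concurrency {c:>3}: {v:5.1f} tok/s per stream")
for c, g in step_gain(LEVELS).items():
    print(f"up to {c:>3}: x{g:.2f} throughput")
```

Per-stream throughput is about 27.8, 31.2, 20.0, and 9.8 tok/s at concurrency 16, 32, 64, and 128. Going from 64 to 128 gives x0.98, i.e. no gain. The x2.25 step from 16 to 32 is more than linear, which may point to warm-up effects in the first run.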

Has anybody here achieved higher output throughput with this model on similar hardware? Constructive criticism of our deployment is welcome.
