Instructions to use ishaqinu/Qwen3.5-9B-FP8-Dynamic with libraries and local apps.

Libraries

How to use ishaqinu/Qwen3.5-9B-FP8-Dynamic with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ishaqinu/Qwen3.5-9B-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
# Print the result so it is visible when run as a script
print(pipe(text=messages))

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("ishaqinu/Qwen3.5-9B-FP8-Dynamic")
model = AutoModelForMultimodalLM.from_pretrained("ishaqinu/Qwen3.5-9B-FP8-Dynamic")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use ishaqinu/Qwen3.5-9B-FP8-Dynamic with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "ishaqinu/Qwen3.5-9B-FP8-Dynamic"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ishaqinu/Qwen3.5-9B-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/ishaqinu/Qwen3.5-9B-FP8-Dynamic

SGLang

How to use ishaqinu/Qwen3.5-9B-FP8-Dynamic with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "ishaqinu/Qwen3.5-9B-FP8-Dynamic" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "ishaqinu/Qwen3.5-9B-FP8-Dynamic",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "ishaqinu/Qwen3.5-9B-FP8-Dynamic" \
        --host 0.0.0.0 \
        --port 30000
# Call the server with the same curl request as in the pip-based setup above.

Docker Model Runner
How to use ishaqinu/Qwen3.5-9B-FP8-Dynamic with Docker Model Runner:
```
docker model run hf.co/ishaqinu/Qwen3.5-9B-FP8-Dynamic
```

Qwen3.5-9B-FP8-Dynamic (Vision-Preserved)

This is an FP8 dynamic quantization of the Qwen/Qwen3.5-9B vision-language model, produced with llmcompressor. The vision components are deliberately left unquantized so that visual feature extraction matches the original model, while the weight memory footprint is cut roughly in half.

Primary Focus: Vision Accuracy Preservation

Vision-Language Models (VLMs) are highly sensitive to quantization in their visual perception components. Quantizing the vision encoder typically degrades performance in spatial recognition, OCR, object counting, and visual grid analysis.

To solve this, this model uses a mixed-precision quantization recipe:

🎯 Unquantized Vision Tower: All visual transformer layers, vision projections, and linear attention modules are entirely bypassed and kept in native float16 precision. Visual feature extraction quality remains identical to the original unquantized model.
💾 Quantized Language layers: Only standard linear projections in the language model are compressed to FP8 using dynamic activation scaling and static weight scaling.

This combination yields the best of both worlds: native vision accuracy at half the memory footprint.

Key Benefits

💾 VRAM Savings: Cuts the weight footprint from ~18 GB (BF16) to ~9.5 GB, so the weights fit on common 12 GB/16 GB GPUs, with the remaining VRAM left for activations and the KV cache.
🎯 Zero Visual Accuracy Loss: Retains the exact native coordinates, bounding box capabilities, grid reading, and visual OCR precision of the original Qwen/Qwen3.5-9B model.
⚡ Hardware Acceleration: Faster inference using native FP8 operations on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, H100). Ampere GPUs such as the A100 have no FP8 Tensor Cores.
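The memory and hardware figures above can be sanity-checked with simple arithmetic. The sketch below is a rough estimate only: it counts weight bytes and ignores quantization scales, activations, and the KV cache. The 95% FP8 fraction is a hypothetical value chosen for illustration, not a measured property of this checkpoint, and both function names are made up for this example.

```python
def weight_memory_gb(num_params: float, fp8_fraction: float) -> float:
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    FP8 weights take 1 byte each; 16-bit weights take 2 bytes each.
    Scales, activations, and the KV cache are ignored.
    """
    fp8_bytes = num_params * fp8_fraction * 1
    fp16_bytes = num_params * (1.0 - fp8_fraction) * 2
    return (fp8_bytes + fp16_bytes) / 1e9


def has_fp8_tensor_cores(major: int, minor: int) -> bool:
    """True for CUDA compute capability 8.9 (Ada Lovelace) or newer."""
    return (major, minor) >= (8, 9)


NUM_PARAMS = 9e9  # nominal 9B parameter count stated in this card

print(weight_memory_gb(NUM_PARAMS, 0.0))   # everything in 16-bit: 18.0
print(weight_memory_gb(NUM_PARAMS, 1.0))   # everything in FP8 (lower bound): 9.0
print(weight_memory_gb(NUM_PARAMS, 0.95))  # hypothetical 95% of weights in FP8

print(has_fp8_tensor_cores(8, 9))  # Ada Lovelace (RTX 40-series, L4)
print(has_fp8_tensor_cores(9, 0))  # Hopper (H100)
print(has_fp8_tensor_cores(8, 0))  # Ampere: no FP8 Tensor Cores
```

On a machine with PyTorch and a CUDA GPU, the pair returned by `torch.cuda.get_device_capability()` can be passed straight to `has_fp8_tensor_cores`.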

Quantization Methodology

Quantization was performed with the one-shot method in llmcompressor using the FP8_DYNAMIC scheme: weight scales are static (computed once at quantization time), while activation scales are dynamic (computed at inference time).
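To make the static/dynamic distinction concrete, here is a minimal pure-Python sketch of how the two kinds of scales could be derived. It is not llmcompressor's implementation: the per-channel (weights) and per-token (activations) granularity is an assumption about the FP8_DYNAMIC preset, the helper names are hypothetical, the toy values are illustrative, and the final rounding to FP8 is omitted.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def absmax_scale(values):
    """Scale that maps the largest magnitude in `values` onto the FP8 range."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def static_weight_scales(weight):
    """One scale per output channel (row), computed once at quantization time."""
    return [absmax_scale(row) for row in weight]


def dynamic_activation_scales(activations):
    """One scale per token (row), recomputed for each batch at inference time."""
    return [absmax_scale(token) for token in activations]


# Toy 2x3 weight matrix (illustrative values only).
weight = [[0.5, -2.0, 1.0], [4.48, 0.1, -0.2]]
w_scales = static_weight_scales(weight)  # fixed; stored with the checkpoint

# Two toy activation vectors, standing in for two tokens in a batch.
batch = [[10.0, -44.8, 3.0], [0.2, 0.1, -0.4]]
a_scales = dynamic_activation_scales(batch)  # depends on the incoming data

# Dividing by the scale brings each row inside [-448, 448] before FP8 rounding.
scaled = [[v / s for v in row] for row, s in zip(weight, w_scales)]

print(w_scales)
print(a_scales)
print(scaled)
```

Because the activation scales are derived from the data at run time, this style of quantization does not need a calibration dataset.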

The following components were explicitly ignored/exempted from quantization to guarantee vision performance:

Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
Language Model Head (lm_head): Kept in native precision to preserve textual coherence.
Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.
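The effect of these three ignore entries can be illustrated with a small matcher. This is a sketch of how the patterns read (entries prefixed with `re:` are regular expressions, other entries are exact names), not llmcompressor's actual matching code. The module names below are hypothetical and may not match the real layer names in the checkpoint.

```python
import re

# Ignore list as given in this card.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if the module name matches any ignore entry."""
    for entry in ignore:
        if entry.startswith("re:"):
            # Regex entry: strip the "re:" prefix and match against the name.
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            # Plain entry: exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "lm_head",
]

for name in names:
    status = "left unquantized" if is_ignored(name) else "quantized to FP8"
    print(f"{name}: {status}")
```

Running a check like this over `model.named_modules()` before quantizing is one way to confirm that the intended layers are exempted.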

Quantization Recipe Used:

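import torch
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

# Load the base model; float16 is assumed from the precision reported in this card.
model = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-9B", torch_dtype=torch.float16
)

# Quantize every Linear layer to FP8 except the exempted modules.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"],
)
# FP8_DYNAMIC needs no calibration data, so no dataset is passed.
oneshot(model=model, recipe=recipe)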

How to Load and Use

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model_id = "ishaqinu/Qwen3.5-9B-FP8-Dynamic"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

Primary Use Cases

VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
Low-latency batch analysis on affordable single-GPU servers.

Safetensors

Model size: 9B params
Tensor types: F16, F8_E4M3

Model tree for ishaqinu/Qwen3.5-9B-FP8-Dynamic

Base model: Qwen/Qwen3.5-9B-Base
Finetuned: Qwen/Qwen3.5-9B
Quantized (315): this model