Qwen3.5-9B-FP8-Dynamic (Vision-Preserved)
This is an FP8 Dynamic quantized version of the Qwen/Qwen3.5-9B Vision-Language model. It was quantized using llmcompressor with strict layer preservation, designed specifically to maintain 100% of the native vision accuracy while cutting VRAM requirements in half.
Primary Focus: Vision Accuracy Preservation
Vision-Language Models (VLMs) are highly sensitive to quantization in their visual perception components. Quantizing the vision encoder typically degrades performance in spatial recognition, OCR, object counting, and visual grid analysis.
To solve this, this model uses a mixed-precision quantization recipe:
- 🎯 Unquantized Vision Tower: All visual transformer layers, vision projections, and linear attention modules are entirely bypassed and kept in native float16 precision. Visual feature extraction quality remains identical to the original unquantized model.
- 💾 Quantized Language Layers: Only the standard linear projections in the language model are compressed to FP8, with static weight scales (fixed at quantization time) and dynamic activation scales (computed at runtime).
This combination yields the best of both worlds: native vision accuracy at half the memory footprint.
Key Benefits
- 💾 VRAM Savings: Cuts active VRAM footprint from ~18 GB (BF16) down to ~9.5 GB, allowing it to fit easily on standard 12GB/16GB VRAM GPUs.
- 🎯 Zero Visual Accuracy Loss: Retains the exact native coordinates, bounding box capabilities, grid reading, and visual OCR precision of the original Qwen/Qwen3.5-9B model.
- ⚡ Hardware Acceleration: Faster inference using FP8 operations on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, H100).
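The VRAM figures above follow from simple bytes-per-parameter arithmetic. The sketch below is a rough estimate of weight memory only (no KV cache, activations, or scale tensors). The function name and the 94% quantized fraction are illustrative assumptions, not measurements taken from this model.

```python
def weight_memory_gb(n_params: float, fp8_fraction: float) -> float:
    """Estimate weight memory in GB (1 GB = 1e9 bytes).

    fp8_fraction is the share of parameters stored in FP8 (1 byte each);
    the remainder is assumed to stay in a 16-bit format (2 bytes each).
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    bytes_total = n_params * (fp8_fraction * 1 + (1.0 - fp8_fraction) * 2)
    return bytes_total / 1e9


n_params = 9e9  # nominal parameter count implied by the "9B" in the model name
print(weight_memory_gb(n_params, 0.0))   # everything kept in 16-bit
print(weight_memory_gb(n_params, 1.0))   # everything stored in FP8
print(weight_memory_gb(n_params, 0.94))  # hypothetical: 94% of parameters in FP8
```

Because the vision tower, lm_head, and linear attention blocks stay in 16-bit, the real footprint sits somewhat above the all-FP8 lower bound.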
Quantization Methodology
Quantization was performed with the one-shot method in llmcompressor using the FP8_DYNAMIC scheme: static per-channel FP8 weight scales and dynamic per-token FP8 activation scales, so no calibration data is required.
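To make the difference between static and dynamic scaling concrete, here is a minimal pure-Python sketch of how the two kinds of scale might be computed. It assumes one weight scale per output channel and one activation scale per token, uses 448 (the largest finite value of the FP8 E4M3 format) as the target range, and skips the actual FP8 rounding. The helper names are hypothetical, and this is not llmcompressor's implementation.

```python
FP8_E4M3_MAX = 448.0  # largest finite value representable in FP8 E4M3


def row_scales(matrix):
    """Return one scale per row so that max(|row|) maps to FP8_E4M3_MAX."""
    scales = []
    for row in matrix:
        amax = max(abs(v) for v in row)
        # An all-zero row gets a neutral scale to avoid dividing by zero later.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(matrix, scales):
    """Divide each row by its scale; a real kernel would then round to FP8."""
    return [[v / s for v in row] for row, s in zip(matrix, scales)]


# Static weight scales: computed once at quantization time and stored with
# the checkpoint (one scale per output channel, i.e. per row here).
weights = [[0.5, -2.0, 1.0], [0.25, 0.125, -0.5]]
weight_scales = row_scales(weights)

# Dynamic activation scales: recomputed at inference time for every token
# (row), so nothing has to be calibrated in advance.
activations = [[3.0, -1.5, 0.75], [10.0, 20.0, -40.0]]
activation_scales = row_scales(activations)
scaled_activations = scale_rows(activations, activation_scales)
```

The second activation row has a much larger magnitude than the first, which is the situation where a per-token scale chosen at runtime helps compared with a single fixed scale.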
The following components were explicitly ignored/exempted from quantization to guarantee vision performance:
- Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
- Language Model Head (lm_head): Mapped to native precision to preserve textual coherence.
- Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.
Quantization Recipe Used:
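from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

# Load the original checkpoint to be quantized.
# (The loading arguments are an assumption; the original recipe omitted this step.)
model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3.5-9B", torch_dtype="auto")

# Quantize every Linear layer to FP8 except the modules listed in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"],
)
oneshot(model=model, recipe=recipe)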
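The ignore list mixes a plain name (lm_head) with regular expressions marked by the re: prefix. The following is a simplified sketch of how such patterns could select modules. The module names are hypothetical examples rather than names read from this checkpoint, and the matching helper is an illustration, not llmcompressor's own code.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if module_name matches any ignore pattern.

    Patterns starting with "re:" are treated as regular expressions;
    anything else must equal the module name exactly (an assumption
    of this sketch).
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
candidates = [
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.linear_attn.in_proj",
    "model.language_model.layers.3.mlp.down_proj",
    "lm_head",
]
for name in candidates:
    status = "kept in native precision" if is_ignored(name) else "quantized to FP8"
    print(f"{name} -> {status}")
```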
How to Load and Use
from transformers import AutoModelForImageTextToText, AutoProcessor
import torch
model_id = "ishaqinu/Qwen3.5-9B-FP8-Dynamic"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id,
device_map="auto",
torch_dtype=torch.float16,
trust_remote_code=True
)
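Whether the FP8 layers run on native FP8 Tensor Cores depends on the GPU generation. As a rough sketch, the check below compares a CUDA compute capability against 8.9, the capability of Ada Lovelace. The helper name is hypothetical; on a machine with PyTorch installed, the tuple could be obtained from torch.cuda.get_device_capability().

```python
def has_native_fp8(capability) -> bool:
    """Return True for compute capability 8.9 (Ada Lovelace) or newer."""
    return tuple(capability) >= (8, 9)


# Compute capabilities of a few well-known GPUs.
examples = {
    "A100 (Ampere)": (8, 0),
    "RTX 4090 (Ada Lovelace)": (8, 9),
    "L4 (Ada Lovelace)": (8, 9),
    "H100 (Hopper)": (9, 0),
}
for gpu, capability in examples.items():
    print(f"{gpu}: native FP8 = {has_native_fp8(capability)}")
```

On older GPUs the model may still load, but whether and how fast the FP8 layers run there depends on the inference backend.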
Primary Use Cases
- VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
- Low-latency batch analysis on affordable single-GPU servers.