Qwen3.5-9B-FP8-Dynamic (Vision-Preserved)

This is an FP8 Dynamic quantized version of the Qwen/Qwen3.5-9B Vision-Language model. It was quantized using llmcompressor with strict layer preservation, designed specifically to maintain 100% of the native vision accuracy while cutting VRAM requirements in half.

Primary Focus: Vision Accuracy Preservation

Vision-Language Models (VLMs) are highly sensitive to quantization in their visual perception components. Quantizing the vision encoder typically degrades performance in spatial recognition, OCR, object counting, and visual grid analysis.

To solve this, this model uses a mixed-precision quantization recipe:

  • 🎯 Unquantized Vision Tower: All visual transformer layers, vision projections, and linear attention modules are entirely bypassed and kept in native float16 precision. Visual feature extraction quality remains identical to the original unquantized model.
  • 💾 Quantized Language layers: Only standard linear projections in the language model are compressed to FP8 using dynamic activation scaling and static weight scaling.

This combination yields the best of both worlds: native vision accuracy at half the memory footprint.

Key Benefits

  • 💾 VRAM Savings: Cuts active VRAM footprint from ~18 GB (BF16) down to ~9.5 GB, allowing it to fit easily on standard 12GB/16GB VRAM GPUs.
  • 🎯 Zero Visual Accuracy Loss: Retains the exact native coordinates, bounding box capabilities, grid reading, and visual OCR precision of the original Qwen/Qwen3.5-9B model.
  • Hardware Acceleration: Faster inference on NVIDIA Ada Lovelace, Hopper, and Blackwell Tensor Cores (e.g., RTX 40-series, L4, A100, H100) using FP8 operations.

Quantization Methodology

Quantization was performed via the one-shot method in llmcompressor with a Dynamic FP8 Activation scaling and Static FP8 Weight scaling scheme.

The following components were explicitly ignored/exempted from quantization to guarantee vision performance:

  1. Vision Encoder (re:.*visual.*): Keeps the entire image-processing pipeline in float16.
  2. Language Model Head (lm_head): Mapped to native precision to preserve textual coherence.
  3. Linear Attention Blocks (re:.*linear_attn.*): Preserved in native precision.

Quantization Recipe Used:

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from transformers import AutoModelForImageTextToText

recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["lm_head", "re:.*visual.*", "re:.*linear_attn.*"]
)
oneshot(model=model, recipe=recipe)

How to Load and Use

from transformers import AutoModelForImageTextToText, AutoProcessor
import torch

model_id = "YOUR_HF_USERNAME/Qwen3.5-9B-FP8-Dynamic"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.float16,
    trust_remote_code=True
)

Primary Use Cases

  • VRAM-constrained deployments where visual analysis accuracy is critical (e.g., edge surveillance, object counting, OCR, and automated grid-labeling).
  • Low-latency batch analysis on affordable single-GPU servers.
Downloads last month
107
Safetensors
Model size
9B params
Tensor type
F16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for ishaqinu/Qwen3.5-9B-FP8-Dynamic

Finetuned
Qwen/Qwen3.5-9B
Quantized
(315)
this model