Qwen3.5-4B — FP8 (W8A8) Quantized

FP8 dynamic quantization of Qwen/Qwen3.5-4B using llmcompressor.

  • Vision encoder: kept at BF16 (unquantized)
  • Language model: W8A8 FP8_DYNAMIC (weights + activations)
  • Format: compressed-tensors (llmcompressor)
  • Size: ~5.2 GB vs ~8 GB BF16 — 1.5x smaller

Quantization details

  • Tool: llmcompressor v0.11.0
  • Method: QuantizationModifier, scheme=FP8_DYNAMIC
  • Hardware: NVIDIA B200

ScreenSpot-v2 Benchmark

Evaluated on HongxinLi/ScreenSpot_v2 (1,272 samples).

Source BF16 FP8 Delta
android 73.9% 69.2% -4.7%
ios 78.2% 67.6% -10.5%
forum 45.6% 31.6% -13.9%
gitlab 53.4% 46.6% -6.8%
macos 53.1% 53.7% +0.7%
windows 47.2% 49.1% +1.9%
shop 43.3% 39.6% -3.7%
tool 35.8% 30.8% -5.0%
Overall 56.5% 51.6% -4.9%

Usage

from vllm import LLM
llm = LLM(
    model="Shashwat42/Qwen3.5-4B-FP8",
    quantization="compressed-tensors",
    dtype="bfloat16",
)

Or serve:

vllm serve Shashwat42/Qwen3.5-4B-FP8 \
  --quantization compressed-tensors \
  --dtype bfloat16
Downloads last month
90
Safetensors
Model size
5B params
Tensor type
BF16
·
F8_E4M3
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for Shashwat42/Qwen3.5-4B-FP8

Finetuned
Qwen/Qwen3.5-4B
Quantized
(263)
this model