Qwen3.5-4B — FP8 (W8A8) Quantized
FP8 dynamic quantization of Qwen/Qwen3.5-4B using llmcompressor.
- Vision encoder: kept at BF16 (unquantized)
- Language model: W8A8 FP8_DYNAMIC (weights + activations)
- Format: compressed-tensors (llmcompressor)
- Size: ~5.2 GB vs ~8 GB BF16 — 1.5x smaller
Quantization details
- Tool: llmcompressor v0.11.0
- Method: QuantizationModifier, scheme=FP8_DYNAMIC
- Hardware: NVIDIA B200
ScreenSpot-v2 Benchmark
Evaluated on HongxinLi/ScreenSpot_v2 (1,272 samples).
| Source | BF16 | FP8 | Delta |
|---|---|---|---|
| android | 73.9% | 69.2% | -4.7% |
| ios | 78.2% | 67.6% | -10.5% |
| forum | 45.6% | 31.6% | -13.9% |
| gitlab | 53.4% | 46.6% | -6.8% |
| macos | 53.1% | 53.7% | +0.7% |
| windows | 47.2% | 49.1% | +1.9% |
| shop | 43.3% | 39.6% | -3.7% |
| tool | 35.8% | 30.8% | -5.0% |
| Overall | 56.5% | 51.6% | -4.9% |
Usage
from vllm import LLM
llm = LLM(
model="Shashwat42/Qwen3.5-4B-FP8",
quantization="compressed-tensors",
dtype="bfloat16",
)
Or serve:
vllm serve Shashwat42/Qwen3.5-4B-FP8 \
--quantization compressed-tensors \
--dtype bfloat16
- Downloads last month
- 90
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support