Qwen3.6-27B-W4A8

W4A8 quantization of Qwen/Qwen3.6-27B: int4 group-128 weights + int8 dynamic per-token activations (GPTQ via llm-compressor scheme="W4A8").
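To make the number format concrete, here is a minimal pure-Python sketch of what "int4 group-128 weights + int8 dynamic per-token activations" means numerically. It assumes symmetric scaling and plain round-to-nearest. The actual checkpoint was produced with GPTQ, which picks the int4 values with error compensation, so this illustrates the storage format rather than the calibration procedure. All function names are hypothetical.

```python
import random

GROUP_SIZE = 128  # from the model card: int4 weights quantized in groups of 128


def quantize_weights_int4(row, group_size=GROUP_SIZE):
    """Symmetric round-to-nearest int4: one scale per group of weights.

    Assumption: symmetric scaling onto [-7, 7]; GPTQ itself is NOT implemented here.
    """
    q, scales = [], []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        q.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q, scales


def quantize_activations_int8(token):
    """Dynamic per-token int8: the scale is computed from this token at run time."""
    max_abs = max(abs(x) for x in token)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    return [max(-128, min(127, round(x / scale))) for x in token], scale


def w4a8_dot(q_w, w_scales, q_x, x_scale, group_size=GROUP_SIZE):
    """Integer multiply-accumulate per group, then rescale to floating point."""
    total = 0.0
    for g, w_scale in enumerate(w_scales):
        lo, hi = g * group_size, (g + 1) * group_size
        acc = sum(a * b for a, b in zip(q_w[lo:hi], q_x[lo:hi]))  # exact integer sum
        total += acc * w_scale
    return total * x_scale


if __name__ == "__main__":
    random.seed(0)
    weights = [random.uniform(-1, 1) for _ in range(256)]
    token = [random.uniform(-1, 1) for _ in range(256)]
    q_w, w_scales = quantize_weights_int4(weights)
    q_x, x_scale = quantize_activations_int8(token)
    exact = sum(w * x for w, x in zip(weights, token))
    print("float dot:", exact, "w4a8 dot:", w4a8_dot(q_w, w_scales, q_x, x_scale))
```

The weight scales are fixed when the checkpoint is created, while the activation scale is recomputed for every token, which is what "dynamic per-token" refers to.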

Why W4A8

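int4 weights cut the memory bandwidth needed per decoded token (fast decode), and int8 activations let the matmuls run on int8 tensor cores (fast prefill). That combination makes W4A8 the best serving quant on the NVIDIA Ampere line (A100 / RTX 3090).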
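The bandwidth argument can be checked with a little arithmetic. The sketch below estimates average storage per weight for BF16 versus int4 with group size 128. It assumes one 16-bit scale per group, no zero points, and that every weight is quantized; in practice some layers usually stay in higher precision, so treat the ratio as an upper bound. The helper name is hypothetical.

```python
def bytes_per_weight(bits, group_size=None, scale_bits=16):
    """Average storage per weight, including one scale per group when grouped.

    Assumptions: one scale of `scale_bits` bits per group, no zero points.
    """
    payload = bits / 8
    overhead = (scale_bits / 8) / group_size if group_size else 0.0
    return payload + overhead


bf16 = bytes_per_weight(16)
int4_g128 = bytes_per_weight(4, group_size=128)
reduction = bf16 / int4_g128

print(f"BF16: {bf16} bytes/weight")
print(f"int4 g128: {int4_g128} bytes/weight")
print(f"weight traffic reduction vs BF16: {reduction:.2f}x")
```

Decode reads the weights once per generated token, so at low batch sizes the weight traffic per token largely determines decode speed.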

Serving on Ampere (RTX 3090 / A100)

vLLM gates its W4A8 kernels to Hopper, where this model runs out of the box. On Ampere the Marlin kernel can run W4A8-int8, but it needs a small enablement patch: use vllm-ampere-optimized (prebuilt wheel + Docker image, or the standalone hot-patch).
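As a rough pre-flight check, the sketch below maps a CUDA compute capability to the two cases described above. The function names are hypothetical, and the mapping only encodes what this card states (Hopper works as-is, Ampere needs the patch); other architectures are reported as not covered rather than guessed. The optional helper uses PyTorch's `torch.cuda.get_device_capability` if PyTorch is installed.

```python
AMPERE = {(8, 0), (8, 6), (8, 7)}  # e.g. A100 is 8.0, RTX 3090 is 8.6
HOPPER = {(9, 0)}                  # e.g. H100


def w4a8_status(major, minor):
    """Classify a compute capability according to this model card's guidance."""
    capability = (major, minor)
    if capability in HOPPER:
        return "hopper: runs out of the box"
    if capability in AMPERE:
        return "ampere: needs the enablement patch"
    return "not covered by this model card"


def local_gpu_status(device=0):
    """Optional helper: query the local GPU through PyTorch, if it is available."""
    try:
        import torch
    except ImportError:
        return "torch not installed"
    if not torch.cuda.is_available():
        return "no CUDA device visible"
    return w4a8_status(*torch.cuda.get_device_capability(device))


if __name__ == "__main__":
    print(local_gpu_status())
```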

Throughput (2× RTX 3090, vLLM tp2, 1024-in / 1024-out)

| concurrency | output tok/s | median TTFT | median TPOT |
|---|---|---|---|
| 1 (single-user) | 46.8 | 0.84 s | 19.8 ms |
| 32 (saturated) | 416 | 14.4 s | 63.6 ms |

Peak VRAM is ~22.8 GiB per card. A single user gets ~47 tok/s with sub-second TTFT; aggregate throughput saturates at ~416 tok/s at concurrency 32.
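The three reported metrics can be cross-checked against each other. The sketch below estimates output throughput from median TTFT and TPOT, assuming every request emits all 1024 output tokens, all requests run concurrently for the whole benchmark, and the medians stand in for a typical request. It is an approximation for sanity-checking, not the benchmark's own calculation, and the function name is hypothetical.

```python
def estimated_output_tps(concurrency, ttft_s, tpot_s, output_tokens=1024):
    """Approximate aggregate output tokens/s from per-request latency metrics."""
    # Each request waits TTFT for its first token, then one TPOT per remaining token.
    request_latency_s = ttft_s + (output_tokens - 1) * tpot_s
    return concurrency * output_tokens / request_latency_s


single = estimated_output_tps(1, ttft_s=0.84, tpot_s=0.0198)
saturated = estimated_output_tps(32, ttft_s=14.4, tpot_s=0.0636)

print(f"single-user estimate: {single:.1f} tok/s (reported 46.8)")
print(f"concurrency-32 estimate: {saturated:.1f} tok/s (reported 416)")
```

Both estimates land within a few percent of the reported figures, which suggests the TTFT, TPOT, and throughput numbers are mutually consistent.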

Safetensors. Model size: 28B params. Tensor types: BF16, I8.

Model tree for Avesed/Qwen3.6-27B-W4A8

Base model: Qwen/Qwen3.6-27B
Quantized (487): this model