Qwen3.5-9B A100 Inference Optimization

Single-user decode throughput on one A100 80GB PCIe. Starting point: 83 tok/s. Final result: 207 tok/s.

Model checkpoints:

Stack: vLLM 0.21.0, torch 2.11.0+cu129, AutoRound 0.12.3, FLA 0.5.0, A100 PCIe sm80


Winning Launch

python apply_qwen35_w4a16_vllm021_patch.py  # see Patches section

vllm serve alityb/Qwen3.5-9B-FullINT4-MTP \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --language-model-only \
  --quantization gptq \
  --enable-chunked-prefill \
  --max-num-batched-tokens 8192 \
  --speculative-config '{"method": "mtp", "num_speculative_tokens": 4}'

What Worked

vLLM 0.21 upgrade

Upgrading from vLLM 0.20.2 to 0.21 was the prerequisite for everything else. vLLM 0.21 added native Qwen3.5 MTP support and a --language-model-only flag that skips vision tower initialization. The upgrade alone added nothing to throughput but unlocked the rest.

MTP speculative decoding

Qwen3.5-9B has a built-in multi-token prediction head trained alongside the model. vLLM 0.21 can use it for speculative decoding with --speculative-config '{"method": "mtp", "num_speculative_tokens": N}'.

The raw streaming benchmark undercounts MTP throughput because each streamed chunk can contain multiple accepted tokens. All numbers below use tokenized output throughput.
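To make the counting difference concrete, here is a minimal sketch. It assumes the streamed chunks were collected as strings and the wall-clock duration was recorded. `count_tokens` is a hypothetical stand-in for the model tokenizer (a whitespace split in the usage example), and none of this is the benchmark script behind the numbers below.

```python
from typing import Callable, Sequence


def chunk_rate(chunks: Sequence[str], elapsed_s: float) -> float:
    """Naive rate: treats every streamed chunk as exactly one token."""
    return len(chunks) / elapsed_s


def token_rate(
    chunks: Sequence[str],
    elapsed_s: float,
    count_tokens: Callable[[str], int],
) -> float:
    """Tokenized rate: re-tokenizes the full output and counts real tokens."""
    text = "".join(chunks)
    return count_tokens(text) / elapsed_s


def whitespace_tokens(text: str) -> int:
    """Hypothetical tokenizer stand-in; swap in the real tokenizer in practice."""
    return len(text.split())


# With speculative decoding, one chunk may carry several accepted tokens.
example_chunks = ["a b ", "c d e ", "f"]
naive = chunk_rate(example_chunks, elapsed_s=2.0)
actual = token_rate(example_chunks, elapsed_s=2.0, count_tokens=whitespace_tokens)
```

In this toy input three chunks carry six tokens, so the chunk-based rate is half the tokenized rate.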

| Config | tok/s |
| --- | --- |
| BF16, no MTP | 83 |
| BF16, MTP k=1 | 104 |
| BF16, MTP k=3 | 198 |

MTP k=3 gave a 2.4x uplift over baseline BF16.

W4A16 partial quantization (GDN layers excluded)

AutoRound quantized all transformer layers to 4-bit except the GDN linear_attn layers, which were kept at BF16. Combined with MTP k=3, this reached 198 tok/s on the second pod.

Two vLLM patches were required to serve this checkpoint through Qwen3_5ForConditionalGeneration:

  1. quant_config=None when initializing Qwen3_VisionTransformer. The vision tower is never used with --language-model-only but vLLM still initializes it, and the standard quant path picks Marlin kernels for shapes that A100 does not support.
  2. A WeightsMapper remapping text-only checkpoint keys (model.*, lm_head.*) to the conditional wrapper namespace (language_model.model.*).

Full INT4 quantization

All 32 transformer layers are quantized to INT4. The resulting checkpoint is 7.7GB vs 16GB for partial W4A16. Raw decode throughput without speculative decoding: 149 tok/s.

MTP head quantization (scale-tuned RTN to Marlin format)

The MTP head weights from Qwen/Qwen3.5-9B are BF16. To load them through the same compiled Marlin path as the target model, they need to be in GPTQ Marlin format. This required:

  1. Scale-tuned RTN quantization (grid search over scale multiplier to minimize MSE)
  2. Packing in Marlin transposed format: qweight (in//8, out), scales (num_groups, out), qzeros (num_groups, out//8)
  3. Storing qzeros with the value 7 for ExLlamaV2 compatibility, or 8 for Marlin
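The first two steps can be sketched in pure Python for a single weight group. This is an illustration under assumptions, not the script used for the checkpoint: the multiplier grid (0.50 to 1.00), the max-abs starting scale, the fixed zero point of 8, and the low-nibble-first packing order are all choices made for this example, and the function names are hypothetical.

```python
def quantize_group(weights, multiplier, bits=4):
    """Round-to-nearest quantize one group using a scaled max-abs step size."""
    zero = 2 ** (bits - 1)   # assumed zero point: 8 for 4-bit
    qmax = 2 ** bits - 1     # largest code: 15 for 4-bit
    max_abs = max(abs(w) for w in weights)
    scale = (max_abs / (zero - 1)) * multiplier if max_abs > 0 else 1.0
    # Quantize, shift by the zero point, and clamp into the valid code range.
    codes = [min(qmax, max(0, round(w / scale) + zero)) for w in weights]
    dequant = [(c - zero) * scale for c in codes]
    mse = sum((w - d) ** 2 for w, d in zip(weights, dequant)) / len(weights)
    return scale, codes, mse


def scale_tuned_rtn(weights, multipliers=None):
    """Grid-search the scale multiplier and keep the lowest-MSE result."""
    if multipliers is None:
        multipliers = [i / 20 for i in range(10, 21)]  # 0.50, 0.55, ..., 1.00
    return min((quantize_group(weights, m) for m in multipliers),
               key=lambda result: result[2])


def pack_int4(codes):
    """Pack eight 4-bit codes into each 32-bit word, lowest nibble first."""
    assert len(codes) % 8 == 0, "input length must be a multiple of 8"
    packed = []
    for start in range(0, len(codes), 8):
        word = 0
        for j, code in enumerate(codes[start:start + 8]):
            word |= (code & 0xF) << (4 * j)
        packed.append(word)
    return packed


group = [0.12, -0.40, 0.05, 0.33, -0.07, 0.91, -0.22, 0.01]
best_scale, best_codes, best_mse = scale_tuned_rtn(group)
packed_words = pack_int4(best_codes)
```

Packing eight 4-bit codes per 32-bit word is what gives qweight its in//8 leading dimension. Because the multiplier 1.0 is part of the grid, the tuned result is never worse in MSE than plain max-abs RTN on the same group.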

The final combination (FullINT4 target model, scale-tuned Marlin MTP head, k=4) gave 207 tok/s.

| Config | tok/s | Acceptance |
| --- | --- | --- |
| FullINT4, no MTP | 149 | - |
| FullINT4, MTP k=3 | 184 | 64% |
| FullINT4, MTP k=4 | 207 | 65% |
| FullINT4, MTP k=5 | 192 | 50% |

k=4 is the sweet spot. k=5 acceptance falls below the break-even point.
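A rough way to read the table, as a sketch under stated assumptions: if "Acceptance" means accepted draft tokens divided by proposed draft tokens, and each verification step also emits one token from the target model, then the mean tokens per step is 1 + k × acceptance. Dividing that by the measured speedup over the no-MTP run gives the implied cost of one draft-plus-verify step relative to a plain decode step. The acceptance definition is an assumption, not something the measurements confirm.

```python
BASELINE_TOK_S = 149.0  # FullINT4, no MTP (from the table above)

# k -> (measured tok/s, reported acceptance), copied from the table above
RUNS = {3: (184.0, 0.64), 4: (207.0, 0.65), 5: (192.0, 0.50)}


def tokens_per_step(k: int, acceptance: float) -> float:
    """Mean tokens emitted per verify step: one target token plus accepted drafts."""
    return 1.0 + k * acceptance


def relative_step_cost(k: int, tok_s: float, acceptance: float) -> float:
    """Implied cost of one speculative step, in units of a plain decode step."""
    speedup = tok_s / BASELINE_TOK_S
    return tokens_per_step(k, acceptance) / speedup


def summarize() -> dict:
    """Return k -> (tokens per step, speedup, relative step cost)."""
    return {
        k: (
            tokens_per_step(k, acc),
            tok_s / BASELINE_TOK_S,
            relative_step_cost(k, tok_s, acc),
        )
        for k, (tok_s, acc) in RUNS.items()
    }


for k, (tps, speedup, cost) in summarize().items():
    print(f"k={k}: {tps:.2f} tokens/step, {speedup:.2f}x speedup, step cost {cost:.2f}x")
```

Under this reading, k=5 yields fewer tokens per step than k=4 while drafting one more token each step, which is consistent with k=4 being the better setting.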


What Did Not Work

FP8 quantization

Every FP8 attempt produced mixed-script garbage output. The A100 is SM80 and has no hardware FP8 path for Qwen3.5's GDN recurrent state accumulation. Software emulation was too lossy for the GDN layers.

FP8 KV cache

Coherent output but no throughput improvement. KV cache was not the bottleneck at 10k context on A100 80GB.

group_size=64 on the main model

Re-quantizing with group_size=64 more than doubled the checkpoint size, from 7.7GB to 17GB, because the scale and zero-point tensors double in size. The accuracy gain from smaller groups does not compensate for the extra memory bandwidth. Net loss on A100 PCIe.

BF16 MTP draft with vLLM compiled target

When the target model is GPTQ and the MTP head is BF16, vLLM instantiates the MTP draft submodule under the global quant config and silently skips all BF16 weights, loading zeros instead. Acceptance was 0%.

Fixing this required patching qwen3_5_mtp.py to disable quantization for the MTP submodule. That fixed acceptance but introduced a torch.compile crash (Expected tensors only, but got: int) caused by spec_step_idx being passed as a Python int into Inductor graphs.

The final working state with BF16 MTP draft: both MTP classes decorated with @ignore_torch_compile, running eager. This gives correct acceptance (~65%) but the eager MTP forward is slow enough that the compiled path with Marlin-quantized MTP weights is faster overall.

AutoRound SignSGD on MTP head

AutoRound's SignSGD optimizer was applied to the MTP decoder block using pre-computed hidden states from the BF16 backbone as calibration data. It quantized 7/7 layers, but acceptance did not improve over scale-tuned RTN. Likely cause: the simplified MTP forward pass used for calibration (using hidden states as a proxy for attention output) was not representative enough to give AutoRound a meaningful gradient signal.

ExLlamaV2 GEMV kernel

ExLlamaV2 has a batch=1 GEMV kernel compiled for SM80 that runs ~2x faster than cuBLAS in a standalone micro-benchmark. The patch was applied to gptq_marlin.py and confirmed loading in both the API server and EngineCore processes.

Throughput with the patch: 186 tok/s. Without: 207 tok/s. The patch made things worse.

The reason: vLLM's torch.compile captures the full forward pass including Marlin dequant and matmul into a fused CUDA graph. The compiled Marlin path is already faster than the naive ExLlamaV2.gemm_half_q_half call because it avoids QMatrix handle overhead and benefits from graph-level fusion.

MTP quantization-aware fine-tuning

To get MTP acceptance above 70% at k=5 (which would push throughput above 220 tok/s), the MTP head needs to be trained against the INT4 target's output distribution, not just quantized. This requires gradient descent through the full MTP transformer block using hidden states from the quantized target model. Not attempted due to training infrastructure requirements.

SGLang

SGLang 0.5.11 ships sgl_kernel with only SM90 and SM100 variants. The SM90 variant requires libnuma.so.1 which was not present in the container. The SM100 variant requires libnvrtc.so.13 (CUDA 13). Neither loads on A100 SM80 with CUDA 12.x. Confirmed not functional on this pod.
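A quick way to check for the two shared libraries named above before installing anything is to ask the dynamic loader directly. This is a minimal sketch using only the standard library; it reports whether each library can be loaded in the current environment and says nothing about whether sgl_kernel itself would work.

```python
import ctypes


def can_load(lib_name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(lib_name)
        return True
    except OSError:
        return False


def check_libs(names=("libnuma.so.1", "libnvrtc.so.13")) -> dict:
    """Map each library name to whether it loads on this machine."""
    return {name: can_load(name) for name in names}


if __name__ == "__main__":
    for name, ok in check_libs().items():
        print(f"{name}: {'found' if ok else 'missing'}")
```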


Patches

Two patches to vllm/model_executor/models/qwen3_5.py are required (vLLM 0.21):

# Patch 1: vision transformer quant bypass
# In Qwen3_5ForConditionalGeneration.__init__:
self.visual = Qwen3_VisionTransformer(
    config.vision_config,
    norm_eps=getattr(config, "rms_norm_eps", 1e-6),
    quant_config=None,  # vision never quantized in text-only W4A16
    prefix=maybe_prefix(prefix, "visual"),
)

# Patch 2: weight name mapper in Qwen3_5ForConditionalGeneration class body:
hf_to_vllm_mapper = WeightsMapper(
    orig_to_new_prefix={
        "model.language_model.": "language_model.model.",
        "model.visual.": "visual.",
        "model.": "language_model.model.",
        "lm_head.": "language_model.lm_head.",
    }
)

The patch script apply_qwen35_w4a16_vllm021_patch.py applies both idempotently.
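To show what the prefix table in Patch 2 is meant to do, here is a standalone sketch of first-match prefix remapping. It does not import vLLM, and the first-match-in-listed-order rule is an assumption made for this example rather than a statement about how WeightsMapper is implemented. The checkpoint key names in the usage lines are hypothetical.

```python
PREFIX_MAP = {
    "model.language_model.": "language_model.model.",
    "model.visual.": "visual.",
    "model.": "language_model.model.",
    "lm_head.": "language_model.lm_head.",
}


def remap_key(name: str, prefix_map: dict = PREFIX_MAP) -> str:
    """Rewrite the first matching prefix; keys with no match pass through."""
    for old, new in prefix_map.items():  # dicts preserve insertion order
        if name.startswith(old):
            return new + name[len(old):]
    return name


# Text-only checkpoint keys land in the conditional wrapper namespace.
text_only = remap_key("model.layers.0.mlp.down_proj.qweight")
head = remap_key("lm_head.weight")
# Keys that already carry the multimodal layout must hit the specific entries.
wrapped = remap_key("model.language_model.layers.0.mlp.down_proj.qweight")
vision = remap_key("model.visual.blocks.0.attn.qkv.weight")
```

With a first-match rule, the more specific "model.language_model." and "model.visual." entries have to come before the generic "model." entry, otherwise the generic entry would claim those keys first.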


Remaining Gap to 250+ tok/s

MTP fine-tuning on INT4 target: the MTP head needs to be trained against the quantized target's output distribution rather than quantized post-hoc.
