Qwen3-30B-A3B-FP8-Dynamic
Post-training quantized checkpoint of Qwen/Qwen3-30B-A3B
produced by the pex/baselines pipeline as part of the PEX paper baselines.
Quantization
| Knob | Value |
|---|---|
| Method | FP8 |
| Scheme | FP8_DYNAMIC |
| Group size | - |
| Producer tool | llmcompressor |
| Format | compressed-tensors |
Calibration
No calibration data is needed: FP8 W8A8 derives static per-channel weight scales from each channel's max-abs value and computes per-token activation scales dynamically at inference time.
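As a rough illustration of why no calibration set is required, the sketch below simulates per-row max-abs scaling into the FP8 range in plain Python. Applied to a weight matrix laid out as [out_channels, in_features] it yields per-channel weight scales; applied to an activation matrix laid out as [tokens, hidden] at inference time it yields dynamic per-token scales. The helper names (`round_to_fp8_e4m3`, `quantize_rows`) are hypothetical, and the E4M3 variant (maximum value 448, 3 mantissa bits) is an assumption; this is not the llmcompressor implementation.

```python
import math

FP8_E4M3_MAX = 448.0  # assumed format: FP8 E4M3, largest finite value 448


def round_to_fp8_e4m3(x: float) -> float:
    """Round a finite float to the nearest value on the E4M3 grid (simulated)."""
    if x == 0.0:
        return 0.0
    x = max(-FP8_E4M3_MAX, min(FP8_E4M3_MAX, x))  # saturate to the FP8 range
    _, exp = math.frexp(x)       # x = m * 2**exp with 0.5 <= |m| < 1
    e = max(exp - 1, -6)         # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)        # 3 mantissa bits -> 8 steps per power of two
    return round(x / step) * step  # Python's round() breaks ties to even


def quantize_rows(rows):
    """Scale each row by its own max-abs value, then round to FP8.

    Returns (quantized_rows, scales); multiply back by the scale to dequantize.
    """
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        quantized.append([round_to_fp8_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


if __name__ == "__main__":
    # Toy weight matrix: one row per output channel.
    weights = [[0.5, -2.0, 1.0], [0.01, 0.02, -0.04]]
    q, s = quantize_rows(weights)
    dequantized = [[v * scale for v in row] for row, scale in zip(q, s)]
    print(s, dequantized)
```

Because each scale depends only on the tensor being quantized, weight scales can be computed once offline and activation scales on the fly, which is the reason the scheme needs no calibration data.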
Skipped modules
For the Qwen3 MoE architecture, three module types are left unquantized: lm_head, the router (mlp.gate), and the shared-expert gate. All MLP up, down, and gate projections inside each expert are quantized.
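The skip rule can be sketched as a small name filter. The patterns and module names below are assumptions that follow common Hugging Face Transformers naming (for example `model.layers.0.mlp.gate` for the router and `mlp.shared_expert_gate` for the shared-expert gate); the actual ignore list used to produce this checkpoint is the one recorded in the producer recipe.

```python
import re

# Hypothetical ignore patterns mirroring the skip list described above.
SKIP_PATTERNS = [
    r"lm_head",                      # output head
    r".*\.mlp\.gate",                # MoE router
    r".*\.mlp\.shared_expert_gate",  # shared-expert gate (assumed module name)
]


def is_skipped(module_name: str) -> bool:
    """Return True if the module would be left unquantized under these patterns."""
    return any(re.fullmatch(pattern, module_name) for pattern in SKIP_PATTERNS)


if __name__ == "__main__":
    for name in [
        "lm_head",
        "model.layers.0.mlp.gate",
        "model.layers.0.mlp.experts.3.gate_proj",
        "model.layers.0.mlp.experts.3.down_proj",
    ]:
        print(name, "skipped" if is_skipped(name) else "quantized")
```

Using `re.fullmatch` keeps the router pattern from accidentally matching the per-expert `gate_proj` layers, which share the `gate` prefix but are meant to be quantized.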
Serving with vLLM
vllm serve morriszjm/Qwen3-30B-A3B-FP8-Dynamic \
--tensor-parallel-size 4 \
--gpu-memory-utilization 0.90 \
--max-model-len 32768
vLLM auto-detects the quantization format from config.json.
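To check what a loader would see before serving, a downloaded copy of config.json can be inspected directly. This is a minimal sketch: the key names `quantization_config` and `quant_method` are assumptions about how the format is recorded, and the local path in the usage note is a placeholder.

```python
import json
from pathlib import Path


def describe_quantization(config: dict) -> str:
    """Summarize the quantization entry of a parsed config.json, if present."""
    qcfg = config.get("quantization_config")  # assumed key name
    if qcfg is None:
        return "no quantization_config found"
    return f"quant_method={qcfg.get('quant_method', 'unknown')}"


def describe_quantization_file(config_path: str) -> str:
    """Load config.json from disk and summarize its quantization entry."""
    return describe_quantization(json.loads(Path(config_path).read_text()))
```

For example, `describe_quantization_file("./Qwen3-30B-A3B-FP8-Dynamic/config.json")` would report the recorded method for a local download at that (hypothetical) path.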
Reproducing
The exact producer recipe (including the calibration hash, for methods that use calibration data) is in
meta.json next to the weights.
Reference
This checkpoint is one of three quantization baselines (RTN / GPTQ / AWQ) used to anchor the Pareto plots in the PEX paper. It is not a SOTA release: it is an out-of-the-box reference produced with each method's paper-default recipe to enable fair method-vs-method comparison.