Qwen3-30B-A3B-FP8-Dynamic

Post-training quantized checkpoint of Qwen/Qwen3-30B-A3B, produced by the pex/baselines pipeline for the PEX paper baselines.

Quantization

| Knob | Value |
|---|---|
| Method | FP8 |
| Scheme | FP8_DYNAMIC |
| Group size | - |
| Producer tool | llmcompressor |
| Format | compressed-tensors |

Calibration

No calibration data is used. FP8 W8A8 derives weight scales per output channel from the maximum absolute value, and computes activation scales dynamically per token at inference time.
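The scale computation described above can be sketched in pure Python. This is an illustrative simulation, not the llmcompressor implementation: the helper names `round_to_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical, and the rounding routine only approximates the E4M3 grid (3 mantissa bits, largest finite value 448) on ordinary floats. The same per-row routine serves both cases: applied to weight rows it gives per-channel scales fixed at export time, and applied to activation rows it gives per-token scales recomputed on every forward pass.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite magnitude in the E4M3 format


def round_to_e4m3(x):
    """Simulate round-to-nearest onto the E4M3 grid, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    exponent = max(math.floor(math.log2(mag)), -6)  # -6: smallest normal exponent
    step = 2.0 ** (exponent - 3)                    # 3 mantissa bits per binade
    return sign * round(mag / step) * step


def quantize_rows(rows):
    """Quantize each row with its own max-abs scale; return (quantized, scales)."""
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map quantized values back to the original range."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


# Weights: one row per output channel, scales computed once (no calibration data).
weights = [[0.02, -0.5, 0.25], [3.0, -1.5, 0.75]]
w_q, w_scales = quantize_rows(weights)
w_dq = dequantize_rows(w_q, w_scales)

# Activations: one row per token, scales recomputed at inference time.
acts = [[1.0, -2.0, 4.0]]
a_q, a_scales = quantize_rows(acts)
```

Because every row is rescaled so that its largest magnitude lands on 448, no sample data is needed to choose clipping ranges, which is why this scheme can skip calibration.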

Skipped modules

For Qwen3 MoE, the following modules are skipped (left unquantized): lm_head, the router (mlp.gate), and the shared-expert gate. All MLP up/down/gate projections inside each expert are quantized.
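As a rough illustration of how such a skip list partitions the model, the sketch below filters module names with regular expressions. The patterns and the example module names (including `shared_expert_gate`) are assumptions chosen for illustration; the actual ignore list used for this checkpoint is the one recorded in the producer recipe.

```python
import re

# Hypothetical ignore patterns mirroring the skip list described above.
IGNORE_PATTERNS = [
    r"lm_head",                        # output head
    r".*\.mlp\.gate",                  # MoE router
    r".*\.mlp\.shared_expert_gate",    # shared-expert gate (assumed name)
]


def is_skipped(module_name):
    """Return True if the module name fully matches any ignore pattern."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Illustrative module names in the style of a Hugging Face MoE checkpoint.
module_names = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.shared_expert_gate",
    "model.layers.0.mlp.experts.3.gate_proj",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.mlp.experts.3.down_proj",
]

skipped = [n for n in module_names if is_skipped(n)]
quantized = [n for n in module_names if not is_skipped(n)]
```

Using a full match rather than a substring search matters here: the router is named `mlp.gate`, while each expert also contains a `gate_proj` layer that must stay in the quantized set.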

Serving with vLLM

vllm serve morriszjm/Qwen3-30B-A3B-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

vLLM auto-detects the quantization format from config.json.
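To check what a loader would see, the quantization metadata can be read directly from config.json. The sketch below is not vLLM's detection code; it assumes the common layout in which the checkpoint declares a `quantization_config` object with a `quant_method` field, and the sample config is a hypothetical, heavily trimmed stand-in for the real file.

```python
import json


def detect_quant_method(config_text):
    """Return the quant_method declared in a config.json string, or None."""
    config = json.loads(config_text)
    quant_cfg = config.get("quantization_config") or {}
    return quant_cfg.get("quant_method")


# Hypothetical, trimmed config.json; real files contain many more fields.
sample_config = json.dumps({
    "model_type": "qwen3_moe",
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "ignore": ["lm_head"],
    },
})

method = detect_quant_method(sample_config)
print(method)
```

If the field is absent, the function returns None, which corresponds to an unquantized checkpoint in this simplified view.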

Reproducing

The exact producer recipe (including any calibration hash) is in meta.json next to the weights.

Reference

This checkpoint is one of three quantization baselines (RTN / GPTQ / AWQ) used to anchor the Pareto plots in the PEX paper. It is not a SOTA release: it is an out-of-the-box reference produced with each method's paper-default recipe to enable a fair method-vs-method comparison.
