Qwen3-30B-A3B-FP8-Dynamic

Post-training quantized checkpoint of Qwen/Qwen3-30B-A3B produced by the pex/baselines pipeline as part of the PEX paper baselines.

Quantization

Knob	Value
Method	FP8
Scheme	`FP8_DYNAMIC`
Group size	`-`
Producer tool	`llmcompressor`
Format	`compressed-tensors`

Calibration

No calibration data is needed: FP8 W8A8 derives a static per-channel weight scale from each channel's max-abs value and computes per-token activation scales dynamically at inference time.
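To make the scaling concrete, here is a minimal pure-Python sketch of max-abs scaling into the FP8 E4M3 range. It is an illustration only, not llmcompressor's implementation: the helper names (`round_to_e4m3`, `quantize_rows`, `dequantize_rows`) and the sample values are made up, and the rounding routine is a simplified simulation of E4M3 (3 mantissa bits, maximum finite value 448, saturating).

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite value in the E4M3 format


def round_to_e4m3(value: float) -> float:
    """Round a float to the nearest E4M3-representable value, saturating at +/-448."""
    if value == 0.0:
        return 0.0
    sign = -1.0 if value < 0 else 1.0
    magnitude = min(abs(value), FP8_E4M3_MAX)
    # frexp returns magnitude = m * 2**e with 0.5 <= m < 1, so floor(log2) = e - 1.
    # Exponents below -6 fall into the subnormal range, which shares one step size.
    exponent = max(math.frexp(magnitude)[1] - 1, -6)
    step = 2.0 ** (exponent - 3)  # 3 mantissa bits give 8 steps per power of two
    return sign * min(round(magnitude / step) * step, FP8_E4M3_MAX)


def quantize_rows(rows):
    """Quantize each row with its own max-abs scale; return (fp8_values, scales)."""
    quantized, scales = [], []
    for row in rows:
        max_abs = max(abs(v) for v in row)
        scale = max_abs / FP8_E4M3_MAX if max_abs > 0 else 1.0
        quantized.append([round_to_e4m3(v / scale) for v in row])
        scales.append(scale)
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Map FP8 values back to the original range using the stored scales."""
    return [[q * s for q in row] for row, s in zip(quantized, scales)]


# Weights: one row per output channel; scales are computed once, offline.
weights = [[0.02, -0.5, 0.11], [3.0, -1.5, 0.75]]
w_q, w_scales = quantize_rows(weights)

# Activations: one row per token; scales are recomputed for every input at runtime.
activations = [[1.0, -8.0, 0.25], [0.001, 0.002, -0.004]]
a_q, a_scales = quantize_rows(activations)
```

The same routine serves both cases because the only difference is when the scale is computed: weight scales depend on fixed tensors and can be stored in the checkpoint, while activation scales depend on the input and so cannot be precomputed without calibration data.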

Skipped modules

Qwen3 MoE: the recipe skips lm_head, the router (mlp.gate), and the shared-expert gate. All MLP projections inside each expert (up_proj, down_proj, gate_proj) are quantized.
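The distinction between the router `mlp.gate` and an expert's `gate_proj` is easy to get wrong in an ignore list, so the sketch below shows one way such a skip list could be expressed as anchored regular expressions. The patterns, the `shared_expert_gate` module name, and the example module paths are assumptions for illustration; the recipe actually used is the one recorded in meta.json.

```python
import re

# Hypothetical ignore patterns mirroring the skip list above.
IGNORE_PATTERNS = [
    r"lm_head",                        # output head
    r".*\.mlp\.gate",                  # MoE router
    r".*\.mlp\.shared_expert_gate",    # shared-expert gate (module name assumed)
]


def is_skipped(module_name: str) -> bool:
    """Return True if the module name fully matches any ignore pattern."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Illustrative module paths in the style of a Hugging Face MoE checkpoint.
example_modules = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.3.gate_proj",
    "model.layers.0.mlp.experts.3.up_proj",
    "model.layers.0.mlp.experts.3.down_proj",
]
skipped = [name for name in example_modules if is_skipped(name)]
quantized = [name for name in example_modules if not is_skipped(name)]
```

Using a full match rather than a substring search is what keeps `experts.3.gate_proj` in the quantized set even though its name contains `gate`.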

Serving with vLLM

vllm serve morriszjm/Qwen3-30B-A3B-FP8-Dynamic \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 32768

vLLM auto-detects the quantization format from config.json.
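To check what a loader would see, the quantization block of config.json can be inspected directly. The sketch below assumes the checkpoint stores a `quantization_config` object with `quant_method` and `format` keys, as compressed-tensors checkpoints commonly do; the sample JSON is a made-up stand-in, not the contents of this repository's file, and `describe_quantization` is a hypothetical helper.

```python
import json


def describe_quantization(config_text: str) -> str:
    """Summarize the quantization block of a config.json document."""
    config = json.loads(config_text)
    quant = config.get("quantization_config")
    if quant is None:
        return "unquantized"
    method = quant.get("quant_method", "unknown")
    fmt = quant.get("format", "unknown")
    return f"{method} / {fmt}"


# Illustrative stand-in for a config.json; field values are assumptions.
sample_config = """
{
  "model_type": "qwen3_moe",
  "quantization_config": {
    "quant_method": "compressed-tensors",
    "format": "float-quantized",
    "ignore": ["lm_head"]
  }
}
"""
summary = describe_quantization(sample_config)
```

In practice the text would come from the downloaded file, for example `describe_quantization(open("config.json").read())`.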

Reproducing

The exact producer recipe is in meta.json next to the weights. No calibration hash is listed above because this scheme uses no calibration data.

Reference

This checkpoint is one of three quantization baselines (RTN / GPTQ / AWQ) used to anchor the Pareto plots in the PEX paper. It is not a SOTA release; it is an out-of-the-box reference, with each baseline produced from its method's paper-default recipe to enable fair method-vs-method comparison.

Safetensors: model size 31B params; tensor types BF16 and F8_E4M3

Model tree for morriszjm/Qwen3-30B-A3B-FP8-Dynamic: Qwen/Qwen3-30B-A3B-Base (base model), Qwen/Qwen3-30B-A3B (finetuned), this model (quantized)