Model Overview

  • Model Architecture: Qwen3-30B-A3B-Thinking-2507
    • Input: Text
    • Output: Text
  • Supported Hardware Microarchitecture: AMD MI300/MI325/MI350/MI355
  • ROCm: 7.0+
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark
    • Weight quantization: PerTensor, FP8E4M3, Static
    • Activation quantization: PerTensor, FP8E4M3, Static
  • Calibration Dataset: Pile

This model was built from Qwen/Qwen3-30B-A3B-Thinking-2507 by applying FP8 per-tensor quantization with AMD-Quark.

Model Quantization

The model was quantized from Qwen/Qwen3-30B-A3B-Thinking-2507 using AMD-Quark. Both weights and activations are quantized to FP8 (E4M3) with per-tensor granularity using static calibration.
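
To make "per-tensor, static" concrete, the following is a minimal pure-Python sketch of the numerics, not AMD-Quark's implementation. It assumes the scale is derived from the maximum absolute value seen in calibration data and that out-of-range values saturate at 448, the largest finite E4M3 value. The helper names (`round_to_e4m3`, `calibrate_scale`, `quantize`, `dequantize`) are hypothetical and exist only for this illustration.

```python
import math

E4M3_MAX = 448.0          # largest finite value in FP8 E4M3 (the "fn" variant)
E4M3_MIN_NORMAL_EXP = -6  # exponent of the smallest normal value, 2**-6
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest representable E4M3 value, saturating at +/-448."""
    if x == 0.0 or math.isnan(x):
        return x
    a = min(abs(x), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, E4M3_MIN_NORMAL_EXP)   # subnormals share the minimum exponent
    step = 2.0 ** (e - MANTISSA_BITS)      # spacing of representable values near a
    q = min(round(a / step) * step, E4M3_MAX)  # round() rounds half to even
    return math.copysign(q, x)


def calibrate_scale(calibration_batches):
    """Static per-tensor scale: a single value fixed offline from calibration data."""
    amax = max(abs(v) for batch in calibration_batches for v in batch)
    return max(amax, 1e-12) / E4M3_MAX     # guard against an all-zero tensor


def quantize(values, scale):
    """Map real values onto the E4M3 grid using one shared scale."""
    return [round_to_e4m3(v / scale) for v in values]


def dequantize(quantized, scale):
    """Recover approximate real values from E4M3 codes."""
    return [q * scale for q in quantized]


# The scale is computed once from calibration batches, then reused at inference.
scale = calibrate_scale([[0.5, -2.0], [1.0]])
codes = quantize([0.5, -2.0, 1.0, 5.0], scale)  # 5.0 exceeds the calibrated range
restored = dequantize(codes, scale)
```

Because the scale is fixed ahead of time, an input larger than anything seen during calibration (5.0 in the sketch) saturates rather than triggering a new scale; that is the trade-off of static versus dynamic activation quantization.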

Quantization script:

# pip install amd-quark

from transformers import AutoTokenizer, AutoModelForCausalLM
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.quantization import FP8E4M3PerTensorSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig

ckpt_path = "Qwen/Qwen3-30B-A3B-Thinking-2507"
exclude_layers = ["lm_head", "*mlp.gate"]
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-FP8"

# Load the original floating-point model
model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map="auto", torch_dtype="auto", trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)

# Set the quantization configuration
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(is_dynamic=False).to_quantization_spec()
W_FP8_A_FP8_PER_TENSOR_CONFIG = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, weight=FP8_PER_TENSOR_SPEC)
quant_config = Config(global_quant_config=W_FP8_A_FP8_PER_TENSOR_CONFIG, exclude=exclude_layers)

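# Apply quantization (static activation scales rely on calibration data; loading the Pile calibration set is not shown in this script)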
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)

# Export quantized model
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)
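
The `exclude_layers` list in the script keeps the output head and, in a Qwen3 MoE block, the router (`mlp.gate`) out of quantization. The sketch below shows which layers such patterns would select, assuming glob-style matching like Python's `fnmatch`; AMD-Quark's exact matching rules are not stated here, so treat this as an illustration only. The layer names are hypothetical examples modeled on a Qwen3 MoE decoder layer.

```python
from fnmatch import fnmatchcase

exclude_layers = ["lm_head", "*mlp.gate"]

# Hypothetical module names, modeled on a Qwen3 MoE decoder layer.
layer_names = [
    "lm_head",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.mlp.experts.0.down_proj",
]


def is_excluded(name, patterns):
    """Return True if the layer name matches any glob-style exclude pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


kept_in_original_precision = [n for n in layer_names if is_excluded(n, exclude_layers)]
quantized_to_fp8 = [n for n in layer_names if not is_excluded(n, exclude_layers)]

print("excluded:", kept_in_original_precision)
print("quantized:", quantized_to_fp8)
```

Under this assumption, `*mlp.gate` has no trailing wildcard, so it selects the router but not the experts' `gate_proj` layers, which are still quantized.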

Accuracy

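Benchmark                      | Qwen3-30B-A3B-Thinking-2507 (BF16) | Qwen3-30B-A3B-Thinking-2507-FP8 (this model)
-------------------------------|------------------------------------|---------------------------------------------
GSM8K (5-shot, 1319 questions) | 0.836                              | 0.872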
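
As a rough guide to how much sampling noise a 1319-question score carries, the sketch below computes the binomial standard error for each reported accuracy. It assumes each question is an independent pass/fail trial and a single evaluation run; it is a back-of-the-envelope aid, not part of the reported evaluation.

```python
import math

N_QUESTIONS = 1319
reported_accuracy = {"BF16": 0.836, "FP8": 0.872}  # values from the table above


def standard_error(accuracy, n):
    """Binomial standard error of an accuracy measured on n independent questions."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n)


def approx_correct(accuracy, n):
    """Approximate number of correct answers implied by a rounded accuracy."""
    return round(accuracy * n)


summary = {
    name: {
        "correct": approx_correct(acc, N_QUESTIONS),
        "std_err": standard_error(acc, N_QUESTIONS),
    }
    for name, acc in reported_accuracy.items()
}

for name, stats in summary.items():
    print(f"{name}: ~{stats['correct']} correct, standard error {stats['std_err']:.4f}")
```

Each score carries a standard error of roughly one percentage point under these assumptions.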

Reproducing the evaluation:

# Start the vLLM server
vllm serve amd/Qwen3-30B-A3B-Thinking-2507-FP8 \
    --max-model-len 4096 \
    --trust-remote-code

# In a second terminal, run the GSM8K evaluation from the root of the vLLM repository
python tests/evals/gsm8k/gsm8k_eval.py \
    --num-shots 5 \
    --num-questions 1319 \
    --max-tokens 1024
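
Before launching the full evaluation, it can help to confirm that the server answers a single request. The sketch below uses only the Python standard library and assumes the server is reachable at `http://localhost:8000` (vLLM's default port) through its OpenAI-compatible `/v1/completions` endpoint. The helper names and the zero sampling temperature are illustrative choices, not settings taken from the evaluation script.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: vLLM's default host and port
MODEL = "amd/Qwen3-30B-A3B-Thinking-2507-FP8"


def build_completion_request(prompt, max_tokens=1024, base_url=BASE_URL):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # illustrative choice for a repeatable smoke test
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Question: What is 12 * 7?\nAnswer:", max_tokens=64))` should return generated text; a connection error indicates the server is not up or is listening on a different port.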

License

Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.
