Model Overview
- Model Architecture: Qwen3-30B-A3B-Thinking-2507
- Input: Text
- Output: Text
- Supported Hardware Microarchitecture: AMD MI300/MI325/MI350/MI355
- ROCm: 7.0+
- Operating System(s): Linux
- Inference Engine: vLLM
- Model Optimizer: AMD-Quark
- Weight quantization: PerTensor, FP8E4M3, Static
- Activation quantization: PerTensor, FP8E4M3, Static
- Calibration Dataset: Pile
This model was built from the Qwen3-30B-A3B-Thinking-2507 model by applying AMD-Quark for FP8 per-tensor quantization.
Model Quantization
The model was quantized from Qwen/Qwen3-30B-A3B-Thinking-2507 using AMD-Quark. Both weights and activations are quantized to FP8 (E4M3) with per-tensor granularity using static calibration.
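To make "per-tensor, static FP8 (E4M3)" concrete, the following is a minimal pure-Python sketch of the arithmetic, not AMD-Quark's implementation. It assumes the OCP E4M3 variant whose largest finite value is 448. The helper names (`per_tensor_scale`, `round_to_e4m3`, `fake_quantize`) are hypothetical and exist only for illustration. With static quantization, the scale is computed once from calibration data and then reused at inference time.

```python
import math

# Assumption: OCP E4M3 ("e4m3fn") format, whose largest finite value is 448.
FP8_E4M3_MAX = 448.0


def per_tensor_scale(values):
    """Return one scale for the whole tensor: the largest magnitude maps to the FP8 maximum."""
    amax = max(abs(v) for v in values)
    return amax / FP8_E4M3_MAX if amax > 0 else 1.0


def round_to_e4m3(x):
    """Round x to the nearest E4M3 value (3 mantissa bits), saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    _, e = math.frexp(mag)        # mag = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # -6 is the smallest normal exponent; below it values are subnormal
    step = 2.0 ** (exp - 3)       # spacing between neighbouring values at this exponent
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def fake_quantize(values, scale):
    """Quantize to E4M3 with a fixed (static) scale, then dequantize back to float."""
    return [round_to_e4m3(v / scale) * scale for v in values]


# Illustrative data only: "calibration" fixes the scale, which is then reused unchanged.
calibration_values = [0.5, -1.0, 0.25, 2.0]
scale = per_tensor_scale(calibration_values)
print(fake_quantize([0.3, 1.7, -5.0], scale))  # -5.0 lies outside the calibrated range and saturates
```

Because a single scale covers the entire tensor, any value larger than the calibrated maximum is clipped, which is why the calibration data should be representative of real inputs.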
Quantization script:
# pip install amd-quark
from transformers import AutoTokenizer, AutoModelForCausalLM
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.quantization import FP8E4M3PerTensorSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig
ckpt_path = "Qwen/Qwen3-30B-A3B-Thinking-2507"
exclude_layers = ["lm_head", "*mlp.gate"]
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-FP8"
# Load the original floating-point model
model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map="auto", torch_dtype="auto", trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)
# Set the quantization configuration
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(is_dynamic=False).to_quantization_spec()
W_FP8_A_FP8_PER_TENSOR_CONFIG = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, weight=FP8_PER_TENSOR_SPEC)
quant_config = Config(global_quant_config=W_FP8_A_FP8_PER_TENSOR_CONFIG, exclude=exclude_layers)
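# Apply quantization
# Note: static quantization relies on calibration data (this card lists Pile);
# the calibration dataloader is not shown in this script.
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)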
# Export quantized model
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)
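The `exclude_layers` patterns in the script above keep some modules unquantized. As a rough illustration of which modules patterns like `"lm_head"` and `"*mlp.gate"` would select, here is a sketch that assumes glob-style matching (as the `*` suggests) and uses Python's `fnmatch`; the actual matching logic inside AMD-Quark may differ. The module names are illustrative examples in the style of a Qwen3 MoE checkpoint, not a listing taken from the model.

```python
from fnmatch import fnmatchcase

exclude_layers = ["lm_head", "*mlp.gate"]

# Illustrative module names (assumed, not read from the checkpoint).
module_names = [
    "lm_head",
    "model.layers.0.mlp.gate",                   # MoE router
    "model.layers.0.mlp.experts.0.gate_proj",    # expert projection
    "model.layers.0.self_attn.q_proj",
]


def is_excluded(name, patterns):
    """Return True if the module name matches any glob-style exclude pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


for name in module_names:
    status = "kept in original precision" if is_excluded(name, exclude_layers) else "quantized to FP8"
    print(f"{name}: {status}")
```

Under this assumption, `"*mlp.gate"` selects the router of each MoE block but not the experts' `gate_proj` layers, since the pattern has to match the end of the full name.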
Accuracy
| Benchmark | Qwen3-30B-A3B-Thinking-2507 (BF16) | Qwen3-30B-A3B-Thinking-2507-FP8 (this model) |
| --- | --- | --- |
| GSM8K (5-shot, 1319 questions) | 0.836 | 0.872 |
Reproducing the evaluation:
# Start the vLLM server
vllm serve amd/Qwen3-30B-A3B-Thinking-2507-FP8 \
--max-model-len 4096 \
--trust-remote-code
# Run the GSM8K evaluation (from the vLLM repo)
python tests/evals/gsm8k/gsm8k_eval.py \
--num-shots 5 \
--num-questions 1319 \
--max-tokens 1024
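Before running the full evaluation, it can help to confirm that the server responds. The sketch below sends one prompt to vLLM's OpenAI-compatible `/v1/completions` endpoint using only the standard library. It assumes the server from the `vllm serve` command above is listening on the default `http://localhost:8000`; the helper names and the `temperature` value are illustrative choices, not part of the evaluation script.

```python
import json
import urllib.request

MODEL_ID = "amd/Qwen3-30B-A3B-Thinking-2507-FP8"


def build_completion_request(prompt, base_url="http://localhost:8000", max_tokens=1024):
    """Build a POST request for the OpenAI-compatible completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # assumption: greedy decoding for a repeatable smoke test
    }
    return urllib.request.Request(
        f"{base_url}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt):
    """Send the request and return the generated text (requires a running server)."""
    with urllib.request.urlopen(build_completion_request(prompt)) as response:
        return json.load(response)["choices"][0]["text"]


# Example (only works while the vLLM server is up):
# print(complete("Question: What is 12 * 7?\nAnswer:"))
```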
License
Modifications Copyright(c) 2025 Advanced Micro Devices, Inc. All rights reserved.