Model Overview

Model Architecture: Qwen3-30B-A3B-Thinking-2507
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI300/MI325/MI350/MI355
ROCm: 7.0+
Operating System(s): Linux
Inference Engine: vLLM
Model Optimizer: AMD-Quark
- Weight quantization: PerTensor, FP8E4M3, Static
- Activation quantization: PerTensor, FP8E4M3, Static
Calibration Dataset: Pile

This model was built from the Qwen3-30B-A3B-Thinking-2507 model by applying AMD-Quark for FP8 per-tensor quantization.

Model Quantization

The model was quantized from Qwen/Qwen3-30B-A3B-Thinking-2507 using AMD-Quark. Both weights and activations are quantized to FP8 (E4M3) with per-tensor granularity using static calibration.

Quantization scripts:

# pip install amd-quark

from transformers import AutoTokenizer, AutoModelForCausalLM
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.quantization import FP8E4M3PerTensorSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig

ckpt_path = "Qwen/Qwen3-30B-A3B-Thinking-2507"
exclude_layers = ["lm_head", "*mlp.gate"]
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-FP8"

# Load the original floating-point model
model = AutoModelForCausalLM.from_pretrained(ckpt_path, device_map="auto", torch_dtype="auto", trust_remote_code=True)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)

# Set the quantization configuration
FP8_PER_TENSOR_SPEC = FP8E4M3PerTensorSpec(is_dynamic=False).to_quantization_spec()
W_FP8_A_FP8_PER_TENSOR_CONFIG = QuantizationConfig(input_tensors=FP8_PER_TENSOR_SPEC, weight=FP8_PER_TENSOR_SPEC)
quant_config = Config(global_quant_config=W_FP8_A_FP8_PER_TENSOR_CONFIG, exclude=exclude_layers)

# Apply quantization
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)

# Export quantized model
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)

Accuracy

Benchmark	Qwen3-30B-A3B-Thinking-2507 (BF16)	Qwen3-30B-A3B-Thinking-2507-FP8 (this model)
GSM8K (5-shot, 1319 questions)	0.836	0.872

Reproducing the evaluation:

# Start the vLLM server
vllm serve amd/Qwen3-30B-A3B-Thinking-2507-FP8 \
    --max-model-len 4096 \
    --trust-remote-code

# Run the GSM8K evaluation (from the vLLM repo)
python tests/evals/gsm8k/gsm8k_eval.py \
    --num-shots 5 \
    --num-questions 1319 \
    --max-tokens 1024

License

Downloads last month: 30

Safetensors

Model size

31B params

Tensor type

BF16

F8_E4M3

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Model tree for amd/Qwen3-30B-A3B-Thinking-2507-FP8

Base model

Qwen/Qwen3-30B-A3B-Thinking-2507

Quantized

(71)

this model