Qwen3-8B INT8 for AWS Inferentia2

INT8-quantized Qwen3-8B compiled and optimized for AWS Inferentia2 (inf2.xlarge).

Model Details

Base model: Qwen/Qwen3-8B
Quantization: INT8 per-channel symmetric
Target hardware: AWS Inferentia2 (inf2.xlarge, 1 NeuronCore)
tp_degree: 1 (single NeuronCore — leaves Core 1 free for ASR or other models)
Max sequence length: 2048
Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16

Memory Profile

Component	Size
INT8 checkpoint	~8GB
Neuron DRAM (per core)	~10GB
CPU RAM peak (weight loading)	~8GB

Fits comfortably on inf2.xlarge (16GB CPU RAM, 32GB Neuron DRAM) with tp_degree=1, leaving NeuronCore 1 available for a second model (e.g., ASR).

Usage with vLLM

export NEURON_RT_VISIBLE_CORES=0
export NEURON_RT_NUM_CORES=1
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-8b-int8-inf2

vllm serve Qwen/Qwen3-8B \
  --max-model-len 2048 \
  --tensor-parallel-size 1 \
  --block-size 8 \
  --port 8000

Key Compile Parameters

from neuronx_distributed_inference.models.qwen3.modeling_qwen3 import (
    NeuronQwen3ForCausalLM, Qwen3InferenceConfig, Qwen3NeuronConfig
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

neuron_cfg = Qwen3NeuronConfig(
    tp_degree=1,
    batch_size=1,
    n_positions=2048,
    seq_len=2048,
    quantized=True,
    quantization_dtype="int8",
    quantization_type="per_channel_symmetric",
    quantized_checkpoints_path="./quantized_checkpoints",
)
config = Qwen3InferenceConfig(neuron_cfg, load_config=load_pretrained_config(MODEL_DIR))
NeuronQwen3ForCausalLM.save_quantized_state_dict(MODEL_DIR, config)
model = NeuronQwen3ForCausalLM(MODEL_DIR, config)
model.compile(COMPILED_DIR)

Note: n_positions=4096 triggers a DMA transpose bug in neuronx-cc (attention_cte NKI kernel). Use 2048 max.

Deployment Notes

Cross-compiled on r5.4xlarge with NEURON_PLATFORM_TARGET_OVERRIDE=inf2
Use NEURON_RT_VISIBLE_CORES=0,NEURON_RT_NUM_CORES=1 to pin to Core 0
Pair with aqidd/qwen3-asr-1.7b-inf2 on Core 1 for dual-model deployment

Related Models

aqidd/qwen3-14b-int8-inf2 — 14B, tp_degree=2
aqidd/qwen3-asr-1.7b-inf2 — ASR companion model

Downloads last month: 21

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support