Qwen3-8B INT8 for AWS Inferentia2

INT8-quantized Qwen3-8B compiled and optimized for AWS Inferentia2 (inf2.xlarge).

Model Details

  • Base model: Qwen/Qwen3-8B
  • Quantization: INT8 per-channel symmetric
  • Target hardware: AWS Inferentia2 (inf2.xlarge, 1 NeuronCore)
  • tp_degree: 1 (single NeuronCore โ€” leaves Core 1 free for ASR or other models)
  • Max sequence length: 2048
  • Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16

Memory Profile

Component Size
INT8 checkpoint ~8GB
Neuron DRAM (per core) ~10GB
CPU RAM peak (weight loading) ~8GB

Fits comfortably on inf2.xlarge (16GB CPU RAM, 32GB Neuron DRAM) with tp_degree=1, leaving NeuronCore 1 available for a second model (e.g., ASR).

Usage with vLLM

export NEURON_RT_VISIBLE_CORES=0
export NEURON_RT_NUM_CORES=1
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-8b-int8-inf2

vllm serve Qwen/Qwen3-8B \
  --max-model-len 2048 \
  --tensor-parallel-size 1 \
  --block-size 8 \
  --port 8000

Key Compile Parameters

from neuronx_distributed_inference.models.qwen3.modeling_qwen3 import (
    NeuronQwen3ForCausalLM, Qwen3InferenceConfig, Qwen3NeuronConfig
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config

neuron_cfg = Qwen3NeuronConfig(
    tp_degree=1,
    batch_size=1,
    n_positions=2048,
    seq_len=2048,
    quantized=True,
    quantization_dtype="int8",
    quantization_type="per_channel_symmetric",
    quantized_checkpoints_path="./quantized_checkpoints",
)
config = Qwen3InferenceConfig(neuron_cfg, load_config=load_pretrained_config(MODEL_DIR))
NeuronQwen3ForCausalLM.save_quantized_state_dict(MODEL_DIR, config)
model = NeuronQwen3ForCausalLM(MODEL_DIR, config)
model.compile(COMPILED_DIR)

Note: n_positions=4096 triggers a DMA transpose bug in neuronx-cc (attention_cte NKI kernel). Use 2048 max.

Deployment Notes

  • Cross-compiled on r5.4xlarge with NEURON_PLATFORM_TARGET_OVERRIDE=inf2
  • Use NEURON_RT_VISIBLE_CORES=0,NEURON_RT_NUM_CORES=1 to pin to Core 0
  • Pair with aqidd/qwen3-asr-1.7b-inf2 on Core 1 for dual-model deployment

Related Models

Downloads last month
21
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support