Qwen3-8B INT8 for AWS Inferentia2
INT8-quantized Qwen3-8B compiled and optimized for AWS Inferentia2 (inf2.xlarge).
Model Details
- Base model: Qwen/Qwen3-8B
- Quantization: INT8 per-channel symmetric
- Target hardware: AWS Inferentia2 (
inf2.xlarge, 1 NeuronCore) - tp_degree: 1 (single NeuronCore โ leaves Core 1 free for ASR or other models)
- Max sequence length: 2048
- Compiled with:
neuronx-cc,torch-neuronx,vllm-neuron 0.16
Memory Profile
| Component | Size |
|---|---|
| INT8 checkpoint | ~8GB |
| Neuron DRAM (per core) | ~10GB |
| CPU RAM peak (weight loading) | ~8GB |
Fits comfortably on inf2.xlarge (16GB CPU RAM, 32GB Neuron DRAM) with tp_degree=1, leaving NeuronCore 1 available for a second model (e.g., ASR).
Usage with vLLM
export NEURON_RT_VISIBLE_CORES=0
export NEURON_RT_NUM_CORES=1
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-8b-int8-inf2
vllm serve Qwen/Qwen3-8B \
--max-model-len 2048 \
--tensor-parallel-size 1 \
--block-size 8 \
--port 8000
Key Compile Parameters
from neuronx_distributed_inference.models.qwen3.modeling_qwen3 import (
NeuronQwen3ForCausalLM, Qwen3InferenceConfig, Qwen3NeuronConfig
)
from neuronx_distributed_inference.utils.hf_adapter import load_pretrained_config
neuron_cfg = Qwen3NeuronConfig(
tp_degree=1,
batch_size=1,
n_positions=2048,
seq_len=2048,
quantized=True,
quantization_dtype="int8",
quantization_type="per_channel_symmetric",
quantized_checkpoints_path="./quantized_checkpoints",
)
config = Qwen3InferenceConfig(neuron_cfg, load_config=load_pretrained_config(MODEL_DIR))
NeuronQwen3ForCausalLM.save_quantized_state_dict(MODEL_DIR, config)
model = NeuronQwen3ForCausalLM(MODEL_DIR, config)
model.compile(COMPILED_DIR)
Note:
n_positions=4096triggers a DMA transpose bug inneuronx-cc(attention_cteNKI kernel). Use 2048 max.
Deployment Notes
- Cross-compiled on
r5.4xlargewithNEURON_PLATFORM_TARGET_OVERRIDE=inf2 - Use
NEURON_RT_VISIBLE_CORES=0,NEURON_RT_NUM_CORES=1to pin to Core 0 - Pair with
aqidd/qwen3-asr-1.7b-inf2on Core 1 for dual-model deployment
Related Models
- aqidd/qwen3-14b-int8-inf2 โ 14B, tp_degree=2
- aqidd/qwen3-asr-1.7b-inf2 โ ASR companion model
- Downloads last month
- 21
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐ Ask for provider support