Qwen3-14B INT8 for AWS Inferentia2
INT8-quantized Qwen3-14B compiled and optimized for AWS Inferentia2 (inf2.xlarge).
Model Details
- Base model: Qwen/Qwen3-14B
- Quantization: INT8 per-channel symmetric
- Target hardware: AWS Inferentia2 (
inf2.xlarge, 2 NeuronCores) - tp_degree: 2 (both NeuronCores โ maximizes throughput)
- Max sequence length: 2048
- Compiled with:
neuronx-cc,torch-neuronx,vllm-neuron 0.16
Memory Profile
| Component | Size |
|---|---|
| INT8 checkpoint | ~14GB |
| Neuron DRAM (total) | ~20GB across 2 cores |
| CPU RAM peak per process | ~7GB (14GB / tp_degree=2) |
tp_degree=2 splits weight loading across 2 processes (7GB each), keeping each within the 15GB effective CPU RAM on inf2.xlarge.
Usage with vLLM
export NEURON_RT_NUM_CORES=2
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-14b-int8-inf2
vllm serve Qwen/Qwen3-14B \
--max-model-len 2048 \
--tensor-parallel-size 2 \
--block-size 8 \
--port 8000
Deployment Notes
- Uses both NeuronCores โ no room for a second model on inf2.xlarge
- For dual-model (LLM + ASR), use
aqidd/qwen3-8b-int8-inf2(tp_degree=1) instead - Production-tested at KlinikPintar for Indonesian medical Q&A
Related Models
- aqidd/qwen3-8b-int8-inf2 โ 8B, tp_degree=1, leaves core for ASR
- aqidd/qwen3-asr-1.7b-inf2 โ ASR companion
- Downloads last month
- 76
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐ Ask for provider support