Qwen3-14B INT8 for AWS Inferentia2

INT8-quantized Qwen3-14B compiled and optimized for AWS Inferentia2 (inf2.xlarge).

Model Details

Base model: Qwen/Qwen3-14B
Quantization: INT8 per-channel symmetric
Target hardware: AWS Inferentia2 (inf2.xlarge, 2 NeuronCores)
tp_degree: 2 (both NeuronCores — maximizes throughput)
Max sequence length: 2048
Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16

Memory Profile

Component	Size
INT8 checkpoint	~14GB
Neuron DRAM (total)	~20GB across 2 cores
CPU RAM peak per process	~7GB (14GB / tp_degree=2)

tp_degree=2 splits weight loading across 2 processes (7GB each), keeping each within the 15GB effective CPU RAM on inf2.xlarge.

Usage with vLLM

export NEURON_RT_NUM_CORES=2
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-14b-int8-inf2

vllm serve Qwen/Qwen3-14B \
  --max-model-len 2048 \
  --tensor-parallel-size 2 \
  --block-size 8 \
  --port 8000

Deployment Notes

Uses both NeuronCores — no room for a second model on inf2.xlarge
For dual-model (LLM + ASR), use aqidd/qwen3-8b-int8-inf2 (tp_degree=1) instead
Production-tested at KlinikPintar for Indonesian medical Q&A

Related Models

aqidd/qwen3-8b-int8-inf2 — 8B, tp_degree=1, leaves core for ASR
aqidd/qwen3-asr-1.7b-inf2 — ASR companion

Downloads last month: 76

Inference Providers NEW

This model isn't deployed by any Inference Provider. 🙋 Ask for provider support