Qwen3-14B INT8 for AWS Inferentia2

INT8-quantized Qwen3-14B compiled and optimized for AWS Inferentia2 (inf2.xlarge).

Model Details

  • Base model: Qwen/Qwen3-14B
  • Quantization: INT8 per-channel symmetric
  • Target hardware: AWS Inferentia2 (inf2.xlarge, 2 NeuronCores)
  • tp_degree: 2 (both NeuronCores โ€” maximizes throughput)
  • Max sequence length: 2048
  • Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16

Memory Profile

Component Size
INT8 checkpoint ~14GB
Neuron DRAM (total) ~20GB across 2 cores
CPU RAM peak per process ~7GB (14GB / tp_degree=2)

tp_degree=2 splits weight loading across 2 processes (7GB each), keeping each within the 15GB effective CPU RAM on inf2.xlarge.

Usage with vLLM

export NEURON_RT_NUM_CORES=2
export NEURON_COMPILED_ARTIFACTS=/path/to/qwen3-14b-int8-inf2

vllm serve Qwen/Qwen3-14B \
  --max-model-len 2048 \
  --tensor-parallel-size 2 \
  --block-size 8 \
  --port 8000

Deployment Notes

  • Uses both NeuronCores โ€” no room for a second model on inf2.xlarge
  • For dual-model (LLM + ASR), use aqidd/qwen3-8b-int8-inf2 (tp_degree=1) instead
  • Production-tested at KlinikPintar for Indonesian medical Q&A

Related Models

Downloads last month
76
Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support