Gemma3-12B INT8 for AWS Inferentia2 (Archived)
⚠️ Deployment Status: NOT VIABLE on inf2.xlarge; archived for reference.
INT8-quantized Gemma3-12B compiled for AWS Inferentia2. Compilation succeeded, but deployment on inf2.xlarge fails with OOM (SIGKILL) at model loading time.
Why It Fails on inf2.xlarge
The Neuron Runtime DMA-pins weight pages during loading, so they cannot be swapped out
(swapents:0 in kernel OOM messages). On inf2.xlarge (16GB CPU RAM):
| Component | Size |
|---|---|
| INT8 checkpoint | ~13GB |
| DMA-pinned during load | ~13GB |
| OS + Python overhead | ~2-3GB |
| Total needed | ~15-16GB |
| Available | ~15GB |
Result: OOM kill every time, regardless of swap size. Swap is irrelevant for DMA-pinned pages.
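The arithmetic behind the table can be sketched as a small headroom check. The figures are the approximate ones from the table; the function name `headroom_gb` is illustrative and not part of any Neuron tooling.

```python
def headroom_gb(pinned_gb: float, overhead_gb: float, available_gb: float) -> float:
    """Return host RAM left over after pinned weights and process overhead.

    Swap is deliberately not a parameter: DMA-pinned pages cannot be
    swapped out, so swap space does not add usable headroom here.
    """
    return available_gb - (pinned_gb + overhead_gb)


# Approximate figures from the table above (inf2.xlarge).
PINNED_GB = 13.0          # INT8 checkpoint, DMA-pinned during load
OVERHEAD_LOW_GB = 2.0     # OS + Python overhead, optimistic
OVERHEAD_HIGH_GB = 3.0    # OS + Python overhead, pessimistic
AVAILABLE_GB = 15.0       # usable host RAM

best_case = headroom_gb(PINNED_GB, OVERHEAD_LOW_GB, AVAILABLE_GB)
worst_case = headroom_gb(PINNED_GB, OVERHEAD_HIGH_GB, AVAILABLE_GB)
print(f"headroom: best {best_case:+.1f} GB, worst {worst_case:+.1f} GB")
```

Even the optimistic estimate leaves no headroom, and the pessimistic one is about 1GB short, which is consistent with the observed OOM kills.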
Minimum viable option: inf2.8xlarge (128 GiB CPU RAM per the AWS instance specification), or tp_degree=2
(which would use both NeuronCores, leaving none for ASR).
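Before attempting a load on a given instance, a preflight check against host RAM can save a failed run. The following is a minimal sketch, assuming a Linux host with `/proc/meminfo`. The helper names, the sample text, and the 16 GiB threshold (the upper estimate from the table, treating GB and GiB as interchangeable for simplicity) are assumptions for illustration.

```python
def mem_total_gib(meminfo_text: str) -> float:
    """Parse the MemTotal line of /proc/meminfo-style text and return GiB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kib = int(line.split()[1])  # /proc/meminfo reports this value in kB
            return kib / (1024 ** 2)
    raise ValueError("MemTotal line not found")


def can_load(meminfo_text: str, required_gib: float) -> bool:
    """Return True if total host RAM meets the required amount."""
    return mem_total_gib(meminfo_text) >= required_gib


def read_host_meminfo(path: str = "/proc/meminfo") -> str:
    """Read the real meminfo file (Linux only)."""
    with open(path) as fh:
        return fh.read()


# Illustrative sample, not captured from a real inf2.xlarge.
SAMPLE_MEMINFO = "MemTotal:       16252928 kB\nMemFree:         1000000 kB\n"
REQUIRED_GIB = 16.0  # assumption: upper estimate of "Total needed" above

print(mem_total_gib(SAMPLE_MEMINFO))           # 15.5
print(can_load(SAMPLE_MEMINFO, REQUIRED_GIB))  # False
```

On a real host, `can_load(read_host_meminfo(), REQUIRED_GIB)` performs the same check. MemTotal is used here for simplicity; MemAvailable would give a stricter estimate.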
Architecture-Specific Issues
Gemma3's Gemma3InferenceConfig has a strict __init__ (no **kwargs).
The standard NxD json save/load cycle breaks because:
- neuron_config.json is saved with full HF PretrainedConfig fields
- Gemma3InferenceConfig.__init__() rejects any field other than neuron_config and fused_spec_config
- vllm_neuron loader discards pre-built config and re-loads from json
Workarounds were developed and documented but ultimately moot due to the memory wall.
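The failure mode can be reproduced with a stand-in class. This is a sketch only: `StrictConfig` and `load_strict` are hypothetical names, the extra JSON fields are placeholders, and the key-filtering approach shown is one possible workaround, not necessarily the one that was used here.

```python
import json


class StrictConfig:
    """Hypothetical stand-in for a config class with a strict __init__ (no **kwargs)."""

    def __init__(self, neuron_config=None, fused_spec_config=None):
        self.neuron_config = neuron_config
        self.fused_spec_config = fused_spec_config


# The only keyword arguments the strict constructor accepts.
ACCEPTED_KEYS = {"neuron_config", "fused_spec_config"}


def load_strict(saved: dict) -> StrictConfig:
    """Drop keys the strict constructor would reject, then build the config."""
    filtered = {k: v for k, v in saved.items() if k in ACCEPTED_KEYS}
    return StrictConfig(**filtered)


# A saved payload mixing in PretrainedConfig-style fields (placeholder values).
saved = json.loads(
    '{"neuron_config": {"tp_degree": 1, "n_positions": 2048},'
    ' "model_type": "placeholder", "extra_field": 0}'
)

try:
    StrictConfig(**saved)  # naive round trip: unexpected keyword arguments
except TypeError as exc:
    print("naive load fails:", exc)

cfg = load_strict(saved)
print(cfg.neuron_config)
```

Filtering at load time only helps if the loader calls your code path; a loader that re-reads the JSON itself would need the file to be written in the filtered form.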
Compile Details
- Base model: google/gemma-3-12b-it
- tp_degree: 1
- n_positions: 2048
- Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16
- NEFF size: 108MB (unusually small; architecture-specific)
For Production Use
Use aqidd/qwen3-8b-int8-inf2 or aqidd/qwen3-14b-int8-inf2 instead.
Related Models
- aqidd/qwen3-8b-int8-inf2: viable alternative
- aqidd/qwen3-14b-int8-inf2: higher quality