Gemma3-12B INT8 for AWS Inferentia2 (Archived)

⚠️ Deployment Status: NOT VIABLE on inf2.xlarge — archived for reference.

INT8-quantized Gemma3-12B compiled for AWS Inferentia2. Compilation succeeded, but deployment on inf2.xlarge fails: the process is OOM-killed (SIGKILL) while the model is loading.

Why It Fails on inf2.xlarge

The Neuron Runtime DMA-pins weight pages during loading, so they cannot be swapped out (kernel OOM messages report swapents:0). On inf2.xlarge (16GB CPU RAM) the memory budget is:

Component	Size
INT8 checkpoint	~13GB
DMA-pinned during load	~13GB (the checkpoint pages themselves, not an additional 13GB)
OS + Python overhead	~2-3GB
Total needed	~15-16GB
Available	~15GB

Result: the load is OOM-killed every time, regardless of swap size, because DMA-pinned pages cannot be moved to swap.
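The arithmetic behind the table can be sketched in a few lines of Python. This is only an illustration using the approximate figures above; the function name load_headroom_gb is made up for this example and is not part of any Neuron tooling.

```python
# Approximate figures copied from the table above, in GB.
CHECKPOINT_GB = 13.0          # INT8 checkpoint, DMA-pinned while loading
OS_OVERHEAD_GB = (2.0, 3.0)   # OS + Python overhead: low and high estimate
AVAILABLE_GB = 15.0           # usable RAM on inf2.xlarge


def load_headroom_gb(pinned_gb, overhead_gb, available_gb):
    """Return RAM left over during load; a negative value means it cannot fit.

    Swap is deliberately not added to available_gb: pinned pages
    cannot be swapped out, so swap does not relieve the pressure.
    """
    return available_gb - (pinned_gb + overhead_gb)


best_case = load_headroom_gb(CHECKPOINT_GB, OS_OVERHEAD_GB[0], AVAILABLE_GB)
worst_case = load_headroom_gb(CHECKPOINT_GB, OS_OVERHEAD_GB[1], AVAILABLE_GB)
print(f"headroom: {worst_case:+.1f} GB to {best_case:+.1f} GB")
```

With these inputs the headroom is zero at best and about 1GB short at worst, which matches the observed OOM kills.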

Minimum viable options: an inf2.8xlarge instance (128GB CPU RAM), or compiling with tp_degree=2, which would use both NeuronCores and leave none for ASR.
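Before attempting a load on a different instance, a quick preflight check of host memory can save a failed run. The sketch below is an assumption-laden example, not part of the Neuron SDK: it parses the MemAvailable field from Linux /proc/meminfo, and the helper names and the 3GB margin (taken from the upper overhead estimate in the table) are illustrative choices. MemAvailable is itself a kernel estimate, so treat the result as a rough guide.

```python
def mem_available_gb(meminfo_text):
    """Parse MemAvailable from /proc/meminfo-style text (kB, i.e. KiB units)."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            kb = int(line.split()[1])
            return kb / (1024 ** 2)
    raise ValueError("MemAvailable not found")


def can_load(meminfo_text, pinned_gb, margin_gb=3.0):
    """True if available RAM covers the pinned weights plus a safety margin."""
    return mem_available_gb(meminfo_text) >= pinned_gb + margin_gb


def check_host(pinned_gb=13.0, path="/proc/meminfo"):
    """Read the live meminfo file (Linux only) and report whether the load fits."""
    with open(path) as f:
        return can_load(f.read(), pinned_gb)
```

Calling check_host() on the target instance before starting the server returns False on a host with roughly 15GB available and True on one with tens of GB free.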

Architecture-Specific Issues

Gemma3's Gemma3InferenceConfig has a strict __init__ (no **kwargs). The standard NxD JSON save/load cycle breaks because:

neuron_config.json is saved with the full set of HF PretrainedConfig fields
Gemma3InferenceConfig.__init__() rejects any field other than neuron_config and fused_spec_config
The vllm_neuron loader discards the pre-built config and reloads it from the JSON file

Workarounds were developed and documented, but they are moot given the memory wall.
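The failure mode can be reproduced in isolation. The sketch below uses a hypothetical StrictConfig class as a stand-in for a config with a strict __init__; it is not the real Gemma3InferenceConfig, and the field values are illustrative. The filtering helper shows one generic way around such a constructor and is not necessarily the workaround used for this model.

```python
import inspect
import json


class StrictConfig:
    """Hypothetical stand-in for a config class whose __init__ has no **kwargs."""

    def __init__(self, neuron_config, fused_spec_config=None):
        self.neuron_config = neuron_config
        self.fused_spec_config = fused_spec_config


def filter_to_signature(cls, fields):
    """Keep only the keys that cls.__init__ accepts by name."""
    params = inspect.signature(cls.__init__).parameters
    accepted = {name for name in params if name != "self"}
    return {k: v for k, v in fields.items() if k in accepted}


# A saved config mixing the accepted field with extra HF-style fields
# (values are made up for the example).
saved = json.loads(
    '{"neuron_config": {"tp_degree": 1}, "hidden_size": 8, "model_type": "example"}'
)

try:
    StrictConfig(**saved)
    failed = False
except TypeError:
    failed = True  # the extra keys are rejected as unexpected keyword arguments

# Dropping the extra keys before construction lets the load succeed.
config = StrictConfig(**filter_to_signature(StrictConfig, saved))
```

Filtering like this only helps if the loader lets you intercept the reload; a loader that rebuilds the config from JSON internally would need to be patched at that point instead.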

Compile Details

Base model: google/gemma-3-12b-it
tp_degree: 1
n_positions: 2048
Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16
NEFF size: 108MB (unusually small — architecture-specific)

For Production Use

Use aqidd/qwen3-8b-int8-inf2 or aqidd/qwen3-14b-int8-inf2 instead.

Related Models

aqidd/qwen3-8b-int8-inf2 — viable alternative
aqidd/qwen3-14b-int8-inf2 — higher quality
