Gemma3-12B INT8 for AWS Inferentia2 (Archived)

⚠️ Deployment Status: NOT VIABLE on inf2.xlarge. Archived for reference.

INT8-quantized Gemma3-12B compiled for AWS Inferentia2. Compilation succeeded, but deployment on inf2.xlarge fails with OOM (SIGKILL) at model loading time.

Why It Fails on inf2.xlarge

The Neuron Runtime DMA-pins weight pages during loading, so they cannot be swapped out (kernel OOM messages show swapents:0). On inf2.xlarge (16GB CPU RAM):

Component                 Size
INT8 checkpoint           ~13GB
DMA-pinned during load    ~13GB
OS + Python overhead      ~2-3GB
Total needed              ~15-16GB
Available                 ~15GB

Result: OOM kill every time, regardless of swap size. Swap is irrelevant for DMA-pinned pages.
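
The budget in the table can be checked with simple arithmetic. The sketch below assumes the checkpoint row and the DMA-pinned row describe the same ~13GB counted once, since that is the reading under which the stated total of ~15-16GB follows. The function names are illustrative and not part of any Neuron tooling.

```python
# Rough host-RAM budget for loading the INT8 checkpoint on inf2.xlarge.
# All figures are the approximate values from the table above.
# Assumption: the ~13GB checkpoint and the ~13GB DMA-pinned region are the
# same memory counted once, which is what makes the total ~15-16GB.

PINNED_WEIGHTS_GB = 13.0
OVERHEAD_GB_RANGE = (2.0, 3.0)   # OS + Python overhead
AVAILABLE_GB = 15.0              # usable RAM on the 16GB host


def required_ram_gb(pinned_gb, overhead_range):
    """Return (low, high) estimates of the RAM needed during load."""
    low, high = overhead_range
    return pinned_gb + low, pinned_gb + high


def headroom_gb(available_gb, required_range):
    """Return (best_case, worst_case) headroom; negative means a shortfall."""
    low, high = required_range
    return available_gb - low, available_gb - high


need = required_ram_gb(PINNED_WEIGHTS_GB, OVERHEAD_GB_RANGE)
best, worst = headroom_gb(AVAILABLE_GB, need)
print(f"needed: {need[0]:.0f}-{need[1]:.0f}GB, available: {AVAILABLE_GB:.0f}GB")
print(f"headroom: best case {best:+.0f}GB, worst case {worst:+.0f}GB")
```

Under these figures even the best case leaves no headroom, and because the pinned pages cannot be swapped, adding swap does not change the result.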

Minimum viable instance: inf2.8xlarge (128GB CPU RAM). The alternative on inf2.xlarge is tp_degree=2, which would use both NeuronCores and leave none for ASR.
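
A host can be screened before attempting a load by comparing its physical RAM against the estimated requirement. This is a minimal sketch, assuming a Linux host with /proc/meminfo and using the upper end of the budget above (~16GB) as the threshold. Both helper names are hypothetical, and swap is deliberately ignored because pinned pages cannot use it.

```python
import os


def parse_mem_total_gb(meminfo_text):
    """Extract MemTotal from /proc/meminfo-style text and return it in GB."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemTotal:"):
            kb = int(line.split()[1])   # the kernel reports this value in kB
            return kb / (1024 ** 2)
    raise ValueError("MemTotal not found")


def host_can_load(mem_total_gb, required_gb=16.0):
    """True if physical RAM exceeds the estimated load-time requirement."""
    return mem_total_gb > required_gb


# Only runs on hosts that expose /proc/meminfo (Linux).
if os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        total = parse_mem_total_gb(f.read())
    print(f"host RAM: {total:.1f}GB, can load: {host_can_load(total)}")
```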

Architecture-Specific Issues

Gemma3's Gemma3InferenceConfig has a strict __init__ (no **kwargs). The standard NxD JSON save/load cycle breaks because:

  1. neuron_config.json is saved with the full set of HF PretrainedConfig fields
  2. Gemma3InferenceConfig.__init__() rejects any field other than neuron_config and fused_spec_config
  3. The vllm_neuron loader discards the pre-built config and reloads it from JSON

Workarounds were developed and documented but ultimately moot due to the memory wall.
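
The failure mode can be reproduced in isolation. The sketch below uses StrictConfig, a hypothetical stand-in class that accepts only the two parameters named above; it is not the real Gemma3InferenceConfig, and the filtering helper is one generic approach, not necessarily the workaround that was developed here. The extra field names in the saved dictionary are illustrative.

```python
import inspect


class StrictConfig:
    """Hypothetical stand-in for a config class whose __init__ has no **kwargs."""

    def __init__(self, neuron_config, fused_spec_config=None):
        self.neuron_config = neuron_config
        self.fused_spec_config = fused_spec_config


def accepted_kwargs(cls, fields):
    """Keep only the keys that cls.__init__ declares as parameters."""
    params = inspect.signature(cls.__init__).parameters
    return {k: v for k, v in fields.items() if k in params and k != "self"}


# Simulated contents of a saved config: the two accepted fields plus
# extra HF-style fields (names and values are illustrative).
saved = {
    "neuron_config": {"tp_degree": 1, "n_positions": 2048},
    "fused_spec_config": None,
    "model_type": "gemma3",
    "torch_dtype": "bfloat16",
}

# Passing every saved field to the strict constructor raises TypeError.
try:
    StrictConfig(**saved)
except TypeError as exc:
    print("strict load failed:", exc)

# Filtering to the declared parameters lets construction succeed.
cfg = StrictConfig(**accepted_kwargs(StrictConfig, saved))
print("filtered load ok:", cfg.neuron_config)
```

Filtering silently drops the extra fields, so it only helps if nothing downstream needs them.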

Compile Details

  • Base model: google/gemma-3-12b-it
  • tp_degree: 1
  • n_positions: 2048
  • Compiled with: neuronx-cc, torch-neuronx, vllm-neuron 0.16
  • NEFF size: 108MB (unusually small; architecture-specific)

For Production Use

Use aqidd/qwen3-8b-int8-inf2 or aqidd/qwen3-14b-int8-inf2 instead.

Related Models
