TheStageAI/gemma-4-E2B-it-qat

A compressed, edge-ready variant of Google's Gemma 4 E2B instruction model, rebuilt from Google's QAT-trained BF16 weights and packaged for edge-lm on Apple Silicon Macs and iPhones. The m checkpoint fits in 1.44 GB; the l checkpoint fits in 1.72 GB and keeps more quality for a small size increase.

Use this repo when artifact size and Apple runtime efficiency matter most. For portable llama.cpp deployment, use the GGUF sibling release.

Why this exists

Gemma 4 Edge models are compact by effective parameter count, but their dense checkpoints are much larger once Per-Layer Embeddings (PLE) are counted. For on-device deployment, the blocking factors are download size, runtime memory footprint, and generation speed.

Google's QAT-trained BF16 checkpoint gives the same production compression pipeline a better starting point. In our measurements, the QAT source reduces weight-only distortion and KL divergence under the same byte budgets, while the shifts on public benchmarks remain smaller than the KL improvement. The native edge-lm format keeps the custom decoder and PLE codecs that make the smallest artifacts possible.

How it was compressed

We use the same production pipeline as the previous Gemma 4 E2B release, with the dense initialization switched from the original BF16 checkpoint to Google's QAT-trained BF16 checkpoint.

  • Transformer blocks - GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as MLX-compatible per-group weight-only tensors.
  • PLE tables - an AQLM-style vector-quantization codec with robust sensitivity-weighted assignments, stored as compact indices and codebooks and decompressed on the fly.
  • Token embeddings / LM head - flat per-group scalar quantization matched to the same runtime contract as the decoder.
  • Bit-width schedule - the production m and l schedules selected by RCO under fixed byte budgets, then requantized from the QAT BF16 source in one consistent pass.
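
To make the "per-group weight-only" idea concrete, here is a minimal sketch of round-to-nearest per-group scalar quantization in plain Python. It is not the GPTQ/QEP pipeline described above and not edge-lm's code: the function names, the min/max affine scheme, and the reading of a tag like w4gs32 as "4-bit weights, group size 32" are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_group(values: List[float], bits: int) -> Tuple[List[int], float, float]:
    """Affine round-to-nearest quantization of one group (illustrative, not GPTQ)."""
    levels = (1 << bits) - 1  # e.g. 15 for 4-bit codes 0..15
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def dequantize_group(codes: List[int], scale: float, lo: float) -> List[float]:
    """Reconstruct approximate weights from integer codes."""
    return [lo + c * scale for c in codes]


def quantize_row(row: List[float], bits: int = 4, group_size: int = 32):
    """Split a weight row into fixed-size groups, each with its own scale and offset."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [
        quantize_group(row[i:i + group_size], bits)
        for i in range(0, len(row), group_size)
    ]


def dequantize_row(groups) -> List[float]:
    """Concatenate the dequantized groups back into one row."""
    out: List[float] = []
    for codes, scale, lo in groups:
        out.extend(dequantize_group(codes, scale, lo))
    return out


# Hypothetical 64-element row: two groups of 32 with very different ranges.
row = [0.01 * i for i in range(32)] + [1.0 * i for i in range(32)]
groups = quantize_row(row, bits=4, group_size=32)
restored = dequantize_row(groups)
max_err = max(abs(a - b) for a, b in zip(row, restored))
print(f"groups: {len(groups)}, max abs error: {max_err:.4f}")
```

Because each group carries its own scale and offset, the small-range group is reconstructed with an error bounded by its own step size rather than by the step size of the large-range group.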

Operating points

This repo ships two operating points, selected with the size argument:

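| size | Trade-off | Artifact size | Compression vs BF16 | Transformer | PLE |
|------|-----------|---------------|---------------------|-------------|-----|
| m | Compact target | 1.44 GB | 7.1x | w3gs32 | robust AQLM |
| l | Higher-quality target | 1.72 GB | 5.9x | w4gs32 | robust AQLM |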

The m checkpoint is the smallest production target. The l checkpoint spends about 280 MB more on decoder precision and recovers a larger share of the BF16 reference quality.
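
The compression ratios and the size gap quoted above follow directly from the artifact sizes. The short sketch below reproduces that arithmetic, taking the 10.21 GB BF16 reference size from the benchmark table; the helper name is hypothetical and decimal units (1 GB = 1000 MB) are assumed.

```python
BF16_GB = 10.21  # BF16 reference size reported in the benchmark table
SIZES_GB = {"m": 1.44, "l": 1.72}


def compression_ratio(size_gb: float, reference_gb: float = BF16_GB) -> float:
    """How many times smaller the artifact is than the BF16 reference."""
    return reference_gb / size_gb


ratios = {name: round(compression_ratio(gb), 1) for name, gb in SIZES_GB.items()}
# Assumes decimal units: 1 GB = 1000 MB.
extra_mb = round((SIZES_GB["l"] - SIZES_GB["m"]) * 1000)

print(ratios)
print(f"l is about {extra_mb} MB larger than m")
```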

Usage

Install edge-lm from source:

git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm

Then load a checkpoint and stream a response in Python:

from edge_lm import load
from mlx_vlm import stream_generate

model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", size="m")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)

for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)

Vision and audio towers can be loaded on demand:

model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_vision=True)  # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_audio=True)   # audio transcription

Only the files required for the requested size and modalities are downloaded.

Benchmarks

Every checkpoint is dequantized to a standard BF16 evaluation path and served through vLLM, so the backend is identical across native and GGUF releases. IFEval p/i reports prompt-level strict / instruction-level strict accuracy, using the corrected public recipe with max_gen_toks=1280.

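| Model | Size | Compression | MMLU-Pro | IFEval p/i |
|-------|------|-------------|----------|------------|
| BF16 reference | 10.21 GB | 1.0x | 61.85 | 75.23 / 82.37 |
| Ours m | 1.44 GB | 7.1x | 47.91 | 75.42 / 83.09 |
| Ours l | 1.72 GB | 5.9x | 54.45 | 76.71 / 83.69 |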

MMLU-Pro scores come from the official checkpoint-wise vLLM route, with Gemma chat formatting and thinking enabled.
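
For a quick read of the table, the following sketch computes each checkpoint's score difference against the BF16 reference and the share of the reference MMLU-Pro score it retains. The numbers are copied from the table; the dictionary keys and function names are hypothetical.

```python
REFERENCE = {"mmlu_pro": 61.85, "ifeval_prompt": 75.23, "ifeval_instruction": 82.37}
CHECKPOINTS = {
    "m": {"mmlu_pro": 47.91, "ifeval_prompt": 75.42, "ifeval_instruction": 83.09},
    "l": {"mmlu_pro": 54.45, "ifeval_prompt": 76.71, "ifeval_instruction": 83.69},
}


def deltas(scores, reference=REFERENCE):
    """Score difference versus BF16 in points (negative means a drop)."""
    return {k: round(scores[k] - reference[k], 2) for k in reference}


def retention(scores, metric, reference=REFERENCE):
    """Fraction of the BF16 score retained on one metric."""
    return scores[metric] / reference[metric]


for name, scores in CHECKPOINTS.items():
    kept = retention(scores, "mmlu_pro")
    print(name, deltas(scores), f"MMLU-Pro retained: {kept:.1%}")
```

The output makes the pattern in the table explicit: MMLU-Pro drops for both checkpoints, less so for l, while both IFEval scores sit at or slightly above the reference.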

Files

| File | Contents |
|------|----------|
| config.json | Shared Gemma 4 architecture config |
| model_m.safetensors, model_l.safetensors | Quantized decoder weights; each file stores its quantization map in metadata |
| ple_m.safetensors, ple_l.safetensors | Compact PLE payloads |
| vision_tower.safetensors | Optional 4-bit vision tower |
| audio_tower.safetensors | Optional 4-bit audio tower |
| tokenizer.json, tokenizer_config.json | Tokenizer files |
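
As a rough guide to what a given load call pulls down, the sketch below maps a size and modality selection to the file names listed above. It is a hypothetical helper, not edge-lm's actual download logic, and it assumes the config and tokenizer files are always fetched.

```python
from typing import List

# Assumption: these files are shared by every size and modality selection.
SHARED_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json"]


def required_files(
    size: str = "m",
    include_vision: bool = False,
    include_audio: bool = False,
) -> List[str]:
    """Hypothetical helper: list repo files for one size and modality selection."""
    if size not in ("m", "l"):
        raise ValueError(f"unknown size {size!r}; expected 'm' or 'l'")
    files = SHARED_FILES + [f"model_{size}.safetensors", f"ple_{size}.safetensors"]
    if include_vision:
        files.append("vision_tower.safetensors")
    if include_audio:
        files.append("audio_tower.safetensors")
    return files


print(required_files("l", include_vision=True))
```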

License

Released under the MIT License. As a derivative of Gemma, the weights are also subject to the Gemma Terms of Use.

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) - see the references in the edge-lm write-up.
