TheStageAI/gemma-4-E2B-it-qat

A compressed, edge-ready variant of Google's Gemma 4 E2B instruction model, rebuilt from Google's QAT-trained BF16 weights and packaged for edge-lm on Apple Silicon Macs and iPhones. The m checkpoint fits in 1.44 GB; the l checkpoint fits in 1.72 GB and keeps more quality for a small size increase.

Run it with: TheStageAI/edge-lm
Compression source: google/gemma-4-E2B-it-qat-q4_0-unquantized
BF16 reference: google/gemma-4-E2B-it
GGUF release: TheStageAI/gemma-4-E2B-it-qat-GGUF

Use this repo when artifact size and Apple runtime efficiency matter most. For portable llama.cpp deployment, use the GGUF sibling release.

Why this exists

Gemma 4 Edge models are compact by effective parameter count, but their dense checkpoints are much larger once Per-Layer Embeddings (PLE) are counted. For on-device deployment, the blocking factors are download size, runtime memory footprint, and generation speed.

Google's QAT-trained BF16 checkpoint gives the same production compression pipeline a better starting point. In our measurements, the QAT source lowers weight-only distortion and KL divergence under the same byte budgets, although the resulting changes on public benchmarks are smaller than the KL improvement. The native edge-lm format keeps the custom decoder and PLE codecs that make the smallest artifacts possible.

How it was compressed

We use the same production pipeline as the previous Gemma 4 E2B release, with the dense initialization switched from the original BF16 checkpoint to Google's QAT-trained BF16 checkpoint.

Transformer blocks - GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as MLX-compatible per-group weight-only tensors.
PLE tables - an AQLM-style vector-quantization codec with robust sensitivity-weighted assignments, stored as compact indices and codebooks and decompressed on the fly.
Token embeddings / LM head - flat per-group scalar quantization matched to the same runtime contract as the decoder.
Bit-width schedule - the production m and l schedules selected by RCO under fixed byte budgets, then requantized from the QAT BF16 source in one consistent pass.
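
To make the per-group weight-only format concrete, here is a minimal sketch of plain round-to-nearest quantization over fixed-size groups. It is not the GPTQ/QEP pipeline described above, which additionally compensates for rounding error and clips ranges. The function names are hypothetical, and the sketch assumes that a label such as w4gs32 means 4-bit codes with a group size of 32, and that each group stores one 16-bit scale and one 16-bit offset.

```python
# Illustrative round-to-nearest per-group quantization.
# NOT the GPTQ/QEP pipeline; all names here are hypothetical.

def quantize_groups(weights, bits=4, group_size=32):
    """Quantize a flat list of floats group by group.

    Returns integer codes plus one (scale, minimum) pair per group.
    """
    levels = (1 << bits) - 1  # number of steps, e.g. 15 for 4-bit codes
    codes, scales, mins = [], [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        lo, hi = min(group), max(group)
        # A constant group has no range; any non-zero scale works.
        scale = (hi - lo) / levels if hi > lo else 1.0
        codes.extend(round((w - lo) / scale) for w in group)
        scales.append(scale)
        mins.append(lo)
    return codes, scales, mins


def dequantize_groups(codes, scales, mins, group_size=32):
    """Reconstruct approximate weights from codes and per-group parameters."""
    return [
        mins[i // group_size] + code * scales[i // group_size]
        for i, code in enumerate(codes)
    ]


def bits_per_weight(bits, group_size, scale_bits=16, offset_bits=16):
    """Effective storage cost, assuming one scale and one offset per group."""
    return bits + (scale_bits + offset_bits) / group_size


weights = [i / 10 for i in range(-32, 32)]  # 64 toy values, two groups of 32
codes, scales, mins = quantize_groups(weights, bits=4, group_size=32)
restored = dequantize_groups(codes, scales, mins, group_size=32)
```

Under these assumptions the per-group metadata adds one bit per weight at group size 32, so `bits_per_weight(4, 32)` is 5.0 and `bits_per_weight(3, 32)` is 4.0. Smaller groups track local weight ranges more closely but pay more for scales and offsets.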

Operating points

This repo ships two operating points, selected with the size argument:

`size`	Trade-off	Artifact size	Compression vs BF16	Transformer	PLE
`m`	Compact target	1.44 GB	7.1x	w3gs32	robust AQLM
`l`	Higher-quality target	1.72 GB	5.9x	w4gs32	robust AQLM

The m checkpoint is the smallest production target. The l checkpoint spends about 280 MB more on decoder precision and recovers a larger share of the BF16 reference quality.
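
The compression ratios and the size gap follow directly from the artifact sizes and the 10.21 GB BF16 reference listed under Benchmarks. The sketch below reproduces that arithmetic; it assumes decimal units (1 GB = 1000 MB), and the variable and function names are illustrative only.

```python
# Reproduce the size arithmetic from the operating-points table.
# Assumes decimal units: 1 GB = 1000 MB.

BF16_GB = 10.21  # BF16 reference size from the Benchmarks table
ARTIFACTS_GB = {"m": 1.44, "l": 1.72}


def compression_ratio(artifact_gb, reference_gb=BF16_GB):
    """How many times smaller the artifact is than the BF16 reference."""
    return reference_gb / artifact_gb


for size, gb in ARTIFACTS_GB.items():
    print(f"{size}: {gb} GB, {compression_ratio(gb):.1f}x vs BF16")

# Extra bytes the l checkpoint spends compared with m.
extra_mb = (ARTIFACTS_GB["l"] - ARTIFACTS_GB["m"]) * 1000
print(f"l is about {extra_mb:.0f} MB larger than m")
```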

Usage

git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm

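from edge_lm import load
from mlx_vlm import stream_generate

# Load the compact "m" checkpoint; pass size="l" for the higher-quality one.
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", size="m")

# Render the chat as a prompt string that ends with the generation prompt.
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)

# Print text chunks as they are generated.
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)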

Vision and audio towers can be loaded on demand:

model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_vision=True)  # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_audio=True)   # audio transcription

Only the files required for the requested size and modalities are downloaded.

Benchmarks

Every checkpoint is dequantized to a standard BF16 evaluation path and served through vLLM, so the backend is equalized across native and GGUF releases. IFEval p/i means prompt strict / instruction strict, using the corrected public recipe with max_gen_toks=1280.

Model	Size	Compression	MMLU-Pro	IFEval p/i
BF16 reference	10.21 GB	1.0x	61.85	75.23 / 82.37
Ours `m`	1.44 GB	7.1x	47.91	75.42 / 83.09
Ours `l`	1.72 GB	5.9x	54.45	76.71 / 83.69

MMLU-Pro scores come from the official checkpoint-wise vLLM route, with Gemma chat formatting and thinking enabled.
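
One way to read the table is as the share of the BF16 reference's MMLU-Pro score that each operating point retains. The sketch below derives that figure from the numbers above; the retention metric and the names are illustrative and are not part of the evaluation recipe.

```python
# Share of the BF16 reference MMLU-Pro score retained by each checkpoint.
# Scores are copied from the benchmark table; the metric is illustrative.

MMLU_PRO = {"BF16 reference": 61.85, "m": 47.91, "l": 54.45}


def retention(score, reference):
    """Score as a percentage of the reference score."""
    return 100.0 * score / reference


reference = MMLU_PRO["BF16 reference"]
for name in ("m", "l"):
    pct = retention(MMLU_PRO[name], reference)
    print(f"{name}: {pct:.1f}% of BF16 MMLU-Pro")
```

By this measure m retains about 77.5% and l about 88.0% of the reference MMLU-Pro score, while both IFEval columns are at or slightly above the reference.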

Files

File	Contents
`config.json`	Shared Gemma 4 architecture config
`model_m.safetensors`, `model_l.safetensors`	Quantized decoder weights; each file stores its quantization map in metadata
`ple_m.safetensors`, `ple_l.safetensors`	Compact PLE payloads
`vision_tower.safetensors`	Optional 4-bit vision tower
`audio_tower.safetensors`	Optional 4-bit audio tower
`tokenizer.json`, `tokenizer_config.json`	Tokenizer files
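
Since only the files needed for the requested size and modalities are downloaded, the table above implies a simple mapping from load arguments to filenames. The helper below is a hypothetical sketch of that mapping, not edge-lm's actual download logic; the real loader may fetch additional files or resolve them differently.

```python
# Hypothetical mapping from load() arguments to the files listed above.
# This is NOT edge-lm's implementation, only an illustration of the table.

def required_files(size="m", include_vision=False, include_audio=False):
    """Return the filenames a given size and modality selection would need."""
    if size not in ("m", "l"):
        raise ValueError(f"unknown size {size!r}; expected 'm' or 'l'")
    files = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        f"model_{size}.safetensors",  # quantized decoder weights
        f"ple_{size}.safetensors",    # compact PLE payload
    ]
    if include_vision:
        files.append("vision_tower.safetensors")
    if include_audio:
        files.append("audio_tower.safetensors")
    return files


print(required_files("l", include_vision=True))
```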

License

Released under the MIT License. As a derivative of Gemma, the weights are also subject to the Gemma Terms of Use.

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) - see the references in the edge-lm write-up.
