TheStageAI/gemma-4-E2B-it

A compressed, edge-ready variant of Google's Gemma 4 E2B (instruction-tuned), packaged for MLX on Apple Silicon Macs and iPhones. The checkpoint is roughly 7× smaller than the original (fits in ~1.4 GB) while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.

Run it with: TheStageAI/edge-lm
Base model: google/gemma-4-E2B-it
Sibling release: TheStageAI/gemma-4-E4B-it
Write-up: 7× size reduction for Gemma 4 Edge models — Compressing PLE architectures.

Why this exists

Gemma 4 E2B is a "2B" model by effective parameter count, but the dense checkpoint is closer to 5.1B parameters once Per-Layer Embeddings (PLE) are counted — and in BF16 the PLE table alone is ~4.7 GB. On mobile hardware, three things block deployment: download size, runtime memory footprint (iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural structure to address all three at once.
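The size figures above follow from simple byte arithmetic. The sketch below reproduces them, assuming 2 bytes per BF16 parameter and decimal gigabytes (1 GB = 10^9 bytes); the variable names are illustrative and not part of any library.

```python
# Back-of-the-envelope footprint check using only the figures quoted above.
# Assumptions: BF16 = 2 bytes per parameter, 1 GB = 1e9 bytes.
BYTES_PER_BF16 = 2
GB = 1e9

dense_params = 5.1e9      # dense parameter count, PLE included
ple_bf16_gb = 4.7         # PLE table size in BF16
ios_budget_gb = 3.0       # approximate per-app memory budget on iOS

dense_bf16_gb = dense_params * BYTES_PER_BF16 / GB
ple_params = ple_bf16_gb * GB / BYTES_PER_BF16
ple_share = ple_bf16_gb / dense_bf16_gb

print(f"Dense BF16 checkpoint: {dense_bf16_gb:.1f} GB")
print(f"PLE parameters:        {ple_params / 1e9:.2f}B")
print(f"PLE share of bytes:    {ple_share:.0%}")
print(f"Over the iOS budget by {dense_bf16_gb / ios_budget_gb:.1f}x")
```

Under these assumptions the PLE table accounts for a little under half of the dense checkpoint, which is why it gets its own codec rather than sharing the scheme used for the transformer blocks.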

How it was compressed

Transformer blocks — GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as flat, MLX-compatible per-group weight-only tensors.
PLE tables — an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style) assignments. The 4.7 GB BF16 PLE compresses to **0.26 GB** (indices + codebooks), decompressed on the fly with a single batched gather across all layers.
Token embeddings / LM head — flat per-group scalar quantization matched to the same runtime contract.
Bit-width schedule — chosen per module by Riemannian Constrained Optimization (RCO) under an exact byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.
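To make the PLE codec more concrete, here is a minimal pure-Python sketch of additive vector-quantization decoding in the AQLM style: each group of weights is stored as a few small integer codes, and reconstruction sums the selected codewords. This is not the release codec. The function names, codebook count, codebook size, and group dimension are all assumptions chosen for illustration, and the sensitivity-weighted assignment step is not shown.

```python
import math

def decode_groups(codes, codebooks):
    """Reconstruct weight groups from additive VQ codes.

    codes[i][m]     -> index picked for group i in codebook m
    codebooks[m][k] -> k-th codeword (list of floats) of codebook m
    """
    dim = len(codebooks[0][0])
    decoded = []
    for group_codes in codes:
        vec = [0.0] * dim
        for m, k in enumerate(group_codes):
            # Gather one codeword per codebook and accumulate it.
            for j, value in enumerate(codebooks[m][k]):
                vec[j] += value
        decoded.append(vec)
    return decoded

def index_bits_per_weight(num_codebooks, codebook_size, group_dim):
    """Bits spent on indices per weight (codebook storage not included)."""
    return num_codebooks * math.log2(codebook_size) / group_dim

# Toy example: 2 codebooks of 2 codewords each, groups of 2 weights.
toy_codebooks = [
    [[1.0, 0.0], [0.0, 1.0]],
    [[0.5, 0.5], [-0.5, -0.5]],
]
toy_codes = [[0, 1], [1, 0]]
toy_decoded = decode_groups(toy_codes, toy_codebooks)

# Effective rate implied by the figures above: 0.26 GB out of 4.7 GB at 16 bits.
effective_bits = 0.26 / 4.7 * 16
```

The quoted sizes work out to a little under one bit per PLE weight, indices and codebooks combined. A hypothetical layout of one 256-entry codebook over groups of 8 weights spends exactly 1 index bit per weight, so the real configuration has to be somewhat more aggressive than that.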

Operating points

This repo ships two release operating points, selected via the `size` argument:

| `size` | Trade-off | Compression |
| --- | --- | --- |
| `l` | Higher quality, larger artifact | 5.62× |
| `m` | Smaller artifact, the headline size target (default) | 6.40× |

It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.

Usage

```bash
git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm
```

```python
from edge_lm import load
from mlx_vlm import stream_generate

model, tokenizer = load("TheStageAI/gemma-4-E2B-it", size="m")

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False, add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
```
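Since generation speed is one of the deployment constraints, it can be useful to time the stream. The helper below is a hypothetical, library-agnostic sketch: `StreamTimer` is not part of edge-lm or mlx_vlm, it only wraps an iterable and records time to first chunk and overall chunk rate.

```python
import time
from typing import Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")

class StreamTimer:
    """Pass chunks through unchanged while recording simple timing stats."""

    def __init__(self) -> None:
        self.count = 0
        self.first_chunk_s: Optional[float] = None
        self.total_s = 0.0

    def wrap(self, chunks: Iterable[T]) -> Iterator[T]:
        start = time.perf_counter()
        for chunk in chunks:
            if self.first_chunk_s is None:
                # Latency until the producer yields its first chunk.
                self.first_chunk_s = time.perf_counter() - start
            self.count += 1
            yield chunk
        # Includes time the consumer spends between chunks (e.g. printing).
        self.total_s = time.perf_counter() - start

    @property
    def chunks_per_second(self) -> float:
        return self.count / self.total_s if self.total_s > 0 else 0.0

# Stand-in stream so the sketch runs without a model.
timer = StreamTimer()
collected = list(timer.wrap(["Gravity ", "pulls ", "mass ", "together."]))
```

With a loaded model, the same wrapper would go around the generator, for example `for chunk in timer.wrap(stream_generate(model, tokenizer, prompt, max_tokens=128))`. A chunk is not guaranteed to be exactly one token, so treat the rate as approximate.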

Vision and audio (loads the optional towers):

```python
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_vision=True)   # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_audio=True)    # audio transcription
```

Only the files needed for the requested size are downloaded.

Benchmarks

Every model, ours and the GGUF baselines alike, is dequantized to a standard BF16 checkpoint and served through vLLM, so all results come from the same inference backend. We report MMLU-Pro (general knowledge), IFEval (instruction following), and τ²-Bench / Tau2 (multi-step tool use). For Tau2 the Gemma checkpoint acts as the agent while a fixed Qwen3-235B-A22B-2507 simulates the user.

| Model | Compression | MMLU-Pro | IFEval | Tau2 (avg of 3) |
| --- | --- | --- | --- | --- |
| BF16 (reference) | 1.00× | 61.85 | 74.68 | 30.67 |
| Ours L | 5.62× | **54.48** | **74.86** | 22.20 |
| Ours M | 6.40× | 49.85 | 71.53 | **23.45** |
| Unsloth Q3-K-S | 3.81× | 48.20 | 64.51 | 18.69 |
| Unsloth UD-Q2-K-XL | 3.87× | 43.17 | 66.54 | 20.23 |

Bold marks the best result among the compressed checkpoints in each benchmark column.
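One way to read the table is as the fraction of the BF16 score each compressed checkpoint retains. The sketch below computes that from the numbers in the table; the dictionary layout and the `retention` helper are illustrative only.

```python
# Scores copied from the benchmark table above.
REFERENCE = {"MMLU-Pro": 61.85, "IFEval": 74.68, "Tau2": 30.67}
COMPRESSED = {
    "Ours L":             {"MMLU-Pro": 54.48, "IFEval": 74.86, "Tau2": 22.20},
    "Ours M":             {"MMLU-Pro": 49.85, "IFEval": 71.53, "Tau2": 23.45},
    "Unsloth Q3-K-S":     {"MMLU-Pro": 48.20, "IFEval": 64.51, "Tau2": 18.69},
    "Unsloth UD-Q2-K-XL": {"MMLU-Pro": 43.17, "IFEval": 66.54, "Tau2": 20.23},
}

def retention(scores, reference):
    """Percentage of the reference score kept on each benchmark."""
    return {name: 100.0 * scores[name] / reference[name] for name in reference}

RETENTION = {model: retention(s, REFERENCE) for model, s in COMPRESSED.items()}

for model, kept in RETENTION.items():
    cells = ", ".join(f"{bench} {pct:.1f}%" for bench, pct in kept.items())
    print(f"{model}: {cells}")
```

Retention above 100% on a benchmark (as for Ours L on IFEval) means the compressed score is slightly higher than the reference, which at these margins is more plausibly evaluation noise than a real improvement.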

Files

| File | Contents |
| --- | --- |
| `config.json` | Shared model config (architecture) |
| `model_{s,m,l}.safetensors` | Quantized decoder weights per operating point (quantization map in metadata) |
| `ple_{s,m,l}.safetensors` | Compact AQLM PLE codes + codebooks |
| `vision_tower.safetensors` | Optional 4-bit vision tower |
| `audio_tower.safetensors` | Optional 4-bit audio tower |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer |

License

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) — see the references in the edge-lm write-up.