TheStageAI/gemma-4-E4B-it

A compressed, edge-ready variant of Google's Gemma 4 E4B (instruction-tuned), packaged for MLX on Apple Silicon Macs and iPhones. The checkpoint fits in ~2.6 GB — small enough to download quickly and stay within mobile memory budgets — while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.

Run it with: TheStageAI/edge-lm
Base model: google/gemma-4-E4B-it
Sibling release: TheStageAI/gemma-4-E2B-it
Write-up: 7× size reduction for Gemma 4 Edge models — Compressing PLE architectures.

Why this exists

Gemma 4 E4B is a "4B" model by effective parameter count, but the dense checkpoint is closer to 8B parameters once Per-Layer Embeddings (PLE) are counted — and in BF16 the PLE table dominates the footprint. On mobile hardware, three things block deployment: download size, runtime memory footprint (iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural structure to address all three at once.

How it was compressed

Transformer blocks — GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as flat, MLX-compatible per-group weight-only tensors.
PLE tables — an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style) assignments, decompressed on the fly with a single batched gather across all layers.
Token embeddings / LM head — flat per-group scalar quantization matched to the same runtime contract.
Bit-width schedule — chosen per module by Riemannian Constrained Optimization (RCO) under an exact byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.
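To make the "flat per-group weight-only" storage format concrete, here is a minimal pure-Python sketch of per-group scalar quantization. It uses plain round-to-nearest rather than GPTQ or QEP, and the function names, group size, and bit-width are illustrative assumptions, not values taken from the release.

```python
from typing import List, Tuple

# One quantized group: integer codes plus the scale and offset needed to decode them.
Group = Tuple[List[int], float, float]


def quantize_group(values: List[float], bits: int) -> Group:
    """Asymmetric round-to-nearest quantization of one group (not GPTQ)."""
    levels = (1 << bits) - 1              # highest integer code, e.g. 15 for 4 bits
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo


def quantize_row(row: List[float], bits: int = 4, group_size: int = 4) -> List[Group]:
    """Split a weight row into fixed-size groups and quantize each independently."""
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    return [quantize_group(row[i:i + group_size], bits)
            for i in range(0, len(row), group_size)]


def dequantize_row(groups: List[Group]) -> List[float]:
    """Rebuild approximate weights from codes, scales, and offsets."""
    out: List[float] = []
    for codes, scale, lo in groups:
        out.extend(lo + c * scale for c in codes)
    return out


row = [0.0, 0.1, 0.2, 0.3, -1.0, 0.0, 1.0, 2.0]
groups = quantize_row(row, bits=4, group_size=4)
restored = dequantize_row(groups)
worst = max(abs(a - b) for a, b in zip(row, restored))
print(f"{len(groups)} groups, worst absolute error {worst:.4f}")
```

Because each group carries its own scale and offset, a group with a narrow value range is reconstructed more finely than one with a wide range. GPTQ keeps this storage layout but chooses the codes to minimize layer output error instead of rounding each weight independently.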

Operating points

This repo ships two release operating points, selected via the `size` argument of `load`:

| `size` | Trade-off | Compression |
| --- | --- | --- |
| `l` | More quality, larger artifact | 4.64× |
| `m` | Smaller headline target (default) | 5.60× |
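One way to read these ratios is as an average storage cost per parameter. The sketch below assumes the ratio is measured against a 16-bit (BF16) copy of the same tensors; the card does not spell out exactly which tensors the baseline covers, so treat the result as a rough average that folds in scales, codebooks, and other metadata.

```python
BF16_BITS = 16  # bits per parameter in the dense reference (assumed baseline)


def effective_bits(compression_ratio: float, dense_bits: int = BF16_BITS) -> float:
    """Average stored bits per parameter implied by a compression ratio."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return dense_bits / compression_ratio


# Ratios copied from the operating-point table.
for size, ratio in {"l": 4.64, "m": 5.60}.items():
    print(f"size={size}: about {effective_bits(ratio):.2f} bits per parameter")
```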

It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.

Usage

git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm

from edge_lm import load
from mlx_vlm import stream_generate

model, tokenizer = load("TheStageAI/gemma-4-E4B-it", size="l")  # use "m" for the smaller target

prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False, add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)

Vision and audio (loads the optional towers):

model, tokenizer = load("TheStageAI/gemma-4-E4B-it", include_vision=True)   # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E4B-it", include_audio=True)    # audio transcription

Only the files needed for the requested size are downloaded.
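For readers who want to fetch the files by hand, the selective download can be approximated with `huggingface_hub.snapshot_download` and its `allow_patterns` filter. This is a sketch, not the edge-lm loader's actual logic: `files_for` and `fetch` are hypothetical helpers, and the file names are taken from the Files table in this card.

```python
from typing import List

VALID_SIZES = ("m", "l")  # the two release operating points described in this card


def files_for(size: str = "m",
              include_vision: bool = False,
              include_audio: bool = False) -> List[str]:
    """Hypothetical helper: repo files needed for one operating point."""
    if size not in VALID_SIZES:
        raise ValueError(f"size must be one of {VALID_SIZES}, got {size!r}")
    files = [
        "config.json",
        "tokenizer.json",
        "tokenizer_config.json",
        f"model_{size}.safetensors",
        f"ple_{size}.safetensors",
    ]
    if include_vision:
        files.append("vision_tower.safetensors")
    if include_audio:
        files.append("audio_tower.safetensors")
    return files


def fetch(repo_id: str, **kwargs) -> str:
    """Download only the selected files and return the local snapshot path."""
    # Imported lazily so files_for() works without huggingface_hub installed.
    from huggingface_hub import snapshot_download
    return snapshot_download(repo_id=repo_id, allow_patterns=files_for(**kwargs))


print(files_for("l", include_vision=True))
```

Calling `fetch("TheStageAI/gemma-4-E4B-it", size="m")` would then pull only the text-only files for the smaller operating point.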

Benchmarks

Every model, ours and the GGUF baselines alike, is dequantized to a standard BF16 checkpoint and served through vLLM, so all results come from the same inference backend. We report MMLU-Pro (general knowledge), IFEval (instruction following), and τ²-Bench, listed as Tau2 (multi-step tool use). For Tau2, the Gemma checkpoint acts as the agent while a fixed Qwen3-235B-A22B-2507 simulates the user.

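| Model | Compression | MMLU-Pro | IFEval | Tau2 |
| --- | --- | --- | --- | --- |
| BF16 (reference) | 1.00× | 70.49 | 81.33 | 37.19 |
| Ours L | 4.64× | **67.41** | **81.52** | **33.25** |
| Ours M | **5.60×** | 63.54 | 80.78 | 29.04 |
| Unsloth Q3-K-S | 3.90× | 63.66 | 77.08 | 30.47 |
| Unsloth UD-Q2-K-XL | 4.01× | 58.69 | 79.67 | 22.91 |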

Bold marks the best result among the compressed checkpoints in each column.
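To compare checkpoints at different compression ratios, it can help to express each score as a percentage of the BF16 reference. The short script below does that arithmetic on the numbers from the table; the `retained` helper is introduced here for illustration only.

```python
from typing import Dict

# Scores copied from the benchmark table.
BASELINE: Dict[str, float] = {"MMLU-Pro": 70.49, "IFEval": 81.33, "Tau2": 37.19}
RESULTS = {
    "Ours L":             (4.64, {"MMLU-Pro": 67.41, "IFEval": 81.52, "Tau2": 33.25}),
    "Ours M":             (5.60, {"MMLU-Pro": 63.54, "IFEval": 80.78, "Tau2": 29.04}),
    "Unsloth Q3-K-S":     (3.90, {"MMLU-Pro": 63.66, "IFEval": 77.08, "Tau2": 30.47}),
    "Unsloth UD-Q2-K-XL": (4.01, {"MMLU-Pro": 58.69, "IFEval": 79.67, "Tau2": 22.91}),
}


def retained(scores: Dict[str, float],
             baseline: Dict[str, float] = BASELINE) -> Dict[str, float]:
    """Each score as a percentage of the BF16 reference score."""
    return {name: 100.0 * scores[name] / baseline[name] for name in baseline}


for model, (ratio, scores) in RESULTS.items():
    cells = "  ".join(f"{k} {v:5.1f}%" for k, v in retained(scores).items())
    print(f"{model:<20} {ratio:.2f}x  {cells}")
```

A value above 100% means the compressed checkpoint scored higher than the reference on that benchmark run.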

Files

| File | Contents |
| --- | --- |
| `config.json` | Shared model config (architecture) |
| `model_{s,m,l}.safetensors` | Quantized decoder weights per operating point (quantization map in metadata) |
| `ple_{s,m,l}.safetensors` | Compact AQLM PLE codes + codebooks |
| `vision_tower.safetensors` | Optional 4-bit vision tower |
| `audio_tower.safetensors` | Optional 4-bit audio tower |
| `tokenizer.json`, `tokenizer_config.json` | Tokenizer |
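As a rough illustration of what "codes + codebooks" means for the PLE files, the sketch below decodes vectors in the additive style that AQLM uses: each vector is stored as one index per codebook and reconstructed as the sum of the selected codewords. The tensor layout inside the actual safetensors files is not documented here, so the shapes, names, and toy values are assumptions, and per-group scales are omitted.

```python
from typing import List, Sequence

Vector = List[float]


def decode_row(codes: Sequence[int], codebooks: Sequence[Sequence[Vector]]) -> Vector:
    """Reconstruct one vector as the sum of one codeword from each codebook."""
    if len(codes) != len(codebooks):
        raise ValueError("expected exactly one code per codebook")
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for code, book in zip(codes, codebooks):
        word = book[code]                 # a gather: look the codeword up by index
        for j in range(dim):
            out[j] += word[j]
    return out


def decode_table(all_codes: Sequence[Sequence[int]],
                 codebooks: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Decode every row; a real runtime would batch this into a single gather."""
    return [decode_row(codes, codebooks) for codes in all_codes]


# Toy example: two codebooks over 2-dimensional vectors.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],   # 4 entries, 2 bits per code
    [[0.0, 0.0], [0.5, -0.5]],                          # 2 entries, 1 bit per code
]
codes = [[3, 1], [0, 0], [1, 1]]
print(decode_table(codes, codebooks))
```

In this toy setup each 2-dimensional vector costs 3 bits of codes plus a share of the codebooks, which is how vector quantization can go below one bit per stored value when vectors are longer and codebooks are shared across many rows.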

License

Citation

If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) — see the references in the edge-lm write-up.