Instructions to use TheStageAI/gemma-4-E2B-it with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use TheStageAI/gemma-4-E2B-it with MLX:
# Make sure mlx-vlm is installed # pip install --upgrade mlx-vlm from mlx_vlm import load, generate from mlx_vlm.prompt_utils import apply_chat_template from mlx_vlm.utils import load_config # Load the model model, processor = load("TheStageAI/gemma-4-E2B-it") config = load_config("TheStageAI/gemma-4-E2B-it") # Prepare input image = ["http://images.cocodataset.org/val2017/000000039769.jpg"] prompt = "Describe this image." # Apply chat template formatted_prompt = apply_chat_template( processor, config, prompt, num_images=1 ) # Generate output output = generate(model, processor, formatted_prompt, image) print(output) - Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
TheStageAI/gemma-4-E2B-it
A compressed, edge-ready variant of Google's Gemma 4 E2B (instruction-tuned), packaged for MLX on Apple Silicon Macs and iPhones. The checkpoint is roughly 7ร smaller than the original (fits in ~1.4 GB) while preserving the capabilities that matter most for on-device assistants: general world knowledge, instruction following, and tool use.
- Run it with:
TheStageAI/edge-lm - Base model:
google/gemma-4-E2B-it - Sibling release:
TheStageAI/gemma-4-E4B-it - Write-up: 7ร size reduction for Gemma 4 Edge models โ Compressing PLE architectures.
Why this exists
Gemma 4 E2B is a "2B" model by effective parameter count, but the dense checkpoint is closer to 5.1B parameters once Per-Layer Embeddings (PLE) are counted โ and in BF16 the PLE table alone is ~4.7 GB. On mobile hardware, three things block deployment: download size, runtime memory footprint (iOS enforces a ~3 GB per-app budget), and generation speed. We compress the model along its natural structure to address all three at once.
How it was compressed
- Transformer blocks โ GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as flat, MLX-compatible per-group weight-only tensors.
- PLE tables โ an AQLM-style vector-quantization codec with sensitivity-weighted (Fisher-style)
assignments. The
4.7 GB BF16 PLE compresses to **0.26 GB** (indices + codebooks), decompressed on the fly with a single batched gather across all layers. - Token embeddings / LM head โ flat per-group scalar quantization matched to the same runtime contract.
- Bit-width schedule โ chosen per module by Riemannian Constrained Optimization (RCO) under an exact byte budget; the release checkpoint is re-quantized from the dense model in one consistent GPTQ/QEP pass.
Operating points
This repo ships two release operating points, selected via the size argument:
size |
Trade-off | Compression |
|---|---|---|
l |
More quality, larger artifact | 5.62ร |
m |
Smaller headline target (default) | 6.40ร |
It also includes optional 4-bit vision and audio towers for image understanding and audio transcription.
Usage
git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm
from edge_lm import load
from mlx_vlm import stream_generate
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", size="m")
prompt = tokenizer.apply_chat_template(
[{"role": "user", "content": "Explain gravity in one sentence."}],
tokenize=False, add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
print(chunk.text, end="", flush=True)
Vision and audio (loads the optional towers):
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_vision=True) # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it", include_audio=True) # audio transcription
Only the files needed for the requested size are downloaded.
Benchmarks
Every model โ ours and the GGUF baselines โ is dequantized to a standard BF16 checkpoint and served
through vLLM, so the backend is equalized. We report MMLU-Pro (general knowledge), IFEval
(instruction following), and ฯยฒ-Bench / Tau2 (multi-step tool use). For Tau2 the Gemma checkpoint
acts as the agent while a fixed Qwen3-235B-A22B-2507 simulates the user.
| Model | Compression | MMLU-Pro | IFEval | Tau2 (avg of 3) |
|---|---|---|---|---|
| BF16 (reference) | 1.00ร | 61.85 | 74.68 | 30.67 |
| Ours L | 5.62ร | 54.48 | 74.86 | 22.20 |
| Ours M | 6.40ร | 49.85 | 71.53 | 23.45 |
| Unsloth Q3-K-S | 3.81ร | 48.20 | 64.51 | 18.69 |
| Unsloth UD-Q2-K-XL | 3.87ร | 43.17 | 66.54 | 20.23 |
Bold marks the best result among the compressed checkpoints in each column.
Files
| File | Contents |
|---|---|
config.json |
Shared model config (architecture) |
model_{s,m,l}.safetensors |
Quantized decoder weights per operating point (quantization map in metadata) |
ple_{s,m,l}.safetensors |
Compact AQLM PLE codes + codebooks |
vision_tower.safetensors |
Optional 4-bit vision tower |
audio_tower.safetensors |
Optional 4-bit audio tower |
tokenizer.json, tokenizer_config.json |
Tokenizer |
License
Released under the MIT License, ยฉ 2025 thestage.ai labs. As a derivative of Google's Gemma 4, the weights are additionally subject to the Gemma Terms of Use.
Citation
If you use these checkpoints, please cite the Gemma 4 release and the methods we build on (GPTQ, QEP, AQLM, RCO) โ see the references in the edge-lm write-up.
- Downloads last month
- 202
Quantized