TheStageAI/gemma-4-E2B-it-qat
A compressed, edge-ready variant of Google's Gemma 4 E2B instruction model, rebuilt from
Google's QAT-trained BF16 weights and packaged for
edge-lm on Apple Silicon Macs and iPhones. The m
checkpoint fits in 1.44 GB; the l checkpoint fits in 1.72 GB and keeps more quality for a
small size increase.
- Run it with: TheStageAI/edge-lm
- Compression source: google/gemma-4-E2B-it-qat-q4_0-unquantized
- BF16 reference: google/gemma-4-E2B-it
- GGUF release: TheStageAI/gemma-4-E2B-it-qat-GGUF
Use this repo when artifact size and Apple runtime efficiency matter most. For portable llama.cpp deployment, use the GGUF sibling release.
Why this exists
Gemma 4 Edge models are compact by effective parameter count, but their dense checkpoints are much larger once Per-Layer Embeddings (PLE) are counted. For on-device deployment, the blocking factors are download size, runtime memory footprint, and generation speed.
Google's QAT-trained BF16 checkpoint gives the same production compression pipeline a better starting
point. In our measurements, the QAT source improves weight-only distortion and KL under the same
byte budgets, while public benchmark deltas remain smaller than the KL movement. The native
edge-lm format keeps the custom decoder and PLE codecs that make the smallest artifacts possible.
How it was compressed
We use the same production pipeline as the previous Gemma 4 E2B release, with the dense initialization switched from the original BF16 checkpoint to Google's QAT-trained BF16 checkpoint.
- Transformer blocks - GPTQ with Quantization Error Propagation (QEP) and range clipping, emitted as MLX-compatible per-group weight-only tensors.
- PLE tables - an AQLM-style vector-quantization codec with robust sensitivity-weighted assignments, stored as compact indices and codebooks and decompressed on the fly.
- Token embeddings / LM head - flat per-group scalar quantization matched to the same runtime contract as the decoder.
- Bit-width schedule - the production m and l schedules selected by RCO under fixed byte budgets, then requantized from the QAT BF16 source in one consistent pass.
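To make the "per-group weight-only" storage format concrete, here is a minimal sketch of plain round-to-nearest affine quantization with one scale and one bias per group. It is not the GPTQ/QEP pipeline described above, which chooses codes far more carefully; it only illustrates what a layout such as w3gs32 (3-bit codes, groups of 32 weights) stores. All function names are hypothetical, and the 16-bit scale plus 16-bit bias per group is an assumption about the metadata overhead.

```python
import math


def quantize_group(values, bits):
    """Round-to-nearest affine quantization of one group of floats."""
    levels = (1 << bits) - 1              # e.g. 7 for 3-bit codes
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((v - lo) / scale))) for v in values]
    return codes, scale, lo               # lo acts as the per-group bias


def quantize_per_group(weights, bits=3, group_size=32):
    """Split a flat weight list into groups and quantize each one."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    return [
        quantize_group(weights[start:start + group_size], bits)
        for start in range(0, len(weights), group_size)
    ]


def dequantize_per_group(groups):
    """Reconstruct approximate weights from (codes, scale, bias) groups."""
    restored = []
    for codes, scale, bias in groups:
        restored.extend(code * scale + bias for code in codes)
    return restored


def effective_bits_per_weight(bits, group_size, metadata_bits=32):
    """Code bits plus amortized per-group metadata.

    Assumption: one 16-bit scale and one 16-bit bias per group.
    """
    return bits + metadata_bits / group_size


# Toy input: 64 synthetic weights, i.e. two groups of 32.
weights = [math.sin(0.37 * i) for i in range(64)]
groups = quantize_per_group(weights, bits=3, group_size=32)
restored = dequantize_per_group(groups)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"max reconstruction error: {max_error:.4f}")
print(f"w3gs32 effective bits/weight: {effective_bits_per_weight(3, 32)}")
```

Under these assumptions each group's reconstruction error is bounded by half of its scale, which is why smaller groups or more bits reduce distortion at the cost of bytes.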
Operating points
This repo ships two operating points, selected with the size argument:
| size | Trade-off | Artifact size | Compression vs BF16 | Transformer | PLE |
|---|---|---|---|---|---|
| m | Compact target | 1.44 GB | 7.1x | w3gs32 | robust AQLM |
| l | Higher-quality target | 1.72 GB | 5.9x | w4gs32 | robust AQLM |
The m checkpoint is the smallest production target. The l checkpoint spends about 280 MB more
on decoder precision and recovers a larger share of the BF16 reference quality.
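The compression ratios and the size gap between the two operating points follow directly from the artifact sizes and the 10.21 GB BF16 reference size reported in the benchmark table. A short check, assuming decimal units (1 GB = 1000 MB); the helper name is hypothetical:

```python
BF16_REFERENCE_GB = 10.21                 # BF16 reference size from the benchmark table
ARTIFACT_GB = {"m": 1.44, "l": 1.72}      # artifact sizes from the table above


def compression_ratio(size_gb, reference_gb=BF16_REFERENCE_GB):
    """How many times smaller the artifact is than the BF16 reference."""
    return reference_gb / size_gb


for name, size_gb in ARTIFACT_GB.items():
    print(f"{name}: {compression_ratio(size_gb):.1f}x smaller than BF16")

# Extra bytes the l checkpoint spends over m, assuming 1 GB = 1000 MB.
extra_mb = (ARTIFACT_GB["l"] - ARTIFACT_GB["m"]) * 1000
print(f"l - m = {extra_mb:.0f} MB")
```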
Usage
git clone https://github.com/TheStageAI/edge-lm.git
pip install -e edge-lm
from edge_lm import load
from mlx_vlm import stream_generate
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", size="m")
prompt = tokenizer.apply_chat_template(
    [{"role": "user", "content": "Explain gravity in one sentence."}],
    tokenize=False,
    add_generation_prompt=True,
)
for chunk in stream_generate(model, tokenizer, prompt, max_tokens=128):
    print(chunk.text, end="", flush=True)
Vision and audio towers can be loaded on demand:
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_vision=True) # image understanding
model, tokenizer = load("TheStageAI/gemma-4-E2B-it-qat", include_audio=True) # audio transcription
Only the files required for the requested size and modalities are downloaded.
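As a rough illustration of that selective download, the sketch below maps a size and modality flags to the file names listed in the Files table. `files_for` and `fetch` are hypothetical helpers, not edge-lm's actual loader logic; the only real library call is `huggingface_hub.snapshot_download` with its `allow_patterns` argument, which restricts a snapshot to matching files.

```python
REPO_ID = "TheStageAI/gemma-4-E2B-it-qat"

# Files every configuration needs, per the Files table.
SHARED_FILES = ["config.json", "tokenizer.json", "tokenizer_config.json"]


def files_for(size="m", include_vision=False, include_audio=False):
    """Return the file names one size/modality combination would need.

    Hypothetical helper: the real loader may resolve files differently.
    """
    if size not in ("m", "l"):
        raise ValueError("size must be 'm' or 'l'")
    files = SHARED_FILES + [f"model_{size}.safetensors", f"ple_{size}.safetensors"]
    if include_vision:
        files.append("vision_tower.safetensors")
    if include_audio:
        files.append("audio_tower.safetensors")
    return files


def fetch(size="m", include_vision=False, include_audio=False):
    """Download only the selected files; requires huggingface_hub and network access."""
    from huggingface_hub import snapshot_download  # imported lazily on purpose

    return snapshot_download(
        repo_id=REPO_ID,
        allow_patterns=files_for(size, include_vision, include_audio),
    )


print(files_for("l", include_vision=True))
```

Calling `fetch("m")` would then skip the l checkpoint and both towers, assuming the repository layout matches the Files table.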
Benchmarks
Every checkpoint is dequantized to a standard BF16 evaluation path and served through vLLM, so the
backend is the same for native and GGUF releases. IFEval p/i reports prompt-level strict /
instruction-level strict accuracy, using the corrected public recipe with max_gen_toks=1280.
| Model | Size | Compression | MMLU-Pro | IFEval p/i |
|---|---|---|---|---|
| BF16 reference | 10.21 GB | 1.0x | 61.85 | 75.23 / 82.37 |
| Ours m | 1.44 GB | 7.1x | 47.91 | 75.42 / 83.09 |
| Ours l | 1.72 GB | 5.9x | 54.45 | 76.71 / 83.69 |
MMLU-Pro is the official checkpoint-wise vLLM route with Gemma chat formatting and thinking enabled.
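One way to read the table is as the share of the BF16 reference score each checkpoint retains. The sketch below computes that from the reported numbers only; "retention" is our own illustrative framing rather than a metric defined by the release.

```python
BF16 = {"MMLU-Pro": 61.85, "IFEval prompt": 75.23, "IFEval instruction": 82.37}
CHECKPOINTS = {
    "m": {"MMLU-Pro": 47.91, "IFEval prompt": 75.42, "IFEval instruction": 83.09},
    "l": {"MMLU-Pro": 54.45, "IFEval prompt": 76.71, "IFEval instruction": 83.69},
}


def retention(score, reference):
    """Score as a percentage of the BF16 reference score."""
    return 100.0 * score / reference


for name, scores in CHECKPOINTS.items():
    for benchmark, score in scores.items():
        kept = retention(score, BF16[benchmark])
        print(f"{name} {benchmark}: {kept:.1f}% of BF16")
```

On these figures the loss is concentrated in MMLU-Pro, while both checkpoints match or slightly exceed the reference on IFEval.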
Files
| File | Contents |
|---|---|
| config.json | Shared Gemma 4 architecture config |
| model_m.safetensors, model_l.safetensors | Quantized decoder weights; each file stores its quantization map in metadata |
| ple_m.safetensors, ple_l.safetensors | Compact PLE payloads |
| vision_tower.safetensors | Optional 4-bit vision tower |
| audio_tower.safetensors | Optional 4-bit audio tower |
| tokenizer.json, tokenizer_config.json | Tokenizer files |
License
Released under the MIT License. As a derivative of Gemma, the weights are also subject to the Gemma Terms of Use.
Citation
If you use these checkpoints, please cite the Gemma 4 release and the methods we build on
(GPTQ, QEP, AQLM, RCO) - see the references in the
edge-lm write-up.