Gemma 4 E2B – Text-Only, 3-Bit (HQQ)
A stripped and quantised version of Google's gemma-4-E2B-it. We removed the vision encoder and audio tower, then compressed the remaining language model to 3 bits per weight using HQQ.
The result: a text-only model that scores 91% on our proprietary "trust me bro" invoice benchmark (n=200, synthetic, no peer review). 2-bit collapses entirely, so 3-bit it is.
Read the writeup: LLM Limbo: Quantising Gemma 4 to Bits and Pieces. It covers the full experiment that produced this model: three model sizes, six quantisation methods, two modalities, and the discovery that the cliff between a working language model and pure noise is exactly one bit wide.
What was done
- Stripped all vision and audio tensors from google/gemma-4-E2B-it (1,411 tensors removed, 600 kept)
- Quantised the language model to 3-bit with HQQ (group_size=64, PyTorch backend)
- Kept the lm_head layer in fp16 (vocabulary projection – quantising this destroys output quality)
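The stripping step amounts to filtering a checkpoint's tensors by name. The sketch below illustrates the idea on a toy dictionary. The prefixes in DROP_PREFIXES and the tensor names in the toy checkpoint are hypothetical placeholders, not the names used in the real Gemma 4 checkpoint.

```python
# Hypothetical name prefixes; the real checkpoint may use different ones.
DROP_PREFIXES = ("model.vision_tower.", "model.audio_tower.")


def strip_modalities(state_dict, drop_prefixes=DROP_PREFIXES):
    """Split a name -> tensor mapping into kept tensors and removed names."""
    kept = {
        name: tensor
        for name, tensor in state_dict.items()
        if not name.startswith(drop_prefixes)
    }
    removed = [name for name in state_dict if name.startswith(drop_prefixes)]
    return kept, removed


# Toy checkpoint: the values stand in for real tensors.
toy_checkpoint = {
    "model.vision_tower.patch_embed.weight": "tensor",
    "model.audio_tower.conv.weight": "tensor",
    "model.language_model.layers.0.mlp.weight": "tensor",
    "lm_head.weight": "tensor",
}

kept, removed = strip_modalities(toy_checkpoint)
print(f"kept {len(kept)} tensors, removed {len(removed)}")
```

On the real checkpoint, the same kind of filter would be expected to report the counts listed above (600 kept, 1,411 removed), provided the prefixes match the actual tensor names.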
Specs
| Property | Value |
|---|---|
| Base model | google/gemma-4-E2B-it (2.3B params) |
| Modalities | Text only (vision + audio removed) |
| Quantisation | HQQ 3-bit, group_size=64 |
| GPU memory | ~6 GB |
| Bits per weight | 3 (effective ~3.27 incl. scales + lm_head) |
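
To make "3-bit, group_size=64" concrete, here is a minimal sketch of group-wise quantisation using plain round-to-nearest. This is not HQQ's algorithm, which additionally optimises the quantisation parameters; it only shows the storage layout: one 3-bit code per weight plus one scale and one zero point per group. The metadata sizes passed to effective_bits are assumptions for illustration. The ~3.27 figure in the table also includes the fp16 lm_head and depends on how HQQ stores its metadata, so this helper is not expected to reproduce it exactly.

```python
def quantise_group(weights, bits=3):
    """Asymmetric round-to-nearest quantisation of one group of weights."""
    levels = (1 << bits) - 1            # 3 bits -> integer codes 0..7
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0   # fall back to 1.0 for a constant group
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def dequantise_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


def effective_bits(bits, group_size, meta_bits_per_group):
    """Bits per weight once per-group metadata is amortised over the group."""
    return bits + meta_bits_per_group / group_size


# One group of 64 evenly spaced toy weights in [-0.5, 0.5].
group = [i / 63 - 0.5 for i in range(64)]
codes, scale, zero = quantise_group(group)
restored = dequantise_group(codes, scale, zero)
max_error = max(abs(a - b) for a, b in zip(group, restored))
print(f"max reconstruction error: {max_error:.4f} (scale = {scale:.4f})")

# Assumed metadata sizes: fp16 scale + fp16 zero point vs. 8 bits each.
print(effective_bits(3, 64, 32), effective_bits(3, 64, 16))
```

With only eight levels per group, the worst-case rounding error is half a quantisation step, which is why dropping one more bit (four levels) costs so much.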
Benchmark: Invoice reading
Tested on 200 synthetic invoices with cent-precise ground truth (mixed number formats, VAT variants, discount structures). 160 invoices have correct arithmetic, 40 have deliberate errors the model should flag.
| Metric | 31B BF16 | E2B BF16 | E2B 3-bit (this) | E2B 2-bit |
|---|---|---|---|---|
| Parse rate | 100% | 100% | 99.5% | 0% |
| Read total correctly | 100% | 100% | 91.0% | 0% |
| Flagged broken invoices (n=40) | 29/40 | 0/40 | ~9/40 | – |
| False flags on correct (n=160) | 30/160 | 3/160 | ~14/160 | – |
| Avg latency | 1.10s (vLLM) | 0.18s (vLLM) | 3.13s (HQQ/PyTorch) | – |
At 2-bit, the model no longer generates coherent output – just random tokens. 3 bits is the minimum viable quantisation for a model of this size. The writeup (LLM Limbo) has the full scoreboard across precisions and model sizes.
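
The flagging rows are easier to compare as rates. The sketch below recomputes recall, false-positive rate and precision from the counts in the table, plus the latency ratio between this model and the vLLM-served E2B BF16 baseline. The 3-bit counts are the approximate values from the table (~9/40 and ~14/160), so its derived rates are approximate too.

```python
def flag_metrics(true_flags, broken_total, false_flags, correct_total):
    """Return (recall, false-positive rate, precision) for error flagging."""
    recall = true_flags / broken_total
    false_positive_rate = false_flags / correct_total
    flagged = true_flags + false_flags
    precision = true_flags / flagged if flagged else float("nan")
    return recall, false_positive_rate, precision


# (flagged broken, broken total, false flags, correct total) from the table.
results = {
    "31B BF16": (29, 40, 30, 160),
    "E2B BF16": (0, 40, 3, 160),
    "E2B 3-bit": (9, 40, 14, 160),  # approximate counts
}

metrics = {name: flag_metrics(*counts) for name, counts in results.items()}
for name, (recall, fpr, precision) in metrics.items():
    print(f"{name}: recall={recall:.1%} fpr={fpr:.1%} precision={precision:.1%}")

# Average latency of this model relative to E2B BF16 on vLLM.
slowdown = 3.13 / 0.18
print(f"slowdown vs vLLM: {slowdown:.1f}x")
```

Read this way, the 31B model catches most broken invoices but also flags nearly one in five correct ones, while the E2B models rarely flag anything at all.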
Why this exists
We ran a series of quantisation experiments to find the practical lower bound for structured document extraction. This model is the result – the smallest configuration that still produces usable output. One step below (2-bit), the model stops generating coherent text entirely.
The reasoning, the methodology, and the surprises along the way are documented in LLM Limbo: Quantising Gemma 4 to Bits and Pieces.
Limitations
- Text only – cannot process images or audio. Feed it OCR output or structured text.
- HQQ format – requires the hqq library to load. Standard from_pretrained will show "UNEXPECTED" tensor warnings and produce garbage.
- No CUDA acceleration – HQQ PyTorch backend dequantises on the fly. Inference is ~17x slower than the BF16 E2B model served with vLLM (3.13s vs 0.18s average latency).
- 91% accuracy – not production-grade. The full-precision model or a 4-bit variant is better for real use.
How to load
# This model uses HQQ's custom tensor format.
# Standard transformers loading will NOT work correctly.
# You need the hqq library.
pip install hqq transformers torch
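
A minimal loading sketch, under stated assumptions: it assumes the checkpoint was saved with HQQ's own save_quantized and can be restored through AutoHQQHFModel.from_quantized, and that the installed hqq version supports this stripped architecture. Neither is confirmed by this card. REPO_ID is a placeholder, not the real repository id.

```python
REPO_ID = "your-namespace/your-hqq-checkpoint"  # placeholder, replace with the real id


def load_hqq_model(repo_id=REPO_ID, device="cuda"):
    """Load an HQQ-quantised checkpoint and its tokenizer (sketch, untested here)."""
    # Imported lazily so this file can be imported without hqq or torch installed.
    import torch
    from hqq.models.hf.base import AutoHQQHFModel
    from transformers import AutoTokenizer

    # Assumption: the checkpoint uses HQQ's save_quantized layout.
    model = AutoHQQHFModel.from_quantized(
        repo_id, compute_dtype=torch.float16, device=device
    )
    tokenizer = AutoTokenizer.from_pretrained(repo_id)
    return model, tokenizer
```

Calling load_hqq_model() downloads the checkpoint and needs a CUDA device by default; pass device="cpu" to try it without a GPU, at a further cost in latency.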
Citation
@misc{jngb-labs-llm-limbo-2026,
title={LLM Limbo: Quantising Gemma 4 to Bits and Pieces},
author={JNGB Labs},
year={2026},
url={https://www.jngb.online/notes/07-llm-limbo}
}