Gemma 4 E2B – Text-Only, 3-Bit (HQQ)

A stripped and quantised version of Google's gemma-4-E2B-it. We removed the vision encoder and audio tower, then compressed the remaining language model to 3 bits per weight using HQQ.

The result: a text-only model that reads the invoice total correctly on 91% of our proprietary "trust me bro" invoice benchmark (n=200, synthetic, no peer review). At 2-bit the model collapses entirely, so 3-bit it is.

Read the writeup: LLM Limbo: Quantising Gemma 4 to Bits and Pieces. It covers the full experiment that produced this model: three model sizes, six quantisation methods, two modalities, and the discovery that the cliff between a working language model and pure noise is exactly one bit wide.

What was done

Stripped all vision and audio tensors from google/gemma-4-E2B-it (1,411 tensors removed, 600 kept)
Quantised the language model to 3-bit with HQQ (group_size=64, PyTorch backend)
Kept the lm_head layer in fp16 (vocabulary projection – quantising this destroys output quality)
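
The stripping step amounts to a name-based filter over the checkpoint's tensors. A minimal sketch, assuming the vision and audio weights can be identified by name prefix; the prefixes and tensor names below are hypothetical placeholders, not the actual names in google/gemma-4-E2B-it:

```python
from typing import Iterable, List, Tuple

# Hypothetical prefixes: inspect the real checkpoint to find the actual ones.
DROP_PREFIXES = ("model.vision_tower.", "model.audio_tower.")


def split_tensor_names(
    names: Iterable[str], drop_prefixes: Tuple[str, ...] = DROP_PREFIXES
) -> Tuple[List[str], List[str]]:
    """Partition tensor names into (kept, removed) by prefix match."""
    kept, removed = [], []
    for name in names:
        # str.startswith accepts a tuple and matches any of its prefixes.
        if name.startswith(tuple(drop_prefixes)):
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Toy checkpoint listing (illustrative names only).
example_names = [
    "model.language_model.layers.0.self_attn.q_proj.weight",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight",
    "model.audio_tower.conformer.0.attn.q_proj.weight",
    "lm_head.weight",
]
kept, removed = split_tensor_names(example_names)
print(f"{len(kept)} kept, {len(removed)} removed")
```

On the real checkpoint, the same partition would be expected to yield the 600 kept and 1,411 removed tensors reported above, provided the prefixes are chosen correctly.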

Specs

Property	Value
Base model	google/gemma-4-E2B-it (2.3B params)
Modalities	Text only (vision + audio removed)
Quantisation	HQQ 3-bit, group_size=64
GPU memory	~6 GB
Bits per weight	3 (effective ~3.27 incl. scales + lm_head)
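
The gap between 3 bits and the effective figure comes from per-group metadata: every group of 64 weights carries its own scale and zero-point. The sketch below shows that arithmetic only. This card does not state how the scales and zero-points are stored, so the metadata widths are assumptions, and the result is not expected to reproduce ~3.27 exactly (that figure also folds in the fp16 lm_head).

```python
def effective_bits_per_weight(
    nbits: int, group_size: int, scale_bits: int, zero_bits: int
) -> float:
    """Payload bits plus per-group metadata amortised over the group."""
    return nbits + (scale_bits + zero_bits) / group_size


# Assumption: one fp16 scale and one fp16 zero-point per group of 64 weights.
fp16_meta = effective_bits_per_weight(3, 64, scale_bits=16, zero_bits=16)

# Assumption: scale and zero-point themselves stored in 8 bits each.
int8_meta = effective_bits_per_weight(3, 64, scale_bits=8, zero_bits=8)

print(f"fp16 metadata: {fp16_meta} bits/weight")
print(f"8-bit metadata: {int8_meta} bits/weight")
```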

Benchmark: Invoice reading

Tested on 200 synthetic invoices with cent-precise ground truth (mixed number formats, VAT variants, discount structures). 160 invoices have correct arithmetic, 40 have deliberate errors the model should flag.
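
For illustration, the four metrics in the table below could be computed along these lines. This is a sketch rather than the actual benchmark harness; the record keys (`true_total`, `pred_total`, `is_broken`, `flagged`) are hypothetical. Totals are compared as `Decimal` values so that the check stays cent-precise instead of relying on float equality.

```python
from decimal import Decimal


def score(records):
    """Summarise benchmark results.

    Each record is a dict with hypothetical keys:
      true_total: ground-truth total as a decimal string
      pred_total: model's total as a decimal string, or None if the
                  output could not be parsed
      is_broken:  True if the invoice contains a deliberate error
      flagged:    True if the model flagged the invoice as wrong
    """
    n = len(records)
    parsed = [r for r in records if r["pred_total"] is not None]
    # Decimal compares numerically, so "84.5" equals "84.50".
    correct = [
        r for r in parsed
        if Decimal(r["pred_total"]) == Decimal(r["true_total"])
    ]
    broken = [r for r in records if r["is_broken"]]
    clean = [r for r in records if not r["is_broken"]]
    return {
        "parse_rate": len(parsed) / n,
        "total_accuracy": len(correct) / n,
        "flagged_broken": sum(r["flagged"] for r in broken),
        "false_flags": sum(r["flagged"] for r in clean),
    }


# Toy data, not taken from the real benchmark.
example_records = [
    {"true_total": "1190.00", "pred_total": "1190.00", "is_broken": False, "flagged": False},
    {"true_total": "59.90", "pred_total": "59.09", "is_broken": False, "flagged": True},
    {"true_total": "250.00", "pred_total": None, "is_broken": True, "flagged": False},
    {"true_total": "84.50", "pred_total": "84.5", "is_broken": True, "flagged": True},
]
summary = score(example_records)
print(summary)
```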

Metric	31B BF16	E2B BF16	E2B 3-bit (this)	E2B 2-bit
Parse rate	100%	100%	99.5%	0%
Read total correctly	100%	100%	91.0%	0%
Flagged broken invoices (n=40)	29/40	0/40	~9/40	–
False flags on correct (n=160)	30/160	3/160	~14/160	–
Avg latency	1.10s (vLLM)	0.18s (vLLM)	3.13s (HQQ/PyTorch)	–
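
The raw counts are easier to compare as rates. The sketch below derives recall on broken invoices, the false-flag rate on correct ones, and the latency ratio from the table's figures. The 3-bit counts are marked approximate in the table, so the derived rates for that column are approximate as well.

```python
N_BROKEN, N_CORRECT = 40, 160

# Figures copied from the table above ("~" values taken at face value).
results = {
    "31B BF16": {"flagged": 29, "false_flags": 30, "latency_s": 1.10},
    "E2B BF16": {"flagged": 0, "false_flags": 3, "latency_s": 0.18},
    "E2B 3-bit": {"flagged": 9, "false_flags": 14, "latency_s": 3.13},
}


def derive(row):
    """Turn raw counts into rates for one table column."""
    return {
        "recall": row["flagged"] / N_BROKEN,
        "false_flag_rate": row["false_flags"] / N_CORRECT,
    }


for name, row in results.items():
    d = derive(row)
    print(f"{name}: recall={d['recall']:.1%}, "
          f"false-flag rate={d['false_flag_rate']:.1%}")

# Latency of this model relative to the BF16 E2B baseline served with vLLM.
slowdown = results["E2B 3-bit"]["latency_s"] / results["E2B BF16"]["latency_s"]
print(f"3-bit slowdown vs E2B BF16: {slowdown:.1f}x")
```

The latency ratio is where the ~17x figure under Limitations comes from.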

At 2-bit, the model no longer generates coherent output – just random tokens. In our tests, 3 bits was the minimum viable HQQ quantisation for this model. The writeup (LLM Limbo) has the full scoreboard across precisions and model sizes.
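
To get a feel for what one bit means here: each group of 64 weights shares a small grid of values, with 4 levels at 2-bit and 8 at 3-bit. The sketch below uses plain asymmetric round-to-nearest quantisation on toy weights. It is not HQQ's optimisation procedure and does not reproduce the collapse described above; it only shows how the grid and the worst-case rounding error scale with the bit width.

```python
import math


def quantize_group(weights, nbits):
    """Asymmetric round-to-nearest quantisation of one group.

    Returns (codes, scale, zero) with integer codes in [0, 2**nbits - 1].
    """
    top = 2 ** nbits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / top if hi > lo else 1.0
    codes = [min(top, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


def max_roundtrip_error(weights, nbits, group_size=64):
    """Largest absolute error after quantise + dequantise, group by group."""
    worst = 0.0
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, zero = quantize_group(group, nbits)
        restored = dequantize_group(codes, scale, zero)
        worst = max(worst, max(abs(a - b) for a, b in zip(group, restored)))
    return worst


# Deterministic toy weights in [-0.05, 0.05]; not real model weights.
toy_weights = [0.05 * math.sin(0.37 * i) for i in range(256)]

for nbits in (2, 3, 4):
    err = max_roundtrip_error(toy_weights, nbits)
    print(f"{nbits}-bit: {2 ** nbits} levels per group, max error {err:.5f}")
```

The rounding error per weight is bounded by half the grid step, and the step roughly doubles with every bit removed.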

Why this exists

We ran a series of quantisation experiments to find the practical lower bound for structured document extraction. This model is the result – the smallest configuration that still produces usable output. One step below (2-bit), the model stops generating coherent text entirely.

The reasoning, the methodology, and the surprises along the way are documented in LLM Limbo: Quantising Gemma 4 to Bits and Pieces.

Limitations

Text only – cannot process images or audio. Feed it OCR output or structured text.
HQQ format – requires the hqq library to load. Standard from_pretrained will show "UNEXPECTED" tensor warnings and produce garbage.
No CUDA acceleration – HQQ PyTorch backend dequantises on the fly. Average latency is ~17x that of the BF16 E2B model served with vLLM (3.13s vs 0.18s in the benchmark above).
91% accuracy – not production-grade. The full-precision model or a 4-bit variant is better for real use.

How to load

# This model uses HQQ's custom tensor format.
# Standard transformers loading will NOT work correctly.
# You need the hqq library.

pip install hqq transformers torch

Citation

@misc{jngb-labs-llm-limbo-2026,
  title={LLM Limbo: Quantising Gemma 4 to Bits and Pieces},
  author={JNGB Labs},
  year={2026},
  url={https://www.jngb.online/notes/07-llm-limbo}
}
