Gemma 4 E2B – Text-Only, 3-Bit (HQQ)

A stripped and quantised version of Google's gemma-4-E2B-it. We removed the vision encoder and audio tower, then compressed the remaining language model to 3 bits per weight using HQQ.

The result: a text-only model that reads the invoice total correctly 91% of the time on our proprietary "trust me bro" invoice benchmark (n=200, synthetic, no peer review). At 2 bits the model collapses entirely, so 3-bit it is.

Read the writeup: LLM Limbo: Quantising Gemma 4 to Bits and Pieces. It covers the full experiment that produced this model: three model sizes, six quantisation methods, two modalities, and the discovery that the cliff between a working language model and pure noise is exactly one bit wide.

What was done

  1. Stripped all vision and audio tensors from google/gemma-4-E2B-it (1,411 tensors removed, 600 kept)
  2. Quantised the language model to 3-bit with HQQ (group_size=64, PyTorch backend)
  3. Kept the lm_head layer in fp16 (vocabulary projection – quantising this destroys output quality)
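The stripping step can be pictured as a filter over checkpoint tensor names. This is a minimal sketch, not the script used for this model: the marker strings `vision_tower` and `audio_tower` and the example key names are hypothetical, and the real Gemma 4 checkpoint keys may be named differently.

```python
# Hypothetical name fragments; check the real checkpoint keys before relying on them.
DROP_MARKERS = ("vision_tower", "audio_tower")


def split_tensor_names(names, drop_markers=DROP_MARKERS):
    """Partition tensor names into (kept, removed) by substring match."""
    kept, removed = [], []
    for name in names:
        if any(marker in name for marker in drop_markers):
            removed.append(name)
        else:
            kept.append(name)
    return kept, removed


# Made-up key names, for illustration only.
example_names = [
    "model.language_model.layers.0.self_attn.q_proj.weight",
    "model.vision_tower.encoder.layers.0.mlp.fc1.weight",
    "model.audio_tower.layers.0.attn.q_proj.weight",
    "lm_head.weight",
]

kept, removed = split_tensor_names(example_names)
print(f"{len(removed)} tensors removed, {len(kept)} kept")
```

On a real checkpoint, the names would come from the loaded state dict (for example via `safetensors.torch.load_file`), and only the kept subset would be written back with `safetensors.torch.save_file`.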

Specs

| Property | Value |
| --- | --- |
| Base model | google/gemma-4-E2B-it (2.3B params) |
| Modalities | Text only (vision + audio removed) |
| Quantisation | HQQ 3-bit, group_size=64 |
| GPU memory | ~6 GB |
| Bits per weight | 3 (effective ~3.27 incl. scales + lm_head) |
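To make "3-bit, group_size=64" concrete, here is a minimal sketch of group-wise quantisation and the per-weight overhead of the group metadata. It uses plain round-to-nearest, not HQQ's half-quadratic optimisation of the zero-point, and the metadata sizes (16 or 32 bits per group) are assumptions. The card does not say how the ~3.27 figure was derived.

```python
import math


def quantize_group(values, nbits=3):
    """Asymmetric round-to-nearest quantisation of one group of weights."""
    levels = (1 << nbits) - 1               # 3-bit codes run from 0 to 7
    lo, hi = min(values), max(values)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / scale) for v in values]
    return codes, scale, lo                 # lo plays the role of the zero-point


def dequantize_group(codes, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [c * scale + zero for c in codes]


def quantize(weights, nbits=3, group_size=64):
    """Quantise a flat list of weights, one (codes, scale, zero) tuple per group."""
    return [
        quantize_group(weights[i:i + group_size], nbits)
        for i in range(0, len(weights), group_size)
    ]


def effective_bits(nbits, group_size, meta_bits_per_group):
    """Bits per weight once per-group scale and zero-point storage is included."""
    return nbits + meta_bits_per_group / group_size


weights = [0.1 * math.sin(i) for i in range(128)]   # toy data, two groups of 64
groups = quantize(weights)

print(effective_bits(3, 64, 16))   # assumed 8-bit scale + 8-bit zero per group
print(effective_bits(3, 64, 32))   # assumed fp16 scale + fp16 zero per group
```

Each weight is reconstructed to within half a quantisation step of its group, which is why smaller groups (more scales) trade memory for accuracy.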

Benchmark: Invoice reading

Tested on 200 synthetic invoices with cent-precise ground truth (mixed number formats, VAT variants, discount structures). 160 invoices have correct arithmetic, 40 have deliberate errors the model should flag.
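A cent-precise ground-truth check of the kind described above might look like the following sketch. The rules here are assumptions, not the benchmark's actual logic: the discount is applied to the net amount before VAT, and amounts are rounded half-up to whole cents.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")


def expected_total(net, vat_rate, discount=Decimal("0")):
    """Gross total: subtract discount from net, add VAT, round to cents."""
    discounted = net - discount
    vat = (discounted * vat_rate).quantize(CENT, rounding=ROUND_HALF_UP)
    return (discounted + vat).quantize(CENT, rounding=ROUND_HALF_UP)


def is_broken(net, vat_rate, stated_total, discount=Decimal("0")):
    """True if the total printed on the invoice disagrees with the arithmetic."""
    return expected_total(net, vat_rate, discount) != stated_total


net = Decimal("100.00")
vat_rate = Decimal("0.19")
discount = Decimal("10.00")

print(expected_total(net, vat_rate, discount))                 # 107.10
print(is_broken(net, vat_rate, Decimal("107.10"), discount))   # False
print(is_broken(net, vat_rate, Decimal("117.10"), discount))   # True
```

`Decimal` avoids the binary floating-point drift that would make exact cent comparisons unreliable.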

| Metric | 31B BF16 | E2B BF16 | E2B 3-bit (this) | E2B 2-bit |
| --- | --- | --- | --- | --- |
| Parse rate | 100% | 100% | 99.5% | 0% |
| Read total correctly | 100% | 100% | 91.0% | 0% |
| Flagged broken invoices (n=40) | 29/40 | 0/40 | ~9/40 | |
| False flags on correct (n=160) | 30/160 | 3/160 | ~14/160 | |
| Avg latency | 1.10s (vLLM) | 0.18s (vLLM) | 3.13s (HQQ/PyTorch) | |
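The metrics in the table above could be computed from per-invoice results roughly as follows. This is a sketch with hypothetical field names, and it assumes parse rate and total accuracy are measured over all invoices (91.0% would then correspond to 182 of 200).

```python
from dataclasses import dataclass


@dataclass
class Result:
    parsed: bool          # model output could be parsed at all
    total_correct: bool   # extracted total matches ground truth to the cent
    is_broken: bool       # ground truth: invoice contains a deliberate error
    flagged: bool         # model claimed the invoice arithmetic is wrong


def score(results):
    """Aggregate per-invoice results into the four table metrics."""
    n = len(results)
    broken = [r for r in results if r.is_broken]
    correct = [r for r in results if not r.is_broken]
    return {
        "parse_rate": sum(r.parsed for r in results) / n,
        "total_accuracy": sum(r.total_correct for r in results) / n,
        "flagged_broken": (sum(r.flagged for r in broken), len(broken)),
        "false_flags": (sum(r.flagged for r in correct), len(correct)),
    }


# Toy data, not benchmark output.
toy = [
    Result(parsed=True, total_correct=True, is_broken=False, flagged=False),
    Result(parsed=True, total_correct=True, is_broken=True, flagged=True),
    Result(parsed=True, total_correct=False, is_broken=True, flagged=False),
    Result(parsed=False, total_correct=False, is_broken=False, flagged=True),
]

metrics = score(toy)
print(metrics)
```

Keeping "flagged broken" and "false flags" as separate counts matters here: a model that flags everything would score 40/40 on the first and 160/160 on the second.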

At 2-bit, the model no longer generates coherent output – just random tokens. In these experiments, 3 bits was the lowest precision at which a model of this size still worked. The writeup (LLM Limbo) has the full scoreboard across precisions and model sizes.

Why this exists

We ran a series of quantisation experiments to find the practical lower bound for structured document extraction. This model is the result – the smallest configuration that still produces usable output. One step below (2-bit), the model stops generating coherent text entirely.

The reasoning, the methodology, and the surprises along the way are documented in LLM Limbo: Quantising Gemma 4 to Bits and Pieces.

Limitations

  • Text only – cannot process images or audio. Feed it OCR output or structured text.
  • HQQ format – requires the hqq library to load. Standard from_pretrained will show "UNEXPECTED" tensor warnings and produce garbage.
  • No CUDA acceleration – HQQ PyTorch backend dequantises on the fly. Inference is ~17x slower than the BF16 E2B model served with vLLM (3.13s vs 0.18s average latency).
  • 91% accuracy – not production-grade. The full-precision model or a 4-bit variant is better for real use.
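For reference, the slowdown figure follows directly from the average latencies in the benchmark table. A small sketch of the arithmetic, using only those reported numbers:

```python
# Average latencies in seconds, copied from the benchmark table.
latency_s = {
    "31B BF16 (vLLM)": 1.10,
    "E2B BF16 (vLLM)": 0.18,
    "E2B 3-bit (HQQ/PyTorch)": 3.13,
}

baseline = latency_s["E2B BF16 (vLLM)"]
slowdown = {name: seconds / baseline for name, seconds in latency_s.items()}

for name, factor in slowdown.items():
    print(f"{name}: {factor:.1f}x the E2B BF16 latency")
```

The comparison mixes serving stacks (vLLM versus plain PyTorch), so the ratio reflects the whole inference path rather than the cost of 3-bit weights alone.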

How to load

# This model uses HQQ's custom tensor format.
# Standard transformers loading will NOT work correctly.
# You need the hqq library.

pip install hqq transformers torch

Citation

@misc{jngb-labs-llm-limbo-2026,
  title={LLM Limbo: Quantising Gemma 4 to Bits and Pieces},
  author={JNGB Labs},
  year={2026},
  url={https://www.jngb.online/notes/07-llm-limbo}
}