Gemma 4 E2B-it · PMRA mixed-precision GGUF

A ~3.1 GB GGUF of Google DeepMind's Gemma 4 E2B-it, built to the Q3_K_S size budget. On Wikitext-2 validation it scores ~5.1 nats lower NLL than plain Q3_K_S at the same payload size, and it edges out the larger Q4_K_M (12.88 vs 13.55) while staying smaller. It is a standard GGUF for text generation in llama.cpp / Ollama.

The model

Gemma 4 E2B-it is the smallest instruction-tuned member of Google DeepMind's Gemma 4 family. Gemma 4 models are multimodal (text + image, with audio on the small models), carry a context window of up to 256K tokens, and support 140+ languages; the family spans dense and Mixture-of-Experts designs in four sizes (E2B, E4B, 26B-A4B, 31B). The "E2B" variant is the phone-and-laptop-class entry point: small enough to run locally, which is exactly the regime where a better same-size quant matters most.

Scope of this artifact: this GGUF targets the text stack for text generation in llama.cpp; image/audio input is not exercised here. Calibrated and measured on English.

Why this build (PMRA)

A normal GGUF quant uses one format for nearly every tensor, paying the same bit-rate everywhere. Production Mixed-Rate Allocation (PMRA) measures each tensor group's contribution to quality and spends bits where they buy the most. Starting from a low-bit Q2_K floor, it promotes the groups that matter to stronger formats (Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M) under a fixed byte budget. The result is one standard GGUF at the Q3_K_S size that scores a much lower NLL than plain Q3_K_S.
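
To make the allocation idea concrete, here is a minimal sketch of a budgeted selection over tensor groups, framed as a multiple-choice knapsack and solved by exhaustive search so it stays easy to check at toy size. It is not the PMRA implementation: the group names, byte costs, and gain values are hypothetical, and a real selector would need a scalable solver and measured calibration gains.

```python
from itertools import product

# Hypothetical options per tensor group:
# (format, extra bytes over the Q2_K floor, estimated NLL gain).
# All numbers are made up for illustration, not measurements from this build.
GROUPS = {
    "blk.0.attn": [("Q2_K", 0, 0.0), ("Q3_K_M", 40, 0.9), ("Q4_K_M", 90, 1.2)],
    "blk.0.ffn": [("Q2_K", 0, 0.0), ("Q3_K_M", 120, 0.5), ("Q4_K_M", 260, 0.8)],
    "blk.1.attn": [("Q2_K", 0, 0.0), ("IQ4_XS", 70, 1.0)],
}


def allocate(groups, budget):
    """Pick exactly one format per group, maximizing total gain within the byte budget."""
    names = list(groups)
    best_gain, best_plan = -1.0, None
    # Exhaustive search: fine for a toy, exponential in the number of groups.
    for combo in product(*(groups[name] for name in names)):
        cost = sum(option[1] for option in combo)
        gain = sum(option[2] for option in combo)
        if cost <= budget and gain > best_gain:
            best_gain = gain
            best_plan = {name: option[0] for name, option in zip(names, combo)}
    return best_gain, best_plan


gain, plan = allocate(GROUPS, budget=200)
print(plan, round(gain, 3))
```

With a budget of 200 extra bytes, this toy promotes blk.0.attn to Q4_K_M and blk.1.attn to IQ4_XS and leaves blk.0.ffn at the Q2_K floor, because the FFN promotion costs more bytes per unit of gain.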

Headline (Wikitext-2 validation, lower NLL is better):

| Build | NLL | Size |
|---|---|---|
| this PMRA build (knapsack) | 12.88 | 3.094 GB |
| plain Q3_K_S (same budget) | 17.99 | 3.094 GB |
| Q4_K_M (larger) | 13.55 | 3.412 GB |

−5.11 NLL vs the same-size Q3_K_S, and lower NLL than even the larger Q4_K_M.

Which file?

  • gemma4_e2b_it_pmra_calib_knapsack.gguf — recommended (knapsack selector).
  • gemma4_e2b_it_pmra_calib_greedy.gguf — the earlier greedy-selector build, kept for reference (the knapsack build is 0.40 NLL better).

Quick start

llama-cli -m gemma4_e2b_it_pmra_calib_knapsack.gguf \
  -p "Write a short hello from PMRA." -n 80

Needs a recent llama.cpp build (or Ollama) with Gemma 4 support. ~3.1 GB on disk; runs on CPU.
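
For scripted use, the same invocation can be wrapped from Python. This is a small sketch that only reuses the `-m`, `-p`, and `-n` flags from the command above; the helper name `build_command` is hypothetical, and passing an empty stdin is an assumption meant to keep the process from waiting for interactive input.

```python
import shutil
import subprocess
from pathlib import Path

MODEL = "gemma4_e2b_it_pmra_calib_knapsack.gguf"


def build_command(model, prompt, n_predict=80):
    """Return the llama-cli argument list used in the quick start."""
    return ["llama-cli", "-m", str(model), "-p", prompt, "-n", str(n_predict)]


cmd = build_command(MODEL, "Write a short hello from PMRA.")

# Only run when both the binary and the model file are actually present.
if shutil.which("llama-cli") and Path(MODEL).exists():
    result = subprocess.run(
        cmd,
        stdin=subprocess.DEVNULL,  # assumption: avoids blocking on interactive input
        capture_output=True,
        text=True,
        check=False,
    )
    print(result.stdout)
else:
    print("llama-cli or model file not found; command would be:", " ".join(cmd))
```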

Footprint

  • file: gemma4_e2b_it_pmra_calib_knapsack.gguf
  • size: 3,110,215,968 bytes (≈ 3.11 GB) · payload 3,094,397,068 bytes · tensor count 601
  • file bpw: 5.354 · payload bpw: 5.327
  • SHA-256: a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97
  • tensor reload mismatches: 0
  • local llama.cpp smoke (build a8fd165): 30.5 prompt tok/s · 10.6 decode tok/s
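
The size and SHA-256 listed above can be checked after download. The sketch below uses only the Python standard library and assumes the file sits in the current directory under its published name; the function name `sha256_of` is illustrative.

```python
import hashlib
from pathlib import Path

MODEL = Path("gemma4_e2b_it_pmra_calib_knapsack.gguf")
EXPECTED_SHA256 = "a5a80f2628e236a228f2016bcc3ac660a268f2c8757d21d901095c74b60e3d97"
EXPECTED_SIZE = 3_110_215_968  # bytes, from the footprint list


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a multi-GB GGUF never sits in memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


if MODEL.exists():
    print("size ok:  ", MODEL.stat().st_size == EXPECTED_SIZE)
    print("sha256 ok:", sha256_of(MODEL) == EXPECTED_SHA256)
else:
    print(f"{MODEL} not found; download it first")
```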

general.file_type is inherited from the metadata source (GGUF has no enum for this mixed allocation); use the embedded pmra.* metadata and artifact_report_knapsack.json for payload accounting.

Benchmarks

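Calibration: Wikitext-2-raw train. Evaluation: Wikitext-2-raw validation. Lower NLL is better. The PMRA, Q3_K_S, and same-budget random rows share the Q3_K_S payload budget (the greedy build lands slightly under it); the other quant rows are shown at their native sizes.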

| Variant | NLL | Payload bpw | Payload bytes |
|---|---|---|---|
| fp16 reference | 14.381222 | 16.000000 | 9,294,899,782 |
| Q2_K (low source) | 20.376913 | 5.118105 | 2,973,267,084 |
| Q3_K_S (target / control) | 17.993582 | 5.326613 | 3,094,396,044 |
| Q3_K_M | 15.619944 | 5.483489 | 3,185,529,996 |
| Q3_K_L | 15.756687 | 5.622925 | 3,266,532,492 |
| IQ4_XS | 16.043206 | 5.670221 | 3,294,008,460 |
| Q4_K_M | 13.549753 | 5.873431 | 3,412,059,276 |
| same-budget random | 20.488594 | 5.326613 | 3,094,396,044 |
| PMRA c2_calib_greedy_mixed | 13.281400 | 5.326291 | 3,094,208,652 |
| PMRA c2_calib_knapsack_mixed | 12.878809 | 5.326613 | 3,094,396,044 |

  • knapsack vs Q3_K_S target: −5.114774 NLL, matched payload
  • knapsack vs same-budget random: −7.609785 NLL
  • knapsack vs greedy: −0.402591 NLL
  • selected tensor groups: 204
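
The bpw column and the deltas can be reproduced from the NLL and payload-byte figures. The sketch below assumes the fp16 row is exactly 16 bits per weight, which implies the weight count used as the denominator; that inferred count is an assumption, not a figure stated in the reports.

```python
# (NLL, payload bytes) copied from the benchmark table.
ROWS = {
    "fp16": (14.381222, 9_294_899_782),
    "Q2_K": (20.376913, 2_973_267_084),
    "Q3_K_S": (17.993582, 3_094_396_044),
    "Q4_K_M": (13.549753, 3_412_059_276),
    "random": (20.488594, 3_094_396_044),
    "greedy": (13.281400, 3_094_208_652),
    "knapsack": (12.878809, 3_094_396_044),
}

# Assumption: the fp16 payload is 16 bits per weight, so weights = bytes * 8 / 16.
N_WEIGHTS = ROWS["fp16"][1] * 8 / 16


def bpw(name):
    """Payload bits per weight for one row."""
    return ROWS[name][1] * 8 / N_WEIGHTS


def delta(name, baseline):
    """NLL difference; negative means `name` is better than `baseline`."""
    return ROWS[name][0] - ROWS[baseline][0]


for name in ROWS:
    print(f"{name:9s} bpw={bpw(name):.6f}")
for base in ("Q3_K_S", "random", "greedy"):
    print(f"knapsack vs {base}: {delta('knapsack', base):+.6f} NLL")
```

Because the table values are already rounded to six decimals, a recomputed delta can differ from the listed one in the last digit.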

How it was built

  • base: google/gemma-4-E2B-it
  • GGUF sources: mradermacher/gemma-4-E2B-it-GGUF
  • tensor profile gemma4 · selector c2_calib_knapsack_mixed
  • low source Q2_K → target/control Q3_K_S; promotion menu Q3_K_M, Q3_K_L, IQ4_XS, Q4_K_M

Source mix

| Source | Tensors | Payload bytes |
|---|---|---|
| Q2_K | 397 | 2,637,615,244 |
| Q3_K_M | 84 | 233,001,984 |
| Q4_K_M | 56 | 119,282,688 |
| IQ4_XS | 40 | 83,140,608 |
| Q3_K_L | 24 | 21,356,544 |
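
As a quick consistency check, the source-mix rows can be summed and compared with the footprint figures (601 tensors, payload 3,094,397,068 bytes). This is plain arithmetic on the numbers listed above; nothing here reads the GGUF itself.

```python
# (tensor count, payload bytes) per source format, copied from the source mix.
SOURCE_MIX = {
    "Q2_K": (397, 2_637_615_244),
    "Q3_K_M": (84, 233_001_984),
    "Q4_K_M": (56, 119_282_688),
    "IQ4_XS": (40, 83_140_608),
    "Q3_K_L": (24, 21_356_544),
}

total_tensors = sum(tensors for tensors, _ in SOURCE_MIX.values())
total_bytes = sum(nbytes for _, nbytes in SOURCE_MIX.values())
promoted_tensors = total_tensors - SOURCE_MIX["Q2_K"][0]

print("tensors:", total_tensors, "payload bytes:", total_bytes)
print("promoted above Q2_K:", promoted_tensors)
for fmt, (tensors, nbytes) in SOURCE_MIX.items():
    share = 100 * nbytes / total_bytes
    print(f"{fmt:7s} {tensors:4d} tensors  {share:5.1f}% of payload")
```

The totals come to 601 tensors and 3,094,397,068 bytes, matching the footprint. About 85% of the payload stays at Q2_K, and the 204 promoted tensors line up with the reported count of selected tensor groups if each group corresponds to one tensor.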

Files

  • gemma4_e2b_it_pmra_calib_knapsack.gguf (recommended), gemma4_e2b_it_pmra_calib_greedy.gguf
  • artifact_report_knapsack.json / .md, selector_result_knapsack.json / .md
  • llama_cli_smoke_knapsack.log, GEMMA4_E2B_IT_KNAPSACK_RELEASE.md, and the prior greedy-release reports

Attribution & license

Derived from google/gemma-4-E2B-it (Google DeepMind) and public GGUF quantizations from mradermacher/gemma-4-E2B-it-GGUF, via llama.cpp GGUF tooling. Released under the Gemma 4 / Apache-2.0 terms; preserve upstream model, license, and quantization attribution when redistributing.

Method + reproduction: https://github.com/asystemoffields/PMRA

Limitations

  • Experimental, English-calibrated; broader multilingual and multimodal evaluation is future work.
  • The selector (greedy or knapsack, driven by calibration data) works at tensor granularity; a finer-grained allocation may improve the size/quality frontier further.