🧊 Google/Gemma-4-31B-it · imatrix · GGUF

imatrix (hybrid)
📦 10.2 · 13.4 · 15.6 · 19.8 GiB (IQ2_M · IQ3_M · IQ4_XS · Q5_K_S) | 🏗️ llama.cpp f3e1828 | 🏅 Agent · 100% patch | 👁️ Text + Image · mmproj 772 MB | ⚡ MTP drafter · 88% accept @ n=1 | 🎯 Q5_K_S · text+vision+MTP in 24 GB

🧊 What this is

An aggressively compressed (under 3 bpw) IQ2_M quantization of google/gemma-4-31B-it, calibrated with a hybrid imatrix built from real coding/tool-use logs. Runs in vanilla llama.cpp / Ollama / LM Studio — no custom runtime, no extra inference cost. Higher-bit IQ3_M (3.76 bpw), IQ4_XS (4.36 bpw), and Q5_K_S (5.55 bpw — the highest-fidelity build, KLD 0.025 vs FP16) builds are also included for users with more VRAM.

👁️ Now with vision (text + image input)

Gemma 4 is natively multimodal. This repo ships the model's vision tower as a separate mmproj-gemma-4-31B-it-Q8_0.gguf (772 MB, SigLIP-style 27-layer encoder, Q8_0 — visually lossless vs F16 at ⅔ the size). Pair it with any of the four quant files via --mmproj and the model can see images — describe screenshots, read diagrams, answer questions about a UI, and so on. The text quant is unchanged; vision adds only the small mmproj. See Usage → Vision below.

🎯 The 24 GB build (Q5_K_S)

The new Q5_K_S build (5.55 bpw, 19.85 GiB) is sized so a single 24 GB GPU can host the full stack at once: the 5-bit text trunk (19.85) + the Q8 MTP drafter (0.48) + the Q8 vision mmproj (0.75) = ~21.1 GiB, leaving ~2.9 GiB for a real KV cache (more with --cache-type-k q8_0 --cache-type-v q8_0). Near-FP16 quality (KLD 0.025), images, and speculative decoding — all on one consumer card, no offload.
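
The budget above is simple addition, and a minimal sketch makes it easy to redo for a different card or component mix. It assumes the card's "24 GB" means 24 GiB and counts weights only; compute buffers and the driver context are not modelled, so the real headroom will be somewhat smaller. The function name is illustrative.

```python
# Component sizes in GiB, as quoted in the paragraph above.
COMPONENTS_GIB = {
    "Q5_K_S text trunk": 19.85,
    "Q8_0 MTP drafter": 0.48,
    "Q8_0 vision mmproj": 0.75,
}


def kv_headroom_gib(vram_gib, components):
    """VRAM left for the KV cache after loading every component (weights only)."""
    return vram_gib - sum(components.values())


weights = sum(COMPONENTS_GIB.values())
headroom = kv_headroom_gib(24.0, COMPONENTS_GIB)  # assumption: 24 GB card == 24 GiB
print(f"weights: {weights:.2f} GiB, headroom: {headroom:.2f} GiB")
# prints: weights: 21.08 GiB, headroom: 2.92 GiB
```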

📉 ~5.6× smaller: 10.17 GiB on disk vs 57.2 GiB FP16, at ~2.85 bits/weight.
🤖 Actually agentic: 47% pass / 100% patch on a 10-instance agentic SWE-rebench holdout (IQ4_XS). IQ2_M still resolved 40%, the best of every sub-3-bpw arm tested.
🛠️ Standard GGUF: loads anywhere llama.cpp runs. No patches, kernels, or forks.
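
The size and bits-per-weight figures can be cross-checked against each other. The sketch below assumes the FP16 file stores the same weights at 16 bits each and that metadata overhead is negligible, so average bpw scales linearly with file size. The sizes are the ones quoted in this card; the helper names are illustrative.

```python
FP16_GIB = 57.20   # FP16 reference size from this card
FP16_BPW = 16.0    # bits per weight of the reference

# On-disk sizes (GiB) of the shipped quants, as listed in this card.
SIZES_GIB = {"Q5_K_S": 19.85, "IQ4_XS": 15.59, "IQ3_M": 13.43, "IQ2_M": 10.17}


def effective_bpw(size_gib):
    """Average bits per weight implied by file size, relative to FP16."""
    return FP16_BPW * size_gib / FP16_GIB


def compression_ratio(size_gib):
    """How many times smaller the file is than the FP16 reference."""
    return FP16_GIB / size_gib


for name, size in SIZES_GIB.items():
    print(f"{name}: {effective_bpw(size):.2f} bpw, {compression_ratio(size):.2f}x smaller")
```

The results land within about 0.01 bpw of the figures quoted in the card; the small residual is expected from rounding in the published sizes.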

📊 Unified benchmark & quality table

Agentic metrics come from a SWE-rebench holdout run through the OpenAI Agents SDK (10 instances × 3 reps). Static metrics (PPL / KLD / same_top_p) are measured against FP16 on a held-out eval corpus at ctx=4096. The KLD column reports the median, which is robust to per-token tails.

| Metric | FP16 (ref) | Q5_K_S | IQ4_XS | IQ3_M | IQ2_M |
|---|---|---|---|---|---|
| File | | Q5_K_S.gguf | IQ4_XS.gguf | IQ3_M.gguf | IQ2_M.gguf |
| Quality | - | | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| BPW | 16.0 | 5.55 | 4.36 | 3.76 | 2.85 |
| Size (GiB) | 57.20 | 19.85 | 15.59 | 13.43 | 10.17 |
| 🤖 Pass Rate | | 40±8% | 47±5% | 33±12% | 40±8% |
| 🤖 Patch Rate | | 100% | 100% | 100% | 100% |
| 🤖 Tool Errors | | 11±2% | 10±3% | 16±2% | 16±1% |
| 🤖 Mean Tokens | | 663K±111K | 575K±70K | 483K±75K | 558K±94K |
| 📐 PPL | 215.5 | 256.5 | 319.4 | 734.1 | 1958.7 |
| 📐 KLD (med) | 0.000 | 0.025 | 0.073 | 0.435 | 1.571 |
| 📐 same_top_p | 100.0% | 85.5% | 78.8% | 63.1% | 46.6% |

Q5_K_S resolves 40% of the holdout (tying IQ2_M, ahead of IQ3_M) at 100% patch and a low 11% tool-error rate (on par with IQ4_XS, well under the IQ2/IQ3 arms' 16%) — while being the highest-fidelity build on the static metrics (KLD 0.025). IQ4_XS remains the agentic leader at 47%; the gap is within run-to-run noise.
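
For readers unfamiliar with the static metrics, here is a minimal pure-Python sketch of how a median KLD and a top-token agreement rate can be computed from per-position logits. It reads same_top_p as the fraction of positions where the quantized model's most likely token matches the FP16 reference, which is an assumption about this card's metric. The toy logits are invented for illustration and are not measurements.

```python
import math
from statistics import median


def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats, with p the reference (FP16) distribution."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


def static_metrics(ref_logits, quant_logits):
    """Return (median KLD, fraction of positions with the same top token)."""
    klds, same_top = [], 0
    for ref, quant in zip(ref_logits, quant_logits):
        p, q = softmax(ref), softmax(quant)
        klds.append(kl_divergence(p, q))
        same_top += int(argmax(p) == argmax(q))
    return median(klds), same_top / len(klds)


# Toy logits over a 3-token vocabulary at 3 positions (illustrative only).
ref = [[2.0, 1.0, 0.1], [0.5, 2.5, 0.3], [1.0, 1.2, 3.0]]
quant = [[1.8, 1.1, 0.2], [2.4, 0.6, 0.3], [0.9, 1.3, 2.7]]
med_kld, agree = static_metrics(ref, quant)
print(f"median KLD = {med_kld:.4f}, same top token = {agree:.1%}")
```

Using the median rather than the mean keeps a handful of badly mispredicted positions (like the second one in the toy data) from dominating the summary.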

📌 Sampling & methodology details

Sampling: temperature=0.25, top_p=0.95, top_k=20, max_tokens=32768, ctx=131072, thinking=false. Run on Apple Silicon (Metal); the SWE-rebench linux/amd64 images ran under emulation, so wall-clock times are relative, not absolute.

Pass Rate = gold tests pass after agent's patch (real resolution). Patch Rate = non-empty diff produced.
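
The card does not say how the ± figures are derived. One plausible reading, sketched below, is the mean and population standard deviation of the per-rep pass rate across the 3 reps. Both that reading and the per-rep counts used here are hypothetical; they are chosen only to show the calculation and are not taken from the benchmark logs.

```python
from statistics import mean, pstdev


def pass_rate_summary(resolved_per_rep, n_instances):
    """Mean and population std dev of the per-rep pass rate."""
    rates = [resolved / n_instances for resolved in resolved_per_rep]
    return mean(rates), pstdev(rates)


# Hypothetical counts: instances resolved in each of 3 reps over 10 instances.
m, s = pass_rate_summary([3, 4, 5], n_instances=10)
print(f"{m:.0%} ± {s:.0%}")
# prints: 40% ± 8%
```

With only 3 reps of 10 instances, one extra resolved instance moves a rep's pass rate by 10 points, which is why differences of a few points between arms should be read cautiously.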


🔬 How it was made

  • Hybrid imatrix — activation energy E[a²] mixed with weight-column energy ‖W[:,i]‖²·E[a²] per tensor, collected over real coding/tool-use logs + wiki.test.raw via quant-tuner.
  • IQ2_M codebook — 2-bit E8-lattice non-uniform codes with per-tensor tier bumps (attention output, early ffn_down get more bits). llama-quantize decides the mix.
  • Vision mmproj — the model's SigLIP-style vision tower (27 layers, 280 soft tokens/image) exported separately at Q8_0 with convert_hf_to_gguf.py --mmproj (visually lossless, 772 MB), so the encoder stays high-precision while the text path runs at 2 bits. No audio encoder is shipped (the source has none).
  • Disjoint splits — calibration (imatrix), validation (per-tensor α gate), and eval (PPL/KLD) come from different corpora; the SWE-rebench holdout never appears in any calibration set.
  • Toolchain: quant-tuner for imatrix calibration, llama.cpp @ f3e1828 for final quantization. Calibration logs mined with LogMiner.
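
As an illustration of the hybrid imatrix idea in the first bullet, the sketch below blends the two per-channel importance signals with a per-tensor weight alpha. The linear blend, the normalization of each signal to sum to 1, and the function names are all assumptions made for this example; the card does not state quant-tuner's actual mixing formula.

```python
def normalize(values):
    """Scale non-negative values so they sum to 1 (left unchanged if all zero)."""
    total = sum(values)
    return [v / total for v in values] if total > 0 else list(values)


def hybrid_importance(act_energy, weight, alpha):
    """Blend activation energy with weight-column energy for one tensor.

    act_energy[i] is E[a_i^2] for input channel i.
    weight is a row-major matrix: rows are outputs, columns are input channels.
    alpha in [0, 1] is the per-tensor mixing weight (assumed linear blend).
    """
    n_cols = len(act_energy)
    # Squared L2 norm of each weight column: ||W[:, i]||^2.
    col_norm_sq = [sum(row[i] ** 2 for row in weight) for i in range(n_cols)]
    weighted = [col_norm_sq[i] * act_energy[i] for i in range(n_cols)]
    a, w = normalize(act_energy), normalize(weighted)
    return [(1 - alpha) * ai + alpha * wi for ai, wi in zip(a, w)]


# Toy tensor with 2 input channels and 2 outputs (illustrative values).
act = [1.0, 4.0]
W = [[1.0, 0.0],
     [1.0, 1.0]]
print(hybrid_importance(act, W, alpha=0.5))
```

With alpha=0 this reduces to a plain activation-energy imatrix, and with alpha=1 it uses only the weight-scaled signal, so a validation-driven gate can choose between them per tensor.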

🚀 Usage

Ollama

ollama run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

llama.cpp (GPU)

# Build with CUDA (-DGGML_CUDA=OFF for CPU/Metal)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/

# Run the server (the binaries were copied into llama.cpp/, so run from that directory or adjust the path)
./llama-server \
    --model gemma-4-31B-it-IQ2_M.gguf \
    --ctx-size 16384 --n-gpu-layers 999 --split-mode layer \
    --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
    --parallel 1 --batch-size 2048 --ubatch-size 512 \
    --host 0.0.0.0 --port 1234
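
Loading a 31B model can take a while, so a client script may want to wait until the server is ready before sending requests. The sketch below polls llama-server's /health endpoint, which returns HTTP 200 once the model is loaded. The function names, timeouts, and the injectable probe argument are choices made for this example.

```python
import time
import urllib.error
import urllib.request


def server_ready(base_url="http://127.0.0.1:1234"):
    """Return True if GET /health answers with HTTP 200."""
    try:
        with urllib.request.urlopen(f"{base_url}/health", timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-200 status such as 503 (still loading).
        return False


def wait_until_ready(probe=server_ready, timeout_s=120.0, interval_s=1.0):
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if probe():
            return True
        time.sleep(interval_s)
    return False


# Usage: call wait_until_ready() after starting llama-server and before the first request.
```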

OpenAI-compatible API (Python)

import json, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Gemma 4 is a thinking model — disable or raise max_tokens
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request(
        "http://127.0.0.1:1234/v1/chat/completions",
        json.dumps(body).encode(),
        {"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

print(ask("What is 1+1?"))

🖼️ Vision (text + image)

Gemma 4 is natively multimodal. The vision tower ships separately as mmproj-gemma-4-31B-it-Q8_0.gguf (772 MB), so you only download it if you need images. It pairs with any of the four quant files (IQ2_M / IQ3_M / IQ4_XS / Q5_K_S): each text quant is used unmodified, and the mmproj just adds the SigLIP-style encoder + projector.

One-shot from the CLI (llama-mtmd-cli):

./llama-mtmd-cli \
    --model gemma-4-31B-it-IQ4_XS.gguf \
    --mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
    --image screenshot.png \
    --jinja -ngl 999 --temp 0.2 -n 256 \
    -p "Describe this image. What's in it?"

--jinja is required — Gemma 4's chat template is Jinja-based and the CLI aborts without it. --image can be repeated for multi-image prompts; URLs work too.

⚠️ Thinking + the CLI. Gemma 4 is a reasoning model. From llama-mtmd-cli, leave thinking on and give it enough budget (-n 800+) so the answer survives the reasoning preamble — the --chat-template-kwargs '{"enable_thinking":false}' flag currently returns an empty completion on the CLI path. To get a clean, reasoning-free answer, disable thinking over the HTTP server instead (below).

Vision server: host the quant with the mmproj attached (this is how the worked example below was generated). --jinja is required; the vision tower is loaded via --mmproj:

./llama-server \
    -m gemma-4-31B-it-IQ4_XS.gguf \
    --mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
    --jinja --ctx-size 8192 --n-gpu-layers 999 \
    --host 127.0.0.1 --port 1234

Vision is purely additive — drop the --mmproj flag and you're back to the identical text-only model.

The OpenAI-compatible /v1/chat/completions endpoint then accepts image_url content parts. With chat_template_kwargs.enable_thinking=false the server returns just the answer (no reasoning preamble). This is the exact call used to generate the mecha prompts in the worked example below:

import base64, json, urllib.request

with open("mecha.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

body = {
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": (
            "Look at this image and write a single, detailed text-to-image "
            "generation prompt that would recreate it. Cover the subject, colors, "
            "pose, lighting, style, and background. Respond with only the prompt."
        )},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    "max_tokens": 400,
    "temperature": 0.3,
    "chat_template_kwargs": {"enable_thinking": False},
}
req = urllib.request.Request(
    "http://127.0.0.1:1234/v1/chat/completions",
    json.dumps(body).encode(),
    {"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])
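
The call above hard-codes a PNG data URL. A small helper, sketched below, picks the MIME type from the file extension so JPEG or WebP inputs get the right prefix, and builds a user message with any number of images. The helper names are illustrative, and support for several image_url parts in one message is assumed to mirror the CLI's repeated --image behaviour.

```python
import base64
import mimetypes
from pathlib import Path


def image_part(path):
    """Build an image_url content part with a base64 data URL for a local file."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"not a recognised image type: {path}")
    b64 = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return {"type": "image_url", "image_url": {"url": f"data:{mime};base64,{b64}"}}


def user_message(text, *image_paths):
    """Combine a text prompt and zero or more local images into one user message."""
    parts = [{"type": "text", "text": text}]
    parts.extend(image_part(p) for p in image_paths)
    return {"role": "user", "content": parts}


# Usage: body["messages"] = [user_message("Describe this image.", "mecha.png")]
```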

Worked example: same image, three quants

Three of the shipped quants (IQ4_XS, IQ3_M, IQ2_M) were each handed the same mecha image and asked to "write a single, detailed text-to-image generation prompt that would recreate it".

Image: anime mecha (white/lavender armor, neon-green energy sword, purple railgun)

| Quant | BPW | Generated text-to-image prompt |
|---|---|---|
| IQ4_XS | 4.36 | A high-detail anime-style illustration of a heavy armored mecha robot standing in a powerful, wide-legged stance against a solid black background. The mecha features bulky, angular white and light-grey plating with deep purple accents and glowing neon green energy vents and eyes. In its right hand, it holds a massive, oversized purple railgun, and in its left hand, a glowing neon green energy blade that casts a vibrant light across the armor. The lighting is dramatic and cinematic, with strong rim lighting and high contrast, emphasizing the metallic textures and the bioluminescent glow of the green components. The art style is clean, modern mecha anime with sharp line work and smooth digital shading. |
| IQ3_M | 3.76 | Full-body shot of a heavily armored futuristic mecha robot in a powerful standing pose, designed in a detailed 90s retro anime style. The mecha features a bulky, angular white and light-grey armor plating with deep purple accents and glowing neon green energy vents and sensors integrated into the chest, legs, and eyes. It is armed with a massive, oversized purple sniper rifle held in its right hand and a glowing neon green energy sword held in its left. The lighting is dramatic and moody, with a strong rim light highlighting the silhouette against a solid black background. High contrast, clean cel-shaded lines, sharp mechanical detailing, cinematic composition. |
| IQ2_M | 2.85 | Full-body shot of a futuristic mecha robot in a stylized anime aesthetic, featuring a heavy armored chassis in a palette of white, grey, and deep purple. The mecha is posed in a powerful stance, holding a large purple futuristic firearm in its right hand and a glowing neon-green energy blade in its left hand. The design includes glowing mint-green accents and circuitry lines across the chest, legs, and head. The lighting is dramatic and moody, with a strong rim lighting and a dark, atmospheric background with subtle purple gradients and a slight digital scanline texture. High-contrast cel-shaded style with clean lines and sharp metallic reflections. |

⚡ Speculative decoding (MTP drafter)

This repo also bundles a multi-token-prediction (MTP) drafter at the repo root, mtp-gemma-4-31B-it.gguf (499 MB, Q8_0) — a self-quantized conversion of google/gemma-4-31B-it-assistant (arch gemma4-assistant, nextn_predict_layers = 4). It predicts up to 4 future tokens from the trunk's hidden state so llama.cpp can verify them in a single forward pass. One drafter serves every quant — it keys off the trunk's hidden size / vocab, not the quantization — and the trunk GGUFs are never modified (it loads as a separate --model-draft).

Acceptance rate vs draft depth (--spec-draft-n-max). Fraction of drafted tokens the trunk accepted, swept over n = 1…4 for each quant (5 mixed coding/reasoning prompts × 200 tokens, temperature=0.3, thinking off; scripts/exp046_mtp_acceptance.py, Q5_K_S via scripts/exp047_q5ks_mtp.py — identical method). Higher n drafts more tokens per step but lowers per-token acceptance — pick n for your hardware (speed isn't reported here, it's machine-specific):

| Quant | n=1 | n=2 | n=3 | n=4 |
|---|---|---|---|---|
| Q5_K_S | 87.9% | 81.8% | 73.0% | 66.0% |
| IQ4_XS | 86.5% | 80.2% | 68.6% | 64.0% |
| IQ3_M | 87.2% | 79.1% | 70.8% | 64.6% |
| IQ2_M | 83.1% | 77.1% | 70.6% | 61.4% |

Acceptance holds up across all four trunks — the highest-fidelity Q5_K_S leads at every draft depth (87.9% at n=1, still 66.0% at n=4), and even the 2-bit IQ2_M accepts 83% of single-token drafts.
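
To reason about the depth trade-off without wall-clock numbers, the acceptance rates can be turned into an expected number of tokens emitted per trunk verification pass. The sketch below assumes the drafter proposes exactly n tokens every step and that the trunk always contributes one token of its own, giving 1 + n × acceptance. It ignores the drafter's own cost, so it measures verification passes saved, not speed.

```python
# Acceptance rates for n = 1..4, copied from the table above.
ACCEPTANCE = {
    "Q5_K_S": [0.879, 0.818, 0.730, 0.660],
    "IQ4_XS": [0.865, 0.802, 0.686, 0.640],
    "IQ3_M": [0.872, 0.791, 0.708, 0.646],
    "IQ2_M": [0.831, 0.771, 0.706, 0.614],
}


def tokens_per_step(n, acceptance):
    """Expected tokens per trunk pass: one trunk token plus accepted draft tokens."""
    return 1.0 + n * acceptance


def best_depth(rates):
    """Return (depth with the most tokens per step, {depth: tokens per step})."""
    scores = {n: tokens_per_step(n, r) for n, r in enumerate(rates, start=1)}
    return max(scores, key=scores.get), scores


for quant, rates in ACCEPTANCE.items():
    depth, scores = best_depth(rates)
    summary = ", ".join(f"n={n}: {s:.2f}" for n, s in scores.items())
    print(f"{quant}: {summary} (max at n={depth})")
```

Under these assumptions n=4 yields the most tokens per pass for every quant, but each extra draft token also costs drafter time, so the fastest setting on a given machine may be lower.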

Usage — add --model-draft + --spec-type draft-mtp to the server command:

./llama-server \
    -m gemma-4-31B-it-IQ4_XS.gguf \
    --model-draft mtp-gemma-4-31B-it.gguf \
    --spec-type draft-mtp --spec-draft-n-max 4 \
    --jinja -ngl 999 -fa on \
    --host 127.0.0.1 --port 1234

The drafter lives at the repo root so --spec-type draft-mtp auto-discovers it when you load the trunk with -hf (no manual --model-draft needed): llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ4_XS --spec-type draft-mtp --spec-draft-n-max 4.

Needs a llama.cpp build with gemma4-assistant + draft-mtp support (any master after 2026-06-07; this release used @ f3e1828). The drafter pairs with the vision --mmproj too — text, image, and speculative decoding can all be active at once.
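
Since text, vision, and speculative decoding can be combined, a small launcher helper can assemble the server command from whichever pieces are present. The sketch below only uses flags that already appear in this card. The Q5_K_S trunk filename is an assumption that follows the naming pattern of the other commands, and the function name and defaults are illustrative.

```python
import shlex


def server_argv(trunk, mmproj=None, drafter=None, draft_n_max=4,
                ctx_size=8192, host="127.0.0.1", port=1234):
    """Build a llama-server argument list; vision and MTP are optional add-ons."""
    argv = [
        "./llama-server", "-m", trunk,
        "--jinja", "--ctx-size", str(ctx_size), "--n-gpu-layers", "999",
        "--flash-attn", "on", "--cache-type-k", "q8_0", "--cache-type-v", "q8_0",
    ]
    if mmproj:
        argv += ["--mmproj", mmproj]
    if drafter:
        argv += ["--model-draft", drafter,
                 "--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    argv += ["--host", host, "--port", str(port)]
    return argv


full_stack = server_argv(
    "gemma-4-31B-it-Q5_K_S.gguf",  # assumed filename, following the other quants
    mmproj="mmproj-gemma-4-31B-it-Q8_0.gguf",
    drafter="mtp-gemma-4-31B-it.gguf",
)
print(shlex.join(full_stack))
```

The list form can be passed directly to subprocess.Popen, while shlex.join gives a copy-pasteable shell line.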


🪪 License & attribution
