📊 Unified benchmark & quality table
Agentic metrics come from a SWE-rebench holdout run through the OpenAI Agents SDK (10 instances × 3 reps). Static metrics (PPL / KLD / top-p) are measured against FP16 on a held-out eval corpus at ctx=4096. The KLD column reports the median, which is robust to heavy per-token tails.
| Metric | FP16 (ref) | Q5_K_S | IQ4_XS | IQ3_M | IQ2_M |
|---|---|---|---|---|---|
| File | — | Q5_K_S.gguf | IQ4_XS.gguf | IQ3_M.gguf | IQ2_M.gguf |
| Quality | — | ⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ | ⭐ |
| BPW | 16.0 | 5.55 | 4.36 | 3.76 | 2.85 |
| Size (GiB) | 57.20 | 19.85 | 15.59 | 13.43 | 10.17 |
| 🤖 Pass Rate | — | 40±8% | 47±5% | 33±12% | 40±8% |
| 🤖 Patch Rate | — | 100% | 100% | 100% | 100% |
| 🤖 Tool Errors | — | 11±2% | 10±3% | 16±2% | 16±1% |
| 🤖 Mean Tokens | — | 663K±111K | 575K±70K | 483K±75K | 558K±94K |
| 📐 PPL | 215.5 | 256.5 | 319.4 | 734.1 | 1958.7 |
| 📐 KLD (med) | 0.000 | 0.025 | 0.073 | 0.435 | 1.571 |
| 📐 same_top_p | 100.0% | 85.5% | 78.8% | 63.1% | 46.6% |
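The BPW and Size rows in the table above are linked by simple arithmetic. The sketch below is only a sanity check, assuming every file stores the same set of tensors so that size scales linearly with bits per weight; the function name is illustrative and not part of any toolchain.

```python
# Values copied from the table: the FP16 reference is 57.20 GiB at 16.0 BPW.
FP16_SIZE_GIB = 57.20
FP16_BPW = 16.0

def estimated_size_gib(bpw):
    """Scale the FP16 file size by the ratio of bits per weight (assumption: same tensor set)."""
    return FP16_SIZE_GIB * bpw / FP16_BPW

# BPW values from the table, paired with the sizes the table reports.
table = {"Q5_K_S": (5.55, 19.85), "IQ4_XS": (4.36, 15.59),
         "IQ3_M": (3.76, 13.43), "IQ2_M": (2.85, 10.17)}

for name, (bpw, reported) in table.items():
    print(f"{name}: estimated {estimated_size_gib(bpw):.2f} GiB, reported {reported:.2f} GiB")
```

Under that assumption the estimates land within about 0.02 GiB of the reported sizes, which is within the rounding of the two-decimal BPW figures.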
Q5_K_S resolves 40% of the holdout (tying IQ2_M, ahead of IQ3_M) at a 100% patch rate and a low 11% tool-error rate (on par with IQ4_XS, well under the 16% of the IQ2_M and IQ3_M arms), while also being the highest-fidelity build on the static metrics (KLD 0.025). IQ4_XS remains the agentic leader at 47%, but the gap is within run-to-run noise.
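To make "run-to-run noise" concrete, here is a minimal sketch of how a `mean ± spread` pass rate can be derived from 3 reps over 10 instances. The per-rep counts are hypothetical, and the use of the population standard deviation is an assumption: the card does not say which spread statistic the ± denotes.

```python
from statistics import mean, pstdev

def summarize(resolved_per_rep, n_instances):
    """Return (mean pass rate, population std dev) across repetitions."""
    rates = [resolved / n_instances for resolved in resolved_per_rep]
    return mean(rates), pstdev(rates)

# Hypothetical counts: 3, 4, and 5 of 10 instances resolved in three reps.
m, s = summarize([3, 4, 5], n_instances=10)
print(f"{m:.0%} ± {s:.0%}")  # prints "40% ± 8%"
```

With only ten instances per rep, a single extra resolved instance moves a rep by ten percentage points, which is why differences of a few points between quants should not be over-read.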
📌 Sampling & methodology details
Sampling: temperature=0.25, top_p=0.95, top_k=20, max_tokens=32768, ctx=131072, thinking=false. Run on Apple Silicon (Metal); SWE-rebench linux/amd64 images run under emulation, so wall-clock is relative, not absolute.
Pass Rate = gold tests pass after the agent's patch (real resolution).
Patch Rate = non-empty diff produced.
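For the static metrics, the following is a minimal sketch of how a median KLD and a top-token agreement rate can be computed from per-position next-token distributions. It assumes `same_top_p` means the share of positions where the quantized model and the FP16 reference rank the same token first. The toy distributions are made up, and this is not the evaluation script used for the table.

```python
import math
from statistics import median

def kl_divergence(p_ref, p_quant, eps=1e-12):
    """KL(P_ref || P_quant) in nats for one token position."""
    return sum(p * math.log(p / max(q, eps))
               for p, q in zip(p_ref, p_quant) if p > 0)

def argmax(dist):
    return max(range(len(dist)), key=dist.__getitem__)

def static_metrics(ref_dists, quant_dists):
    """Return (median KLD, fraction of positions with the same top token)."""
    klds = [kl_divergence(p, q) for p, q in zip(ref_dists, quant_dists)]
    same = sum(argmax(p) == argmax(q) for p, q in zip(ref_dists, quant_dists))
    return median(klds), same / len(ref_dists)

# Toy data: two positions over a 3-token vocabulary.
ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
quant = [[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]]
med_kld, same_top = static_metrics(ref, quant)
print(f"median KLD = {med_kld:.3f}, same top token = {same_top:.0%}")
```

Reporting the median rather than the mean keeps a handful of badly mispredicted positions from dominating the summary.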
🔬 How it was made
- Hybrid imatrix: activation energy `E[a²]` mixed with weight-column energy `‖W[:,i]‖²·E[a²]` per tensor, collected over real coding/tool-use logs + `wiki.test.raw` via quant-tuner.
- IQ2_M codebook: 2-bit E8-lattice non-uniform codes with per-tensor tier bumps (attention output, early `ffn_down` get more bits). `llama-quantize` decides the mix.
- Vision mmproj: the model's SigLIP-style vision tower (27 layers, 280 soft tokens/image) exported separately at Q8_0 with `convert_hf_to_gguf.py --mmproj` (visually lossless, 772 MB), so the encoder stays high-precision while the text path runs at 2 bits. No audio encoder is shipped (the source has none).
- Disjoint splits: calibration (imatrix), validation (per-tensor α gate), and eval (PPL/KLD) come from different corpora; the SWE-rebench holdout never appears in any calibration set.
- Toolchain: quant-tuner for imatrix calibration, llama.cpp `@ f3e1828` for final quantization. Calibration logs mined with LogMiner.
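The two energies named in the hybrid-imatrix bullet can be illustrated with a small pure-Python sketch. The card does not give the mixing formula, so the linear blend controlled by `alpha` below is an assumption for illustration only, as are the function and argument names; quant-tuner's actual computation may differ.

```python
def hybrid_importance(activations, weight, alpha):
    """Per-input-column importance for one tensor.

    activations: list of input vectors a (one per calibration sample)
    weight:      matrix W as a list of rows, one column per input feature
    alpha:       hypothetical blend factor in [0, 1] (assumption, not from the card)
    """
    n_samples = len(activations)
    n_in = len(weight[0])
    # Activation energy E[a_i^2], averaged over calibration samples.
    act_energy = [sum(a[i] ** 2 for a in activations) / n_samples for i in range(n_in)]
    # Weight-column energy ||W[:, i]||^2.
    col_energy = [sum(row[i] ** 2 for row in weight) for i in range(n_in)]
    # Weighted term ||W[:, i]||^2 * E[a_i^2].
    weighted = [c * e for c, e in zip(col_energy, act_energy)]
    # Assumed linear mix of the two importance signals.
    return [(1 - alpha) * e + alpha * w for e, w in zip(act_energy, weighted)]

acts = [[1.0, 2.0], [3.0, 0.0]]
W = [[1.0, 0.0], [1.0, 2.0]]
print(hybrid_importance(acts, W, alpha=0.5))  # [7.5, 5.0]
```

With `alpha=0` this reduces to the plain activation-energy imatrix; larger values give more weight to columns whose weights are themselves large.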
🚀 Usage
Ollama
ollama run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
llama.cpp (GPU)
# Build with CUDA (-DGGML_CUDA=OFF for CPU/Metal)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/
# Run the server
./llama-server \
--model gemma-4-31B-it-IQ2_M.gguf \
--ctx-size 16384 --n-gpu-layers 999 --split-mode layer \
--flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
--parallel 1 --batch-size 2048 --ubatch-size 512 \
--host 0.0.0.0 --port 1234
OpenAI-compatible API (Python)
import json, urllib.request
def ask(content, max_tokens=256):
    """Send one user message to the local llama-server and return the reply text."""
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Gemma 4 is a thinking model: disable thinking here, or raise max_tokens
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request(
        "http://127.0.0.1:1234/v1/chat/completions",
        json.dumps(body).encode(),
        {"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]
print(ask("What is 1+1?"))
🖼️ Vision (text + image)
Gemma 4 is natively multimodal. The vision tower ships separately as
mmproj-gemma-4-31B-it-Q8_0.gguf (772 MB) so you only download it if you need
images. It pairs with any of the four quant files (IQ2_M / IQ3_M / IQ4_XS / Q5_K_S) —
the text weights are identical; the mmproj just adds the SigLIP encoder + projector.
One-shot from the CLI (llama-mtmd-cli):
./llama-mtmd-cli \
--model gemma-4-31B-it-IQ4_XS.gguf \
--mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
--image screenshot.png \
--jinja -ngl 999 --temp 0.2 -n 256 \
-p "Describe this image. What's in it?"
`--jinja` is required: Gemma 4's chat template is Jinja-based and the CLI aborts without it. `--image` can be repeated for multi-image prompts; URLs work too.
⚠️ Thinking + the CLI. Gemma 4 is a reasoning model. From `llama-mtmd-cli`, leave thinking on and give it enough budget (`-n 800+`) so the answer survives the reasoning preamble; the `--chat-template-kwargs '{"enable_thinking":false}'` flag currently returns an empty completion on the CLI path. To get a clean, reasoning-free answer, disable thinking over the HTTP server instead (below).
Vision server: host the quant with the mmproj attached (this is exactly how the
worked example below was generated). `--jinja` is required; the vision tower is loaded
via `--mmproj`:
./llama-server \
-m gemma-4-31B-it-IQ4_XS.gguf \
--mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
--jinja --ctx-size 8192 --n-gpu-layers 999 \
--host 127.0.0.1 --port 1234
Vision is purely additive — drop the --mmproj flag and you're back to the identical text-only model.
The OpenAI-compatible `/v1/chat/completions` endpoint then accepts `image_url` content
parts. With `chat_template_kwargs.enable_thinking=false` the server returns just the
answer (no reasoning preamble). This is the exact call used to generate the mecha
prompts in the worked example below:
import base64, json, urllib.request
# Read the image and base64-encode it for an inline data URL
with open("mecha.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()
body = {
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": (
            "Look at this image and write a single, detailed text-to-image "
            "generation prompt that would recreate it. Cover the subject, colors, "
            "pose, lighting, style, and background. Respond with only the prompt."
        )},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    "max_tokens": 400,
    "temperature": 0.3,
    # Return only the answer, without the reasoning preamble
    "chat_template_kwargs": {"enable_thinking": False},
}
req = urllib.request.Request(
    "http://127.0.0.1:1234/v1/chat/completions",
    json.dumps(body).encode(),
    {"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["choices"][0]["message"]["content"])
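The request above hardcodes `image/png` in the data URL. A small helper can build the same `image_url` content part for other local image files. This is a sketch: `image_part` is a hypothetical name, and it assumes the server treats other image MIME types in a data URL the same way it treats PNG.

```python
import base64
import mimetypes

def image_part(path):
    """Build an OpenAI-style image_url content part from a local image file."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type for {path!r}")
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:{mime};base64,{b64}"}}
```

The returned dict can be placed in the `content` list next to the `{"type": "text", ...}` part, and several parts can be appended for a multi-image prompt.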
Worked example: same image, three quants
IQ4_XS, IQ3_M, and IQ2_M were each given the same mecha image and asked to "write a single, detailed text-to-image generation prompt that would recreate it". Q5_K_S is not included in this comparison.
| Quant | BPW | Generated text-to-image prompt |
|---|---|---|
| IQ4_XS | 4.36 | A high-detail anime-style illustration of a heavy armored mecha robot standing in a powerful, wide-legged stance against a solid black background. The mecha features bulky, angular white and light-grey plating with deep purple accents and glowing neon green energy vents and eyes. In its right hand, it holds a massive, oversized purple railgun, and in its left hand, a glowing neon green energy blade that casts a vibrant light across the armor. The lighting is dramatic and cinematic, with strong rim lighting and high contrast, emphasizing the metallic textures and the bioluminescent glow of the green components. The art style is clean, modern mecha anime with sharp line work and smooth digital shading. |
| IQ3_M | 3.76 | Full-body shot of a heavily armored futuristic mecha robot in a powerful standing pose, designed in a detailed 90s retro anime style. The mecha features a bulky, angular white and light-grey armor plating with deep purple accents and glowing neon green energy vents and sensors integrated into the chest, legs, and eyes. It is armed with a massive, oversized purple sniper rifle held in its right hand and a glowing neon green energy sword held in its left. The lighting is dramatic and moody, with a strong rim light highlighting the silhouette against a solid black background. High contrast, clean cel-shaded lines, sharp mechanical detailing, cinematic composition. |
| IQ2_M | 2.85 | Full-body shot of a futuristic mecha robot in a stylized anime aesthetic, featuring a heavy armored chassis in a palette of white, grey, and deep purple. The mecha is posed in a powerful stance, holding a large purple futuristic firearm in its right hand and a glowing neon-green energy blade in its left hand. The design includes glowing mint-green accents and circuitry lines across the chest, legs, and head. The lighting is dramatic and moody, with a strong rim lighting and a dark, atmospheric background with subtle purple gradients and a slight digital scanline texture. High-contrast cel-shaded style with clean lines and sharp metallic reflections. |
⚡ Speculative decoding (MTP drafter)
This repo also bundles a multi-token-prediction (MTP) drafter at the repo root,
`mtp-gemma-4-31B-it.gguf` (499 MB, Q8_0), a self-quantized conversion of
`google/gemma-4-31B-it-assistant` (arch `gemma4-assistant`, `nextn_predict_layers = 4`).
It predicts up to 4 future tokens from the trunk's hidden state so llama.cpp can verify
them in a single forward pass. One drafter serves every quant (it keys off the trunk's
hidden size and vocab, not the quantization), and the trunk GGUFs are never modified:
the drafter loads as a separate `--model-draft`.
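The draft-and-verify idea can be shown with a toy sketch. It uses greedy acceptance and stand-in "models" that map a token list to the next token; the real drafter works from the trunk's hidden state and llama.cpp verifies all drafted positions in one batched pass, so this shows the control flow only, not llama.cpp's implementation.

```python
def speculative_step(target_next, draft_next, context, n_draft):
    """One draft-and-verify step; returns the tokens emitted this step."""
    # 1. The drafter proposes n_draft tokens cheaply.
    drafted = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))
    # 2. The target checks each drafted token (one batched pass in a real engine).
    emitted = []
    for tok in drafted:
        expected = target_next(context + emitted)
        if tok != expected:
            emitted.append(expected)  # first mismatch: keep the target's token, stop
            return emitted
        emitted.append(tok)
    # 3. Every draft accepted: the target still contributes one more token.
    emitted.append(target_next(context + emitted))
    return emitted

# Toy models: the target counts upward mod 10; the drafter errs after a 3.
target = lambda ctx: (ctx[-1] + 1) % 10
drafter = lambda ctx: 0 if ctx[-1] == 3 else (ctx[-1] + 1) % 10

print(speculative_step(target, drafter, [1], n_draft=4))  # [2, 3, 4]
print(speculative_step(target, target, [1], n_draft=4))   # [2, 3, 4, 5, 6]
```

The output always matches what the target alone would have produced; the drafter only changes how many tokens come out of each verification step.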
Acceptance rate vs draft depth (--spec-draft-n-max). Fraction of drafted tokens
the trunk accepted, swept over n = 1…4 for each quant (5 mixed coding/reasoning
prompts × 200 tokens, temperature=0.3, thinking off; scripts/exp046_mtp_acceptance.py,
Q5_K_S via scripts/exp047_q5ks_mtp.py — identical method).
Higher n drafts more tokens per step but lowers per-token acceptance — pick n for your
hardware (speed isn't reported here, it's machine-specific):
| Quant | n=1 | n=2 | n=3 | n=4 |
|---|---|---|---|---|
| Q5_K_S | 87.9% | 81.8% | 73.0% | 66.0% |
| IQ4_XS | 86.5% | 80.2% | 68.6% | 64.0% |
| IQ3_M | 87.2% | 79.1% | 70.8% | 64.6% |
| IQ2_M | 83.1% | 77.1% | 70.6% | 61.4% |
Acceptance holds up across all four trunks — the highest-fidelity Q5_K_S leads at every
draft depth (87.9% at n=1, still 66.0% at n=4), and even the 2-bit IQ2_M accepts 83% of
single-token drafts.
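As a rough guide for choosing `n`, the acceptance rates can be turned into an expected number of tokens emitted per trunk verification pass. The sketch assumes exactly `n` tokens are drafted at every step and that each pass yields one trunk token on top of the accepted drafts. It ignores the drafter's own cost, so it is an upper bound on throughput gain rather than a speed prediction.

```python
# Acceptance rates from the table above, indexed by n = 1..4.
ACCEPTANCE = {
    "Q5_K_S": [0.879, 0.818, 0.730, 0.660],
    "IQ4_XS": [0.865, 0.802, 0.686, 0.640],
    "IQ3_M":  [0.872, 0.791, 0.708, 0.646],
    "IQ2_M":  [0.831, 0.771, 0.706, 0.614],
}

def tokens_per_pass(n, acceptance):
    """Expected tokens per verification pass: accepted drafts plus one trunk token."""
    return 1 + n * acceptance

for quant, rates in ACCEPTANCE.items():
    row = ", ".join(f"n={n}: {tokens_per_pass(n, r):.2f}"
                    for n, r in enumerate(rates, start=1))
    print(f"{quant}: {row}")
```

For Q5_K_S at `n=4` this gives 1 + 4 × 0.660 = 3.64 tokens per pass, so under these assumptions deeper drafts still emit more tokens per pass even though per-token acceptance falls.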
Usage — add --model-draft + --spec-type draft-mtp to the server command:
./llama-server \
-m gemma-4-31B-it-IQ4_XS.gguf \
--model-draft mtp-gemma-4-31B-it.gguf \
--spec-type draft-mtp --spec-draft-n-max 4 \
--jinja -ngl 999 -fa on \
--host 127.0.0.1 --port 1234
The drafter lives at the repo root so `--spec-type draft-mtp` auto-discovers it when you load the trunk with `-hf` (no manual `--model-draft` needed): `llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ4_XS --spec-type draft-mtp --spec-draft-n-max 4`.
Needs a llama.cpp build with `gemma4-assistant` + `draft-mtp` support (any `master` after 2026-06-07; this release used `@ f3e1828`). The drafter pairs with the vision `--mmproj` too: text, image, and speculative decoding can all be active at once.
🪪 License & attribution
- Inherits the Gemma Terms of Use from the base model.
- Base weights: `google/gemma-4-31B-it`.
- MTP drafter converted from `google/gemma-4-31B-it-assistant` (same Gemma Terms of Use).
- Calibration + quantization: Quant-Tuner with vendored llama.cpp `@ f3e1828`.
- Calibration logs mined with LogMiner.