Instructions to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="pearsonkyle/gemma4-31b-imatrix-mtp-GGUF")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoModel
model = AutoModel.from_pretrained("pearsonkyle/gemma4-31b-imatrix-mtp-GGUF", dtype="auto")

llama-cpp-python

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="pearsonkyle/gemma4-31b-imatrix-mtp-GGUF",
	filename="gemma-4-31B-it-IQ2_M.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)


llama.cpp

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
# Run inference directly in the terminal:
llama cli -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
# Run inference directly in the terminal:
llama cli -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
# Run inference directly in the terminal:
./llama-cli -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
# Run inference directly in the terminal:
./build/bin/llama-cli -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Use Docker

docker model run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

vLLM

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

SGLang

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Ollama:
```
ollama run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
```

Unsloth Studio

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pearsonkyle/gemma4-31b-imatrix-mtp-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for pearsonkyle/gemma4-31b-imatrix-mtp-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for pearsonkyle/gemma4-31b-imatrix-mtp-GGUF to start chatting

Pi

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Run Hermes

hermes

Docker Model Runner
How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Docker Model Runner:
```
docker model run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M
```

Lemonade

How to use pearsonkyle/gemma4-31b-imatrix-mtp-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

Run and chat with the model

lemonade run user.gemma4-31b-imatrix-mtp-GGUF-IQ2_M

List all available models

lemonade list

🧊 Google/Gemma-4-31B-it · imatrix · GGUF

imatrix (hybrid)

📦 10.2 · 13.4 · 15.6 · 19.8 GiB (IQ2_M · IQ3_M · IQ4_XS · Q5_K_S)
🏗️ llama.cpp f3e1828
🏅 Agent · 100% patch
👁️ Text + Image · mmproj 772 MB
⚡ MTP drafter · 88% accept @ n=1
🎯 Q5_K_S · text+vision+MTP in 24 GB

🧊 What this is

An aggressively compressed (under 3 bpw) IQ2_M quantization of google/gemma-4-31B-it, calibrated with a hybrid imatrix built from real coding/tool-use logs. Runs in vanilla llama.cpp / Ollama / LM Studio — no custom runtime, no extra inference cost. Higher-bit IQ3_M (3.76 bpw), IQ4_XS (4.36 bpw), and Q5_K_S (5.55 bpw — the highest-fidelity build, KLD 0.025 vs FP16) builds are also included for users with more VRAM.

👁️ Now with vision (text + image input)

Gemma 4 is natively multimodal. This repo ships the model's vision tower as a separate mmproj-gemma-4-31B-it-Q8_0.gguf (772 MB, SigLIP-style 27-layer encoder, Q8_0 — visually lossless vs F16 at ⅔ the size). Pair it with any of the four quant files via --mmproj and the model can see images — describe screenshots, read diagrams, answer questions about a UI, and so on. The text quant is unchanged; vision adds only the small mmproj. See Usage → Vision below.

🎯 The 24 GB build (Q5_K_S)

The new Q5_K_S build (5.55 bpw, 19.85 GiB) is sized so a single 24 GB GPU can host the full stack at once: the 5-bit text trunk (19.85) + the Q8 MTP drafter (0.48) + the Q8 vision mmproj (0.75) = ~21.1 GiB, leaving ~2.9 GiB for a real KV cache (more with --cache-type-k q8_0 --cache-type-v q8_0). Near-FP16 quality (KLD 0.025), images, and speculative decoding — all on one consumer card, no offload.
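The budget arithmetic above can be checked in a few lines. This is a minimal sketch: the component sizes are the figures quoted in the paragraph, while treating the card as a full 24 GiB of usable memory is an assumption (real headroom is smaller once compute buffers are allocated).

```python
# Component sizes in GiB, as quoted in the paragraph above.
COMPONENTS_GIB = {
    "Q5_K_S text trunk": 19.85,
    "Q8 MTP drafter": 0.48,
    "Q8 vision mmproj": 0.75,
}

# Assumption: the card exposes a full 24 GiB to the runtime.
CARD_GIB = 24.0


def vram_budget(components, card_gib):
    """Return (total weight size, headroom left for the KV cache) in GiB."""
    total = sum(components.values())
    return total, card_gib - total


total, headroom = vram_budget(COMPONENTS_GIB, CARD_GIB)
print(f"weights: {total:.2f} GiB, headroom: {headroom:.2f} GiB")
# → weights: 21.08 GiB, headroom: 2.92 GiB
```

Swapping in the size of another quant file shows how much extra room the smaller builds leave for context.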

📉 ~5.6× smaller: 10.17 GiB on disk vs 57.2 GiB FP16, at ~2.85 bits/weight.

🤖 Actually agentic: 47% pass / 100% patch on a 10-instance agentic SWE-rebench holdout (IQ4_XS). IQ2_M still resolved 40%, the best of every sub-3-bpw arm tested.

🛠️ Standard GGUF: loads anywhere llama.cpp runs. No patches, kernels, or forks.
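As a rough cross-check of the size and bits-per-weight figures, the sketch below derives the compression ratio and an effective bpw from the file sizes quoted in this card. It assumes file size scales linearly with bits per weight and ignores GGUF metadata, so the results should land close to the quoted bpw values without matching them exactly.

```python
FP16_GIB = 57.20   # FP16 reference size quoted in this card
FP16_BPW = 16.0

# On-disk sizes (GiB) quoted in this card.
QUANT_GIB = {"IQ2_M": 10.17, "IQ3_M": 13.43, "IQ4_XS": 15.59, "Q5_K_S": 19.85}


def compression_ratio(quant_gib, ref_gib=FP16_GIB):
    """How many times smaller the quantized file is than the FP16 reference."""
    return ref_gib / quant_gib


def effective_bpw(quant_gib, ref_gib=FP16_GIB, ref_bpw=FP16_BPW):
    """Estimate bits per weight, assuming size is proportional to bpw."""
    return ref_bpw * quant_gib / ref_gib


for name, gib in QUANT_GIB.items():
    print(f"{name}: {compression_ratio(gib):.2f}x smaller, ~{effective_bpw(gib):.2f} bpw")
```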

📊 Unified benchmark & quality table

Agentic metrics come from a SWE-rebench holdout run through the OpenAI Agents SDK (10 instances × 3 reps). Static metrics (PPL / KLD / top-p) are measured against FP16 on a held-out eval corpus at ctx=4096. The KLD column reports the median, for robustness to per-token tails.

Metric	FP16 (ref)	Q5_K_S	IQ4_XS	IQ3_M	IQ2_M
File	—	Q5_K_S.gguf	IQ4_XS.gguf	IQ3_M.gguf	IQ2_M.gguf
Quality	-	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐	⭐
BPW	16.0	5.55	4.36	3.76	2.85
Size (GiB)	57.20	19.85	15.59	13.43	10.17
🤖 Pass Rate	—	40±8%	47±5%	33±12%	40±8%
🤖 Patch Rate	—	100%	100%	100%	100%
🤖 Tool Errors	—	11±2%	10±3%	16±2%	16±1%
🤖 Mean Tokens	—	663K±111K	575K±70K	483K±75K	558K±94K
📐 PPL	215.5	256.5	319.4	734.1	1958.7
📐 KLD (med)	0.000	0.025	0.073	0.435	1.571
📐 same_top_p	100.0%	85.5%	78.8%	63.1%	46.6%

Q5_K_S resolves 40% of the holdout (tying IQ2_M, ahead of IQ3_M) at 100% patch and a low 11% tool-error rate (on par with IQ4_XS, well under the IQ2/IQ3 arms' 16%) — while being the highest-fidelity build on the static metrics (KLD 0.025). IQ4_XS remains the agentic leader at 47%; the gap is within run-to-run noise.
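For readers unfamiliar with the static metrics, here is a minimal sketch of how a median KLD and a top-token agreement rate can be computed from per-position probability distributions. It assumes same_top_p means the share of positions where the quantized model and the FP16 reference rank the same token first; the distributions are toy values over a three-token vocabulary, not model output.

```python
import math
from statistics import median


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary.

    Assumes q is strictly positive wherever p is.
    """
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def same_top(p, q):
    """True when both distributions put their highest probability on the same token."""
    top_p = max(range(len(p)), key=p.__getitem__)
    top_q = max(range(len(q)), key=q.__getitem__)
    return top_p == top_q


def summarize(ref_dists, quant_dists):
    """Return (median KLD, fraction of positions with the same top token)."""
    pairs = list(zip(ref_dists, quant_dists))
    klds = [kl_divergence(p, q) for p, q in pairs]
    agree = sum(same_top(p, q) for p, q in pairs)
    return median(klds), agree / len(pairs)


# Hypothetical distributions for 3 token positions (toy numbers).
ref = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]
quant = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1], [0.5, 0.3, 0.2]]

kld_med, top_match = summarize(ref, quant)
print(f"median KLD = {kld_med:.3f}, same top token = {top_match:.1%}")
```

The median is used here for the same reason the table gives: a handful of badly mispredicted positions can dominate a mean.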

📌 Sampling & methodology details

Sampling: temperature=0.25, top_p=0.95, top_k=20, max_tokens=32768, ctx=131072, thinking=false. Run on Apple Silicon (Metal); the SWE-rebench linux/amd64 images ran under emulation, so wall-clock times are relative, not absolute.
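To reuse these sampling settings against a local server, a request body along the following lines should work. This is a sketch, not the benchmark harness: it assumes llama-server accepts the non-standard top_k field on its OpenAI-compatible endpoint, and the prompt is a placeholder. The context size is a server-side setting (--ctx-size), so it does not appear in the request.

```python
import json


def build_request(prompt, thinking=False):
    """Build a /v1/chat/completions body with the sampling settings listed above."""
    return {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.25,
        "top_p": 0.95,
        "top_k": 20,  # not part of the OpenAI schema; assumed to be honored by llama-server
        "max_tokens": 32768,
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


# Placeholder prompt for illustration only.
body = build_request("List the failing tests in this repository.")
print(json.dumps(body, indent=2))
```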

Pass Rate = gold tests pass after agent's patch (real resolution). Patch Rate = non-empty diff produced.
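The two definitions can be sketched as follows. The outcomes are hypothetical (3 reps of 4 instances, not benchmark data), and reporting the spread as the population standard deviation across reps is an assumption; the card does not state how its ± values were derived.

```python
from statistics import mean, pstdev

# Hypothetical outcomes, NOT benchmark data: 3 reps x 4 instances,
# each instance recorded as (gold_tests_pass, diff_text).
REPS = [
    [(True, "diff a"), (False, "diff b"), (False, "diff c"), (True, "diff d")],
    [(False, "diff a"), (False, ""), (True, "diff c"), (False, "diff d")],
    [(True, "diff a"), (True, "diff b"), (False, "diff c"), (False, "diff d")],
]


def pass_rate(rep):
    """Fraction of instances whose gold tests pass after the agent's patch."""
    return sum(passed for passed, _ in rep) / len(rep)


def patch_rate(rep):
    """Fraction of instances where the agent produced a non-empty diff."""
    return sum(bool(diff.strip()) for _, diff in rep) / len(rep)


def summarize(rates):
    """Mean and population standard deviation across reps (assumed convention)."""
    return mean(rates), pstdev(rates)


pass_mean, pass_std = summarize([pass_rate(r) for r in REPS])
patch_mean, patch_std = summarize([patch_rate(r) for r in REPS])
print(f"pass rate:  {pass_mean:.0%} ± {pass_std:.0%}")
print(f"patch rate: {patch_mean:.0%} ± {patch_std:.0%}")
```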

🔬 How it was made

Hybrid imatrix — activation energy E[a²] mixed with weight-column energy ‖W[:,i]‖²·E[a²] per tensor, collected over real coding/tool-use logs + wiki.test.raw via quant-tuner.
IQ2_M codebook — 2-bit E8-lattice non-uniform codes with per-tensor tier bumps (attention output, early ffn_down get more bits). llama-quantize decides the mix.
Vision mmproj — the model's SigLIP-style vision tower (27 layers, 280 soft tokens/image) exported separately at Q8_0 with convert_hf_to_gguf.py --mmproj (visually lossless, 772 MB), so the encoder stays high-precision while the text path runs at 2 bits. No audio encoder is shipped (the source has none).
Disjoint splits — calibration (imatrix), validation (per-tensor α gate), and eval (PPL/KLD) come from different corpora; the SWE-rebench holdout never appears in any calibration set.
Toolchain: quant-tuner for imatrix calibration, llama.cpp @ f3e1828 for final quantization. Calibration logs mined with LogMiner.
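The hybrid imatrix idea from the first bullet can be illustrated with a small pure-Python sketch. The two signals, E[a²] and ‖W[:,i]‖²·E[a²], come from the description above; blending them as a convex combination of normalized values with a per-tensor alpha is an assumption made for illustration, since the exact mixing rule used by quant-tuner is not stated here. All names and numbers are hypothetical.

```python
def activation_energy(activations):
    """E[a^2] per input channel; activations holds one row per calibration token."""
    n_tokens = len(activations)
    n_channels = len(activations[0])
    return [sum(row[i] ** 2 for row in activations) / n_tokens for i in range(n_channels)]


def column_energy(weight):
    """Squared L2 norm of each weight column, ||W[:, i]||^2."""
    n_channels = len(weight[0])
    return [sum(row[i] ** 2 for row in weight) for i in range(n_channels)]


def normalize(values):
    """Scale values so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def hybrid_importance(activations, weight, alpha):
    """Blend activation energy with weight-scaled activation energy.

    Assumption: (1 - alpha) * normalized E[a^2] + alpha * normalized
    ||W[:, i]||^2 * E[a^2]. The real mixing rule may differ.
    """
    act = activation_energy(activations)
    weighted = [w * a for w, a in zip(column_energy(weight), act)]
    return [
        (1 - alpha) * a + alpha * w
        for a, w in zip(normalize(act), normalize(weighted))
    ]


# Toy tensor: 2 output rows x 3 input channels, 2 calibration tokens.
W = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0]]
A = [[1.0, 2.0, 0.5], [1.0, 0.0, 0.5]]

print(hybrid_importance(A, W, alpha=0.5))
```

In this toy case the third channel has small activations but large weights, so raising alpha shifts importance toward it.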

🚀 Usage

Ollama

ollama run hf.co/pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ2_M

llama.cpp (GPU)

# Build with CUDA (-DGGML_CUDA=OFF for CPU/Metal)
git clone https://github.com/ggml-org/llama.cpp
cmake llama.cpp -B llama.cpp/build -DBUILD_SHARED_LIBS=OFF -DGGML_CUDA=ON
cmake --build llama.cpp/build --config Release -j --clean-first --target llama-cli llama-server
cp llama.cpp/build/bin/llama-* llama.cpp/

# Run the server
./llama-server \
    --model gemma-4-31B-it-IQ2_M.gguf \
    --ctx-size 16384 --n-gpu-layers 999 --split-mode layer \
    --flash-attn on --cache-type-k q8_0 --cache-type-v q8_0 \
    --parallel 1 --batch-size 2048 --ubatch-size 512 \
    --host 0.0.0.0 --port 1234

OpenAI-compatible API (Python)

import json, urllib.request

def ask(content, max_tokens=256):
    body = {
        "messages": [{"role": "user", "content": content}],
        "max_tokens": max_tokens,
        # Gemma 4 is a thinking model — disable or raise max_tokens
        "chat_template_kwargs": {"enable_thinking": False},
    }
    req = urllib.request.Request(
        "http://127.0.0.1:1234/v1/chat/completions",
        json.dumps(body).encode(),
        {"Content-Type": "application/json"},
    )
    return json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"]

print(ask("What is 1+1?"))

🖼️ Vision (text + image)

Gemma 4 is natively multimodal. The vision tower ships separately as mmproj-gemma-4-31B-it-Q8_0.gguf (772 MB) so you only download it if you need images. It pairs with any of the four quant files (IQ2_M / IQ3_M / IQ4_XS / Q5_K_S) — the text weights are identical; the mmproj just adds the SigLIP encoder + projector.

One-shot from the CLI (llama-mtmd-cli):

./llama-mtmd-cli \
    --model gemma-4-31B-it-IQ4_XS.gguf \
    --mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
    --image screenshot.png \
    --jinja -ngl 999 --temp 0.2 -n 256 \
    -p "Describe this image. What's in it?"

--jinja is required — Gemma 4's chat template is Jinja-based and the CLI aborts without it. --image can be repeated for multi-image prompts; URLs work too.
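When scripting the CLI, the repeated --image flag is easy to assemble programmatically. The helper below is a hypothetical convenience wrapper that uses only the flags shown in this section; the image file names and prompt are placeholders.

```python
import shlex


def mtmd_cli_args(model, mmproj, images, prompt, max_new_tokens=256, temp=0.2):
    """Assemble a llama-mtmd-cli argument list using the flags shown above."""
    args = ["./llama-mtmd-cli", "--model", model, "--mmproj", mmproj]
    for image in images:
        # --image is passed once per image for multi-image prompts.
        args += ["--image", image]
    args += ["--jinja", "-ngl", "999", "--temp", str(temp)]
    args += ["-n", str(max_new_tokens), "-p", prompt]
    return args


argv = mtmd_cli_args(
    "gemma-4-31B-it-IQ4_XS.gguf",
    "mmproj-gemma-4-31B-it-Q8_0.gguf",
    ["before.png", "after.png"],  # hypothetical file names
    "What changed between these two screenshots?",
    max_new_tokens=800,
)
print(shlex.join(argv))
# To run it: subprocess.run(argv, check=True)
```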

⚠️ Thinking + the CLI. Gemma 4 is a reasoning model. From llama-mtmd-cli, leave thinking on and give it enough budget (-n 800+) so the answer survives the reasoning preamble — the --chat-template-kwargs '{"enable_thinking":false}' flag currently returns an empty completion on the CLI path. To get a clean, reasoning-free answer, disable thinking over the HTTP server instead (below).

Vision server: host the quant with the mmproj attached (this is exactly how the worked example below was generated). --jinja is required; the vision tower is loaded via --mmproj:

./llama-server \
    -m gemma-4-31B-it-IQ4_XS.gguf \
    --mmproj mmproj-gemma-4-31B-it-Q8_0.gguf \
    --jinja --ctx-size 8192 --n-gpu-layers 999 \
    --host 127.0.0.1 --port 1234

Vision is purely additive — drop the --mmproj flag and you're back to the identical text-only model.

The OpenAI-compatible /v1/chat/completions endpoint then accepts image_url content parts. With chat_template_kwargs.enable_thinking=false the server returns just the answer (no reasoning preamble). This is the exact call used to generate the mecha prompts in the worked example below:

import base64, json, urllib.request

with open("mecha.png", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

body = {
    "messages": [{"role": "user", "content": [
        {"type": "text", "text": (
            "Look at this image and write a single, detailed text-to-image "
            "generation prompt that would recreate it. Cover the subject, colors, "
            "pose, lighting, style, and background. Respond with only the prompt."
        )},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{b64}"}},
    ]}],
    "max_tokens": 400,
    "temperature": 0.3,
    "chat_template_kwargs": {"enable_thinking": False},
}
req = urllib.request.Request(
    "http://127.0.0.1:1234/v1/chat/completions",
    json.dumps(body).encode(),
    {"Content-Type": "application/json"},
)
print(json.loads(urllib.request.urlopen(req).read())["choices"][0]["message"]["content"])

Worked example: same image, three quants

Three of the shipped quants (IQ4_XS, IQ3_M, and IQ2_M) were each given the same mecha image and asked to "write a single, detailed text-to-image generation prompt that would recreate it".

Image: anime mecha with white/lavender armor, a neon-green energy sword, and a purple railgun

Quant	BPW	Generated text-to-image prompt
IQ4_XS	4.36	A high-detail anime-style illustration of a heavy armored mecha robot standing in a powerful, wide-legged stance against a solid black background. The mecha features bulky, angular white and light-grey plating with deep purple accents and glowing neon green energy vents and eyes. In its right hand, it holds a massive, oversized purple railgun, and in its left hand, a glowing neon green energy blade that casts a vibrant light across the armor. The lighting is dramatic and cinematic, with strong rim lighting and high contrast, emphasizing the metallic textures and the bioluminescent glow of the green components. The art style is clean, modern mecha anime with sharp line work and smooth digital shading.
IQ3_M	3.76	Full-body shot of a heavily armored futuristic mecha robot in a powerful standing pose, designed in a detailed 90s retro anime style. The mecha features a bulky, angular white and light-grey armor plating with deep purple accents and glowing neon green energy vents and sensors integrated into the chest, legs, and eyes. It is armed with a massive, oversized purple sniper rifle held in its right hand and a glowing neon green energy sword held in its left. The lighting is dramatic and moody, with a strong rim light highlighting the silhouette against a solid black background. High contrast, clean cel-shaded lines, sharp mechanical detailing, cinematic composition.
IQ2_M	2.85	Full-body shot of a futuristic mecha robot in a stylized anime aesthetic, featuring a heavy armored chassis in a palette of white, grey, and deep purple. The mecha is posed in a powerful stance, holding a large purple futuristic firearm in its right hand and a glowing neon-green energy blade in its left hand. The design includes glowing mint-green accents and circuitry lines across the chest, legs, and head. The lighting is dramatic and moody, with a strong rim lighting and a dark, atmospheric background with subtle purple gradients and a slight digital scanline texture. High-contrast cel-shaded style with clean lines and sharp metallic reflections.

⚡ Speculative decoding (MTP drafter)

This repo also bundles a multi-token-prediction (MTP) drafter at the repo root, mtp-gemma-4-31B-it.gguf (499 MB, Q8_0) — a self-quantized conversion of google/gemma-4-31B-it-assistant (arch gemma4-assistant, nextn_predict_layers = 4). It predicts up to 4 future tokens from the trunk's hidden state so llama.cpp can verify them in a single forward pass. One drafter serves every quant — it keys off the trunk's hidden size / vocab, not the quantization — and the trunk GGUFs are never modified (it loads as a separate --model-draft).
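The draft-and-verify loop described above can be sketched with toy stand-ins for the two models. This is generic greedy speculative decoding, not llama.cpp's MTP implementation: the real drafter conditions on the trunk's hidden state and the runtime verifies all drafted positions in one batched pass, both of which the toy omits. The functions draft_next and target_next are hypothetical.

```python
def speculative_step(context, draft_next, target_next, n_draft):
    """One greedy draft-and-verify step. Returns (new_tokens, n_accepted)."""
    # 1. The drafter proposes n_draft tokens autoregressively.
    drafted = []
    for _ in range(n_draft):
        drafted.append(draft_next(context + drafted))

    # 2. The target checks each drafted position. A real runtime does this in
    #    one batched forward pass; here it is a loop over a toy function.
    accepted = []
    for token in drafted:
        expected = target_next(context + accepted)
        if token != expected:
            # First mismatch: keep the target's token and discard the rest.
            return accepted + [expected], len(accepted)
        accepted.append(token)

    # 3. Every draft was accepted: the target also contributes one more token.
    return accepted + [target_next(context + accepted)], len(accepted)


def target_next(ctx):
    """Toy target model: always counts up by one."""
    return ctx[-1] + 1


def draft_next(ctx):
    """Toy drafter: agrees with the target except after tokens where t % 3 == 2."""
    return 0 if ctx[-1] % 3 == 2 else ctx[-1] + 1


print(speculative_step([0], draft_next, target_next, n_draft=4))
# → ([1, 2, 3], 2)
```

Here the drafter gets two tokens right, misses the third, and the target's own token replaces it, so one step emits three tokens.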

Acceptance rate vs draft depth (--spec-draft-n-max). Fraction of drafted tokens the trunk accepted, swept over n = 1…4 for each quant (5 mixed coding/reasoning prompts × 200 tokens, temperature=0.3, thinking off; scripts/exp046_mtp_acceptance.py, Q5_K_S via scripts/exp047_q5ks_mtp.py — identical method). Higher n drafts more tokens per step but lowers per-token acceptance — pick n for your hardware (speed isn't reported here, it's machine-specific):

Quant	n=1	n=2	n=3	n=4
Q5_K_S	87.9%	81.8%	73.0%	66.0%
IQ4_XS	86.5%	80.2%	68.6%	64.0%
IQ3_M	87.2%	79.1%	70.8%	64.6%
IQ2_M	83.1%	77.1%	70.6%	61.4%

Acceptance holds up across all four trunks — the highest-fidelity Q5_K_S leads at every draft depth (87.9% at n=1, still 66.0% at n=4), and even the 2-bit IQ2_M accepts 83% of single-token drafts.
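One way to read the table when choosing n is to estimate how many tokens each trunk forward pass yields. The sketch below assumes the reported rate is accepted tokens divided by drafted tokens and that every step drafts exactly n tokens, so a pass yields about 1 + n × rate tokens. It is not a speed estimate, because drafting and verifying more tokens also costs time.

```python
# Acceptance rates (fraction of drafted tokens accepted) from the table above,
# listed for n = 1, 2, 3, 4.
ACCEPTANCE = {
    "Q5_K_S": [0.879, 0.818, 0.730, 0.660],
    "IQ4_XS": [0.865, 0.802, 0.686, 0.640],
    "IQ3_M": [0.872, 0.791, 0.708, 0.646],
    "IQ2_M": [0.831, 0.771, 0.706, 0.614],
}


def tokens_per_pass(rate, n):
    """Estimated tokens emitted per trunk forward pass.

    Assumes each step drafts exactly n tokens, of which n * rate are accepted
    on average, plus the one token the trunk always produces itself.
    """
    return 1 + n * rate


for quant, rates in ACCEPTANCE.items():
    row = ", ".join(
        f"n={n}: {tokens_per_pass(rate, n):.2f}"
        for n, rate in enumerate(rates, start=1)
    )
    print(f"{quant}: {row}")
```

Under these assumptions the estimate keeps rising with n even as per-token acceptance falls, which is why the best n depends on how expensive drafting is on a given machine.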

Usage — add --model-draft + --spec-type draft-mtp to the server command:

./llama-server \
    -m gemma-4-31B-it-IQ4_XS.gguf \
    --model-draft mtp-gemma-4-31B-it.gguf \
    --spec-type draft-mtp --spec-draft-n-max 4 \
    --jinja -ngl 999 -fa on \
    --host 127.0.0.1 --port 1234

The drafter lives at the repo root so --spec-type draft-mtp auto-discovers it when you load the trunk with -hf (no manual --model-draft needed): llama-server -hf pearsonkyle/gemma4-31b-imatrix-mtp-GGUF:IQ4_XS --spec-type draft-mtp --spec-draft-n-max 4.

Needs a llama.cpp build with gemma4-assistant + draft-mtp support (any master after 2026-06-07; this release used @ f3e1828). The drafter pairs with the vision --mmproj too — text, image, and speculative decoding can all be active at once.

🪪 License & attribution

Inherits the Gemma Terms of Use from the base model.
Base weights: google/gemma-4-31B-it.
MTP drafter converted from google/gemma-4-31B-it-assistant (same Gemma Terms of Use).
Calibration + quantization: Quant-Tuner with vendored llama.cpp @ f3e1828.
Calibration logs mined with LogMiner.