Instructions to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with libraries and local apps.

Libraries

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with llama-cpp-python:

# !pip install llama-cpp-python

from llama_cpp import Llama

llm = Llama.from_pretrained(
	repo_id="autotrust/gemma4-31B-Fable-5-Distilled-GGUF",
	filename="gemma4-31b-Fable-5-F16.gguf",
)

llm.create_chat_completion(
	messages = [
		{
			"role": "user",
			"content": [
				{
					"type": "text",
					"text": "Describe this image in one sentence."
				},
				{
					"type": "image_url",
					"image_url": {
						"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
					}
				}
			]
		}
	]
)

Local Apps

llama.cpp

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with llama.cpp:

Install (macOS, Linux)

curl -LsSf https://llama.app/install.sh | sh
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
# Run inference directly in the terminal:
llama cli -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Install from WinGet (Windows)

winget install llama.cpp
# Start a local OpenAI-compatible server with a web UI:
llama serve -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
# Run inference directly in the terminal:
llama cli -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Use pre-built binary

# Download pre-built binary from:
# https://github.com/ggerganov/llama.cpp/releases
# Start a local OpenAI-compatible server with a web UI:
./llama-server -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
# Run inference directly in the terminal:
./llama-cli -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Build from source code

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build
cmake --build build -j --target llama-server llama-cli
# Start a local OpenAI-compatible server with a web UI:
./build/bin/llama-server -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
# Run inference directly in the terminal:
./build/bin/llama-cli -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Use Docker

docker model run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

vLLM

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "autotrust/gemma4-31B-Fable-5-Distilled-GGUF"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "autotrust/gemma4-31B-Fable-5-Distilled-GGUF",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Ollama
How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Ollama:
```
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
```

Unsloth Studio

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Unsloth Studio:

Install Unsloth Studio (macOS, Linux, WSL)

curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF to start chatting

Install Unsloth Studio (Windows)

irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF to start chatting

Using HuggingFace Spaces for Unsloth

# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF to start chatting

Pi

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Pi:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Configure the model in Pi

# Install Pi:
npm install -g @mariozechner/pi-coding-agent
# Add to ~/.pi/agent/models.json:
{
  "providers": {
    "llama-cpp": {
      "baseUrl": "http://localhost:8080/v1",
      "api": "openai-completions",
      "apiKey": "none",
      "models": [
        {
          "id": "autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16"
        }
      ]
    }
  }
}

Run Pi

# Start Pi in your project directory:
pi

Hermes Agent

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Hermes Agent:

Start the llama.cpp server

# Install llama.cpp:
brew install llama.cpp
# Start a local OpenAI-compatible server:
llama serve -hf autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Configure Hermes

# Install Hermes:
curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash
hermes setup
# Point Hermes at the local server:
hermes config set model.provider custom
hermes config set model.base_url http://127.0.0.1:8080/v1
hermes config set model.default autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Run Hermes

hermes

Docker Model Runner
How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Docker Model Runner:
```
docker model run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16
```

Lemonade

How to use autotrust/gemma4-31B-Fable-5-Distilled-GGUF with Lemonade:

Pull the model

# Download Lemonade from https://lemonade-server.ai/
lemonade pull autotrust/gemma4-31B-Fable-5-Distilled-GGUF:F16

Run and chat with the model

lemonade run user.gemma4-31B-Fable-5-Distilled-GGUF-F16

List all available models

lemonade list

Gemma-4-31B-Fable-5-Distilled — GGUF (with Multimodal Vision)

Released by AutoTrust AI Lab · Converted by Cloud Yu (Chief AI Architect)
Source model: autotrust/gemma4-31B-Fable-5-Distilled · License: Gemma

GGUF quantizations of our Gemma-4-31B Fable-5 Distilled model for local inference via llama.cpp, Ollama, LM Studio, Jan, and any GGUF-compatible runtime — across macOS, Windows, Linux, iOS, and Android, on CPU, CUDA, Metal, Vulkan, and ROCm backends.

🚀 What this model is

A LoRA fine-tune of google/gemma-4-31B-it on agentic coding traces from Fable 5, with two distinctive properties:

🏆 HumanEval pass@1: 92.7% (vs Google's official 76.8% on the base model — +15.9 points)
👁 Multimodal vision fully preserved — uniquely among coding fine-tunes of Gemma 4

We achieve this by freezing layers 0–29 (the multimodal fusion stack) and applying LoRA only to layers 30–59 (the language head). The vision encoder is shipped as a separate mmproj file for use with llama-mtmd-cli and llama-server.
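
The layer split described above can be sketched as a rule over parameter names. This is a minimal illustration, not the training code used for this model: the `layers.<index>.` naming pattern and the helper names `layer_index` and `is_lora_target` are assumptions made for the example. Only the 0–29 / 30–59 split comes from the text.

```python
import re
from typing import Optional

# Assumption: decoder blocks show up in parameter names as "layers.<index>.",
# which is a common convention but is not confirmed for this checkpoint.
LAYER_PATTERN = re.compile(r"\blayers\.(\d+)\.")

FROZEN_LAYERS = range(0, 30)   # layers 0-29: kept frozen
LORA_LAYERS = range(30, 60)    # layers 30-59: receive LoRA adapters


def layer_index(param_name: str) -> Optional[int]:
    """Return the layer index encoded in a parameter name, if there is one."""
    match = LAYER_PATTERN.search(param_name)
    return int(match.group(1)) if match else None


def is_lora_target(param_name: str) -> bool:
    """True when the parameter sits in a layer that gets a LoRA adapter."""
    index = layer_index(param_name)
    return index is not None and index in LORA_LAYERS
```

A real setup would also need to exclude any vision-encoder parameters that happen to use the same `layers.<index>.` naming, since this rule looks only at the index.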

See the full model card for benchmark details, training methodology, and the layer-freezing architecture diagram.

👁 Multimodal Vision — The Killer Feature

Most Gemma fine-tunes drop vision. This one keeps it. Load the text model alongside the multimodal projector and you get a fully working text + image chat model that runs locally.

./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image /path/to/your/image.png \
  -p "Describe this image in detail."

File	Role
`gemma4-31b-Fable-5-{F16,Q8_0,Q4_K_M}.gguf`	Text decoder (load with `-m`)
`mmproj-gemma4-31b-Fable-5-F16.gguf`	Vision encoder (load with `--mmproj`)

Image inputs require both files loaded together. Text-only chat works with just the text decoder.

📦 Available Files

File	Size	Quality	Recommended Hardware
`gemma4-31b-Fable-5-F16.gguf`	58 GB	Baseline (unquantized, 16-bit)	A100 80GB / M2 Ultra / 2× RTX 4090 + offload
`gemma4-31b-Fable-5-Q8_0.gguf`	31 GB	~99% of F16 quality	RTX 4090 (24GB) + offload / M2 Max
`gemma4-31b-Fable-5-Q4_K_M.gguf`	~19 GB	Community-validated sweet spot	RTX 4080 (16GB) + offload / M2 Pro 32GB / M1 Pro
`mmproj-gemma4-31b-Fable-5-F16.gguf`	1.2 GB	Vision encoder (F16)	Loaded alongside any text model
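
As a rough aid for choosing between these files, the sketch below picks the largest text model whose weights fit in a given memory budget. The sizes are copied from the table above. The function name `pick_text_model` and the 4 GB `headroom_gb` default (an allowance for the KV cache and runtime buffers) are assumptions for illustration, and the sketch ignores partial CPU offload, which the hardware column relies on for some entries.

```python
# Sizes copied from the Available Files table (the Q4_K_M figure is approximate).
TEXT_MODELS_GB = {
    "gemma4-31b-Fable-5-F16.gguf": 58,
    "gemma4-31b-Fable-5-Q8_0.gguf": 31,
    "gemma4-31b-Fable-5-Q4_K_M.gguf": 19,
}
MMPROJ_GB = 1.2  # mmproj-gemma4-31b-Fable-5-F16.gguf


def pick_text_model(memory_gb, with_vision=False, headroom_gb=4.0):
    """Return the largest text model whose weights fit entirely, or None.

    headroom_gb is an assumed allowance, not a measured requirement.
    """
    budget = memory_gb - headroom_gb - (MMPROJ_GB if with_vision else 0.0)
    fitting = [(size, name) for name, size in TEXT_MODELS_GB.items() if size <= budget]
    return max(fitting)[1] if fitting else None
```

A `None` result means no file fits fully in memory under these assumptions; running with some layers offloaded to CPU may still work, at reduced speed.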

Quantization quality

Quant	Size	Quality	Notes
F16	58 GB	92.7% HumanEval (measured)	Reference precision
Q8_0	31 GB	~99% of F16 (estimated)	Recommended where VRAM allows; image-recognition quality is empirically indistinguishable from F16
Q4_K_M	~19 GB	Community-validated good quality	Recommended for consumer hardware. The community has converged on this as the reliable Q4 variant for Gemma 4.
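
The F16 and Q8_0 sizes can be sanity-checked from the bits stored per weight. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 31e9 parameters (implied by the "31B" name, not stated on this card) and ignores metadata and any tensors kept at a different precision. The helper name `estimated_size_gib` is hypothetical.

```python
def estimated_size_gib(n_params, bits_per_weight):
    """Approximate weight size in GiB for a given average bits per weight."""
    return n_params * bits_per_weight / 8 / 2**30


N_PARAMS = 31e9  # assumption: nominal parameter count implied by "31B"

# F16 stores 16 bits per weight. Q8_0 stores each block of 32 weights as
# 32 int8 values plus one 16-bit scale: 34 bytes per 32 weights = 8.5 bits.
for label, bits in [("F16", 16.0), ("Q8_0", 8.5)]:
    print(f"{label}: ~{estimated_size_gib(N_PARAMS, bits):.1f} GiB")
```

Under these assumptions both estimates land within about 1 GiB of the listed 58 GB and 31 GB figures, which suggests the table reports binary gigabytes.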

A note on Gemma 4 quantization maturity

Based on community feedback through llama.cpp issues and Hugging Face discussions, Gemma 4 quantization is still a maturing area. The architecture's multimodal fusion and Jinja chat template interact in ways that haven't been fully validated below Q4. As of this release:

F16, Q8_0, and Q4_K_M are recommended — these have been tested and the community has converged on Q4_K_M as the reliable Q4 variant for Gemma 4.
More aggressive quantizations (Q3, IQ3_XXS, Q2, IQ2) are not currently recommended for Gemma 4. Reports of degraded multimodal performance and chat-template misalignment exist, and the imatrix calibration data for Gemma 4 is still being refined community-wide.
We will publish additional quants only after they are validated end-to-end (text + vision + tool-use). We'd rather ship fewer reliable variants than chase smaller file sizes at the cost of quality.

If you produce a community quantization (imatrix, IQ-family, etc.) and have validated it across text generation, vision, and tool-use, please share results in the Community tab — we'll feature working community quants on this card.

🛠 Build llama.cpp with Multimodal Support

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_BUILD_MTMD=ON
cmake --build build --target llama-mtmd-cli llama-server -j

The LLAMA_BUILD_MTMD=ON flag is required to enable multimodal support.

💬 Quick Start

Option 1: Ollama (easiest)

# Recommended for consumer hardware (16GB+ VRAM):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q4_K_M

# Higher quality if you have the VRAM (24GB+):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q8_0

Option 2: llama-server (recommended for production)

./build/bin/llama-server \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --jinja \
  --host 0.0.0.0 \
  --port 8080

Exposes an OpenAI-compatible HTTP API at http://localhost:8080/v1. Send chat completions with text and/or images.

Option 3: Text chat in terminal

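# llama-mtmd-cli expects --mmproj, so text-only chat uses llama-cli
# (build it with: cmake --build build --target llama-cli -j)
./build/bin/llama-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf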

Option 4: LM Studio / Jan

Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF in the app and download. Both apps handle the chat template automatically.

🖼 Multimodal (Image + Text) Inference

CLI example

./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image ./screenshot.png \
  -p "What does this UI mockup show? Identify each component and suggest improvements." \
  --temp 0.7 \
  -n 512

llama-server HTTP API (OpenAI-compatible)

Once llama-server is running with --mmproj, send a vision request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-31b-Fable-5",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text",      "text": "Describe this image in detail."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_DATA>"}}
        ]
      }
    ],
    "max_tokens": 512
  }'
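
The same request can be sent from Python using only the standard library. This is a minimal sketch that mirrors the curl call above, assuming llama-server is already running on port 8080 with `--mmproj`. The helper names `build_vision_payload` and `describe_image` are made up for this example.

```python
import base64
import json
import urllib.request


def build_vision_payload(image_bytes, prompt, mime="image/png", max_tokens=512):
    """Build the chat-completions body used in the curl example."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": "gemma4-31b-Fable-5",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:{mime};base64,{encoded}"}},
            ],
        }],
        "max_tokens": max_tokens,
    }


def describe_image(path, prompt, base_url="http://localhost:8080/v1"):
    """POST the request to a running llama-server and return the reply text."""
    with open(path, "rb") as handle:
        payload = build_vision_payload(handle.read(), prompt)
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Calling `describe_image("./screenshot.png", "Describe this image in detail.")` returns the assistant message as a string. Pass a different `mime` value to `build_vision_payload` for JPEG or other formats.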

Tip: We recommend --jinja (load the built-in chat template from GGUF metadata) over hardcoding tokens. The model's correct chat template is embedded in the GGUF and applied automatically. If you need to inspect or override it, see tokenizer.chat_template in the GGUF metadata via llama-gguf-info.

🐍 Python (transformers — for reference; serving via llama.cpp is recommended for GGUF)

If you prefer Python and have GPU resources, use the full-precision sibling model directly with transformers:

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "autotrust/gemma4-31B-Fable-5-Distilled"   # BF16 sibling
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

image = Image.open("image.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

For GGUF + Python, use llama-cpp-python (currently limited multimodal support — track upstream for mtmd bindings).

⚙️ Recommended Generation Settings

Use case	Temperature	Top-p	Notes
Code generation	0.1–0.3	0.95	Near-deterministic, follows function signatures
Tool-use / agentic	0.3–0.5	0.95	Balance creativity and structured output
Image description	0.7	0.95	Allow descriptive variation
General chat	0.7	0.95	Default
Thinking mode (on)	0.7	0.95	Allocate ≥ 1024 max_tokens to fit reasoning chains
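
One way to apply these settings from client code is a small preset table keyed by use case. The sketch below is illustrative: the preset keys and the `sampling_params` helper are hypothetical names, and where the table gives a temperature range the midpoint is used as an assumed default. The resulting dict can be merged into an OpenAI-style chat-completions request body.

```python
# Values copied from the table above; midpoints stand in for ranges (assumption).
GENERATION_PRESETS = {
    "code": {"temperature": 0.2, "top_p": 0.95},
    "agentic": {"temperature": 0.4, "top_p": 0.95},
    "image_description": {"temperature": 0.7, "top_p": 0.95},
    "chat": {"temperature": 0.7, "top_p": 0.95},
    # Thinking mode: reserve at least 1024 tokens for the reasoning chain.
    "thinking": {"temperature": 0.7, "top_p": 0.95, "max_tokens": 1024},
}


def sampling_params(use_case, **overrides):
    """Return a copy of the preset for use_case with any overrides applied."""
    params = dict(GENERATION_PRESETS[use_case])
    params.update(overrides)
    return params
```

For example, `sampling_params("thinking", max_tokens=2048)` keeps the table's temperature and top-p while raising the token budget.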

🎯 Intended Use & Capabilities

Agentic code generation with chain-of-thought reasoning and structured tool-call outputs
Vision-grounded coding — describe a UI mockup, screenshot, or diagram and ask for code
Local-first deployment: no API keys, no telemetry, and able to run fully air-gapped
General multimodal chat with the base Gemma 4 vision quality fully preserved

⚠️ Notes & Limitations

The --jinja flag is required — the model uses a custom Jinja chat template embedded in the GGUF metadata.
For image inputs, both -m (text model) and --mmproj (vision encoder) must be loaded.
Q8_0 image-recognition quality is empirically indistinguishable from F16; Q4_K_M is the community-validated sweet spot for consumer hardware. More aggressive quants (Q3 and below) are not currently recommended for Gemma 4 — see the quantization maturity note above.
The model occasionally omits stdlib imports (re, math) in code completions — an artifact of the Fable-5 training distribution. A future revision will rebalance this.
Inherits Gemma's base-model limitations: factual recall errors are possible. Pair with retrieval for production knowledge work.
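
Until the missing-import behaviour is addressed, generated code can be post-processed before it is executed. The sketch below is one possible workaround rather than part of the model or its tooling: it parses a completion with `ast` and prepends `import re` or `import math` when the name is used but never imported. The function name `add_missing_imports` is hypothetical, and the check is deliberately simple, so a local variable that happens to be called `re` or `math` would trigger a redundant import.

```python
import ast

STDLIB_CANDIDATES = ("re", "math")  # the modules named in the note above


def add_missing_imports(source, candidates=STDLIB_CANDIDATES):
    """Prepend `import X` for each candidate that is used but not imported."""
    tree = ast.parse(source)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                imported.add((alias.asname or alias.name).split(".")[0])
        elif isinstance(node, ast.ImportFrom):
            for alias in node.names:
                imported.add(alias.asname or alias.name)
        elif isinstance(node, ast.Name):
            used.add(node.id)
    missing = [name for name in candidates if name in used and name not in imported]
    if not missing:
        return source
    return "".join(f"import {name}\n" for name in missing) + source
```

The function raises `SyntaxError` if the completion is not valid Python, which is usually worth surfacing rather than hiding.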

🔗 Related Models

🤗 autotrust/gemma4-31B-Fable-5-Distilled — Source model (BF16 full precision, transformers / vLLM / SGLang)
🤗 autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF — Our first open release: 120B-parameter MoE reasoning model

📖 Citation

@misc{autotrust2026gemma4fable5gguf,
  title        = {Gemma-4-31B-Fable-5-Distilled GGUF: Quantized Multimodal Variants
                  for Local Inference},
  author       = {{AutoTrust AI Lab} and Yu, Cloud},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF}},
  note         = {Contact: cloud.yu@autotrust.ai}
}

🏛 About AutoTrust AI Lab

AutoTrust AI Lab builds open foundation models and agentic systems for scientific research and coding. Our flagship products are PaperGuru AI (agentic academic research) and the upcoming ScienceGuru.