Gemma-4-31B-Fable-5-Distilled — GGUF (with Multimodal Vision)
Released by AutoTrust AI Lab · Converted by Cloud Yu (Chief AI Architect)

Source model: autotrust/gemma4-31B-Fable-5-Distilled · License: Gemma
GGUF quantizations of our Gemma-4-31B Fable-5 Distilled model for local inference via llama.cpp, Ollama, LM Studio, Jan, and any GGUF-compatible runtime — across macOS, Windows, Linux, iOS, and Android, on CPU, CUDA, Metal, Vulkan, and ROCm backends.
🚀 What this model is
A LoRA fine-tune of google/gemma-4-31B-it on agentic coding traces from Fable 5, with two distinctive properties:
- 🏆 HumanEval pass@1: 92.7% (vs Google's official 76.8% on the base model — +15.9 points)
- 👁 Multimodal vision fully preserved — uniquely among coding fine-tunes of Gemma 4
We achieve this by freezing layers 0–29 (the multimodal fusion stack) and applying LoRA only to layers 30–59 (the language head). The vision encoder is shipped as a separate mmproj file for use with llama-mtmd-cli and llama-server.
See the full model card for benchmark details, training methodology, and the layer-freezing architecture diagram.
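The layer split described above can be sketched in a few lines. This is an illustration only, not the training code: the total of 60 decoder layers is inferred from the stated ranges 0–29 and 30–59, and `split_layers` is a hypothetical helper name. With the PEFT library, a list like `trainable` is the kind of value accepted by `LoraConfig(layers_to_transform=...)`; whether the authors configured it that way is not stated here.

```python
def split_layers(num_layers: int = 60, first_trainable: int = 30):
    """Return (frozen, trainable) decoder-layer indices.

    Assumption: 60 decoder layers, inferred from the ranges 0-29 and 30-59.
    """
    if not 0 <= first_trainable <= num_layers:
        raise ValueError("first_trainable must lie within [0, num_layers]")
    frozen = list(range(first_trainable))                  # left untouched
    trainable = list(range(first_trainable, num_layers))   # receive LoRA adapters
    return frozen, trainable


frozen, trainable = split_layers()
print(f"frozen: {frozen[0]}-{frozen[-1]}, LoRA: {trainable[0]}-{trainable[-1]}")
# prints "frozen: 0-29, LoRA: 30-59"
```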
👁 Multimodal Vision — The Killer Feature
Most Gemma fine-tunes drop vision. This one keeps it. Load the text model alongside the multimodal projector and you get a fully working text + image chat model that runs locally.
./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image /path/to/your/image.png \
  -p "Describe this image in detail."
| File | Role |
|---|---|
| `gemma4-31b-Fable-5-{F16,Q8_0}.gguf` | Text decoder (load with `-m`) |
| `mmproj-gemma4-31b-Fable-5-F16.gguf` | Vision encoder (load with `--mmproj`) |
Required: Both files together for image inputs. Text-only chat works with just the text decoder.
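When launching the CLI from a script, it can help to build the argument list in one place so the text decoder and the projector are always passed together. A minimal sketch, assuming the binary path from a local source build; `mtmd_cli_args` is a hypothetical helper name, and the flags simply mirror the command shown above.

```python
import shlex


def mtmd_cli_args(model, mmproj, image, prompt,
                  binary="./build/bin/llama-mtmd-cli"):
    """Build the argv list for one image + text run.

    Both the text decoder (-m) and the vision encoder (--mmproj) are
    mandatory arguments here, matching the pairing rule above.
    """
    return [
        binary, "--jinja",
        "-m", model,
        "--mmproj", mmproj,
        "--image", image,
        "-p", prompt,
    ]


args = mtmd_cli_args(
    "gemma4-31b-Fable-5-Q8_0.gguf",
    "mmproj-gemma4-31b-Fable-5-F16.gguf",
    "/path/to/your/image.png",
    "Describe this image in detail.",
)
print(shlex.join(args))  # shell-quoted form, handy for logging
```

To actually run the command, pass `args` to `subprocess.run(args, check=True)`.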
📦 Available Files
| File | Size | Quality | Recommended Hardware |
|---|---|---|---|
| `gemma4-31b-Fable-5-F16.gguf` | 58 GB | Baseline (full precision) | A100 80GB / M2 Ultra / 2× RTX 4090 |
| `gemma4-31b-Fable-5-Q8_0.gguf` | 31 GB | ~99% of F16 quality | RTX 4090 (24GB) + offload / M2 Max |
| `gemma4-31b-Fable-5-Q4_K_M.gguf` | ~19 GB | Community-validated sweet spot | RTX 4080 (16GB) / M2 Pro 32GB / Mac M1 Pro |
| `mmproj-gemma4-31b-Fable-5-F16.gguf` | 1.2 GB | Vision encoder (F16) | Loaded alongside any text model |
Quantization quality
| Quant | Size | Quality | Notes |
|---|---|---|---|
| F16 | 58 GB | 92.7% HumanEval (measured) | Reference precision |
| Q8_0 | 31 GB | ~99% of F16 (estimated) | Recommended where VRAM allows — visually identical image quality |
| Q4_K_M | ~19 GB | Community-validated good quality | Recommended for consumer hardware. The community has converged on this as the reliable Q4 variant for Gemma 4. |
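As a rough aid for choosing among these files, the sizes in the table can be compared against available memory. The sketch below is a heuristic, not a guarantee: `pick_quant` is a hypothetical helper, the 4 GB headroom for KV cache and runtime buffers is an assumed figure, and it only answers whether a file fits entirely in memory without CPU offload.

```python
# File sizes in GB, copied from the tables above (Q4_K_M is approximate).
QUANT_SIZES_GB = {"F16": 58, "Q8_0": 31, "Q4_K_M": 19}
MMPROJ_GB = 1.2  # vision encoder, loaded alongside the text model


def pick_quant(memory_gb, with_vision=True, headroom_gb=4.0):
    """Return the largest quant that fits in memory, or None if none does.

    headroom_gb is an assumed allowance for KV cache and runtime buffers.
    """
    budget = memory_gb - headroom_gb - (MMPROJ_GB if with_vision else 0.0)
    fitting = [(size, name) for name, size in QUANT_SIZES_GB.items()
               if size <= budget]
    return max(fitting)[1] if fitting else None


for mem in (32, 48, 80):
    print(f"{mem} GB -> {pick_quant(mem)}")
```

A result of `None` does not mean the model cannot run; it means partial offload to system RAM would be needed, as the hardware column above suggests for smaller GPUs.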
A note on Gemma 4 quantization maturity
Based on community feedback through llama.cpp issues and Hugging Face discussions, Gemma 4 quantization is still a maturing area. The architecture's multimodal fusion and Jinja chat template interact in ways that haven't been fully validated below Q4. As of this release:
- F16, Q8_0, and Q4_K_M are recommended — these have been tested and the community has converged on Q4_K_M as the reliable Q4 variant for Gemma 4.
- More aggressive quantizations (Q3, IQ3_XXS, Q2, IQ2) are not currently recommended for Gemma 4. Reports of degraded multimodal performance and chat-template misalignment exist, and the imatrix calibration data for Gemma 4 is still being refined community-wide.
- We will publish additional quants only after they are validated end-to-end (text + vision + tool-use). We'd rather ship fewer reliable variants than chase smaller file sizes at the cost of quality.
If you produce a community quantization (imatrix, IQ-family, etc.) and have validated it across text generation, vision, and tool-use, please share results in the Community tab — we'll feature working community quants on this card.
🛠 Build llama.cpp with Multimodal Support
git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_BUILD_MTMD=ON
cmake --build build --target llama-mtmd-cli llama-server -j
The LLAMA_BUILD_MTMD=ON flag is required to enable multimodal support.
💬 Quick Start
Option 1: Ollama (easiest)
# Recommended for consumer hardware (16GB+ VRAM):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q4_K_M
# Higher quality if you have the VRAM (24GB+):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q8_0
Option 2: llama-server (recommended for production)
./build/bin/llama-server \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --jinja \
  --host 0.0.0.0 \
  --port 8080
Exposes an OpenAI-compatible HTTP API at http://localhost:8080/v1. Send chat completions with text and/or images.
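The endpoint can also be called from Python using only the standard library. A minimal sketch, assuming the server was started as in Option 2 on port 8080; `chat_request` and `send` are hypothetical helper names, and the `model` value is treated as a label only, on the assumption that a single-model server answers with whatever model it has loaded.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 above


def chat_request(prompt, max_tokens=512, temperature=0.7):
    """Build (but do not send) a text-only chat completion request."""
    body = {
        "model": "gemma4-31b-Fable-5",  # assumption: used as a label only
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=300):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        reply = json.loads(resp.read())
    return reply["choices"][0]["message"]["content"]


req = chat_request("Write a Python function that reverses a string.")
print(req.get_method(), req.full_url)
```

With the server running, `print(send(req))` prints the model's answer. Building and sending are kept separate so the payload can be inspected or logged before any network call.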
Option 3: Text chat in terminal
# llama-cli is the text-only client; build it with: cmake --build build --target llama-cli -j
./build/bin/llama-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf
Option 4: LM Studio / Jan
Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF in the app and download. Both apps handle the chat template automatically.
🖼 Multimodal (Image + Text) Inference
CLI example
./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image ./screenshot.png \
  -p "What does this UI mockup show? Identify each component and suggest improvements." \
  --temp 0.7 \
  -n 512
llama-server HTTP API (OpenAI-compatible)
Once llama-server is running with --mmproj, send a vision request:
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-31b-Fable-5",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text", "text": "Describe this image in detail."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_DATA>"}}
        ]
      }
    ],
    "max_tokens": 512
  }'
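The `<BASE64_DATA>` placeholder in the curl request above has to be filled with the encoded image bytes. One way to produce it in Python is sketched below; `image_data_uri` and `vision_payload` are hypothetical helper names, and the fallback to `image/png` when the file type cannot be guessed is an assumption.

```python
import base64
import mimetypes
from pathlib import Path


def image_data_uri(path):
    """Encode a local image file as a data URI for the image_url field."""
    mime = mimetypes.guess_type(str(path))[0] or "image/png"  # assumed fallback
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def vision_payload(image_path, prompt, max_tokens=512):
    """Build a JSON-serializable body mirroring the curl request above."""
    return {
        "model": "gemma4-31b-Fable-5",
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": image_data_uri(image_path)}},
            ],
        }],
        "max_tokens": max_tokens,
    }
```

The returned dictionary can be serialized with `json.dumps` and posted to `/v1/chat/completions` with any HTTP client.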
Tip: We recommend `--jinja` (load the built-in chat template from GGUF metadata) over hardcoding tokens. The model's correct chat template is embedded in the GGUF and applied automatically. If you need to inspect or override it, see `tokenizer.chat_template` in the GGUF metadata via `llama-gguf-info`.
🐍 Python (transformers — for reference; serving via llama.cpp is recommended for GGUF)
If you prefer Python and have GPU resources, use the full-precision sibling model directly with transformers:
from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "autotrust/gemma4-31B-Fable-5-Distilled"  # BF16 sibling

# Load the processor (tokenizer + image preprocessing) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

image = Image.open("image.png").convert("RGB")

# The {"type": "image"} entry marks where the image sits in the prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)

# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
For GGUF + Python, use llama-cpp-python (currently limited multimodal support — track upstream for mtmd bindings).
⚙️ Recommended Generation Settings
| Use case | Temperature | Top-p | Notes |
|---|---|---|---|
| Code generation | 0.1–0.3 | 0.95 | Deterministic, follows function signatures |
| Tool-use / agentic | 0.3–0.5 | 0.95 | Balance creativity and structured output |
| Image description | 0.7 | 0.95 | Allow descriptive variation |
| General chat | 0.7 | 0.95 | Default |
| Thinking mode (on) | 0.7 | 0.95 | Allocate ≥ 1024 max_tokens to fit reasoning chains |
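These settings can be kept in one place and applied per request. The sketch below transcribes the table into a lookup; the use-case keys, the `sampling_params` helper, the default of 512 output tokens, and the choice to default to the low end of each temperature range are all assumptions made for illustration.

```python
# (min_temperature, max_temperature, top_p), transcribed from the table above.
RECOMMENDED = {
    "code": (0.1, 0.3, 0.95),
    "tool_use": (0.3, 0.5, 0.95),
    "image_description": (0.7, 0.7, 0.95),
    "chat": (0.7, 0.7, 0.95),
    "thinking": (0.7, 0.7, 0.95),
}
MIN_THINKING_TOKENS = 1024  # room for reasoning chains


def sampling_params(use_case, temperature=None, max_tokens=512):
    """Return request parameters for a use case, clamped to the table."""
    lo, hi, top_p = RECOMMENDED[use_case]
    if temperature is None:
        temperature = lo  # assumption: default to the low end of the range
    temperature = min(max(temperature, lo), hi)  # clamp into the range
    if use_case == "thinking":
        max_tokens = max(max_tokens, MIN_THINKING_TOKENS)
    return {"temperature": temperature, "top_p": top_p, "max_tokens": max_tokens}


print(sampling_params("code"))
print(sampling_params("thinking"))
```

The returned dictionary uses the field names of the OpenAI-compatible chat completions API, so it can be merged directly into a request body.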
🎯 Intended Use & Capabilities
- Agentic code generation with chain-of-thought reasoning and structured tool-call outputs
- Vision-grounded coding — describe a UI mockup, screenshot, or diagram and ask for code
- Local-first deployment — no API keys, no telemetry, fully air-gapped capable
- General multimodal chat with the base Gemma 4 vision quality fully preserved
⚠️ Notes & Limitations
- The `--jinja` flag is required: the model uses a custom Jinja chat template embedded in the GGUF metadata.
- For image inputs, both `-m` (text model) and `--mmproj` (vision encoder) must be loaded.
- Q8_0 image-recognition quality is empirically indistinguishable from F16; Q4_K_M is the community-validated sweet spot for consumer hardware. More aggressive quants (Q3 and below) are not currently recommended for Gemma 4; see the quantization maturity note above.
- The model occasionally omits stdlib imports (`re`, `math`) in code completions, an artifact of the Fable-5 training distribution. A future revision will rebalance this.
- Inherits Gemma's base-model limitations: factual recall errors are possible. Pair with retrieval for production knowledge work.
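Until the missing-import behaviour is rebalanced, a small post-processing step can patch completions before they are executed. The sketch below is one possible workaround, not part of the model or its tooling: `add_missing_imports` is a hypothetical helper, it only looks for attribute access such as `math.sqrt`, and it prepends imports naively, so a local variable named `re` or a leading `from __future__` import would need extra handling.

```python
import ast

STDLIB_CANDIDATES = ("re", "math")  # the two modules named above


def add_missing_imports(source, candidates=STDLIB_CANDIDATES):
    """Prepend `import X` for each candidate module used but never imported."""
    tree = ast.parse(source)
    imported, used = set(), set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            # `import a.b as c` binds `c`; `import a.b` binds `a`.
            imported.update((alias.asname or alias.name).split(".")[0]
                            for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            imported.update(alias.asname or alias.name for alias in node.names)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            used.add(node.value.id)  # e.g. `math` in `math.sqrt`
    missing = [name for name in candidates
               if name in used and name not in imported]
    return "".join(f"import {name}\n" for name in missing) + source


completion = "def hypot(a, b):\n    return math.sqrt(a * a + b * b)\n"
print(add_missing_imports(completion))
```

Code that already imports the module is returned unchanged, so the helper is safe to apply to every completion.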
🔗 Related Models
- 🤗 autotrust/gemma4-31B-Fable-5-Distilled — Source model (BF16 full precision, transformers / vLLM / SGLang)
- 🤗 autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF — Our first open release: 120B-parameter MoE reasoning model
📖 Citation
@misc{autotrust2026gemma4fable5gguf,
title = {Gemma-4-31B-Fable-5-Distilled GGUF: Quantized Multimodal Variants
for Local Inference},
author = {{AutoTrust AI Lab} and Yu, Cloud},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF}},
note = {Contact: cloud.yu@autotrust.ai}
}
🏛 About AutoTrust AI Lab
AutoTrust AI Lab builds open foundation models and agentic systems for scientific research and coding. Our flagship products are PaperGuru AI (agentic academic research) and the upcoming ScienceGuru.
- 🌐 Website: autotrust.ai
- 🤗 Hugging Face: huggingface.co/autotrust
- 📧 Contact: andy@autotrust.ai
We welcome community feedback, benchmarks, and quantization contributions — please open a thread in the Community tab.