Gemma-4-31B-Fable-5-Distilled — GGUF (with Multimodal Vision)

Released by AutoTrust AI Lab · Converted by Cloud Yu (Chief AI Architect) · Source model: autotrust/gemma4-31B-Fable-5-Distilled · License: Gemma

GGUF quantizations of our Gemma-4-31B Fable-5 Distilled model for local inference via llama.cpp, Ollama, LM Studio, Jan, and any GGUF-compatible runtime — across macOS, Windows, Linux, iOS, and Android, on CPU, CUDA, Metal, Vulkan, and ROCm backends.


🚀 What this model is

A LoRA fine-tune of google/gemma-4-31B-it on agentic coding traces from Fable 5, with two distinctive properties:

  • 🏆 HumanEval pass@1: 92.7% (vs Google's official 76.8% on the base model — +15.9 points)
  • 👁 Multimodal vision fully preserved — uniquely among coding fine-tunes of Gemma 4

We achieve this by freezing layers 0–29 (the multimodal fusion stack) and applying LoRA only to layers 30–59 (the language head). The vision encoder is shipped as a separate mmproj file for use with llama-mtmd-cli and llama-server.
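The layer split described above can be sketched as a name-based filter over decoder parameters. This is an illustration only, not the training code used for this model: the parameter naming scheme (`model.layers.<i>.…`), the exclusion of names containing `vision`, and the helper names are all assumptions. Only the 0-29 / 30-59 boundary comes from the text.

```python
import re
from typing import Optional

# Assumed naming scheme: decoder parameters look like
# "model.layers.<index>.<submodule>..." (hypothetical; check the real checkpoint).
LAYER_RE = re.compile(r"\blayers\.(\d+)\.")

FIRST_ADAPTED_LAYER = 30  # layers 0-29 stay frozen (from the description above)
LAST_ADAPTED_LAYER = 59   # layers 30-59 receive LoRA adapters


def layer_index(param_name: str) -> Optional[int]:
    """Return the layer index encoded in a parameter name, or None."""
    match = LAYER_RE.search(param_name)
    return int(match.group(1)) if match else None


def gets_lora(param_name: str) -> bool:
    """True if the parameter sits in a decoder layer inside the adapted range."""
    # Assumption: vision-encoder parameters carry "vision" in their name
    # and are never adapted, whatever their own layer index is.
    if "vision" in param_name:
        return False
    idx = layer_index(param_name)
    return idx is not None and FIRST_ADAPTED_LAYER <= idx <= LAST_ADAPTED_LAYER
```

A filter like this could drive either a `requires_grad` pass over the model's named parameters or the target-module selection of a LoRA library; which one was actually used is not stated in this card.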

See the full model card for benchmark details, training methodology, and the layer-freezing architecture diagram.


👁 Multimodal Vision — The Killer Feature

Most Gemma fine-tunes drop vision. This one keeps it. Load the text model alongside the multimodal projector and you get a fully working text + image chat model that runs locally.

./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image /path/to/your/image.png \
  -p "Describe this image in detail."

| File | Role |
| --- | --- |
| gemma4-31b-Fable-5-{F16,Q8_0}.gguf | Text decoder (load with -m) |
| mmproj-gemma4-31b-Fable-5-F16.gguf | Vision encoder (load with --mmproj) |

Required: Both files together for image inputs. Text-only chat works with just the text decoder.


📦 Available Files

| File | Size | Quality | Recommended Hardware |
| --- | --- | --- | --- |
| gemma4-31b-Fable-5-F16.gguf | 58 GB | Baseline (unquantized 16-bit) | A100 80GB / M2 Ultra / 2× RTX 4090 |
| gemma4-31b-Fable-5-Q8_0.gguf | 31 GB | ~99% of F16 quality | RTX 4090 (24GB) + offload / M2 Max |
| gemma4-31b-Fable-5-Q4_K_M.gguf | ~19 GB | Community-validated sweet spot | RTX 4080 (16GB) / M2 Pro 32GB / M1 Pro |
| mmproj-gemma4-31b-Fable-5-F16.gguf | 1.2 GB | Vision encoder (F16) | Loaded alongside any text model |
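
The sizes listed above can be roughly reproduced from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope estimate, not a measurement: the 31e9 parameter count is the nominal figure from the model name, the Q4_K_M bits-per-weight value is an approximate average, and the 4 GiB headroom for KV cache and the vision projector is an arbitrary assumption.

```python
from typing import Optional

# Bits per weight. F16 and Q8_0 follow from their storage formats
# (Q8_0 stores 32 int8 weights plus one 16-bit scale per block = 8.5 bpw).
# The Q4_K_M figure is an approximate average and varies by model (assumption).
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_K_M": 4.85}

PARAMS = 31e9  # nominal parameter count taken from the model name


def estimated_gib(quant: str, params: float = PARAMS) -> float:
    """Estimate the file size in GiB for a given quantization."""
    return params * BITS_PER_WEIGHT[quant] / 8 / 2**30


def largest_quant_that_fits(memory_gib: float, headroom_gib: float = 4.0) -> Optional[str]:
    """Pick the highest-precision quant whose estimate plus headroom fits in memory."""
    for quant in ("F16", "Q8_0", "Q4_K_M"):  # highest precision first
        if estimated_gib(quant) + headroom_gib <= memory_gib:
            return quant
    return None


if __name__ == "__main__":
    for name in BITS_PER_WEIGHT:
        print(f"{name}: ~{estimated_gib(name):.1f} GiB")
```

Under these assumptions the F16 and Q8_0 estimates round to 58 and 31, in line with the table. Real memory use also depends on context length and on how many layers are offloaded, so treat the result as a starting point only.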

Quantization quality

| Quant | Size | Quality | Notes |
| --- | --- | --- | --- |
| F16 | 58 GB | 92.7% HumanEval (measured) | Reference precision |
| Q8_0 | 31 GB | ~99% of F16 (estimated) | Recommended where VRAM allows; image-recognition quality indistinguishable from F16 |
| Q4_K_M | ~19 GB | Good (community-validated) | Recommended for consumer hardware; the community has converged on this as the reliable Q4 variant for Gemma 4 |

A note on Gemma 4 quantization maturity

Based on community feedback through llama.cpp issues and Hugging Face discussions, Gemma 4 quantization is still a maturing area. The architecture's multimodal fusion and Jinja chat template interact in ways that haven't been fully validated below Q4. As of this release:

  • F16, Q8_0, and Q4_K_M are recommended — these have been tested and the community has converged on Q4_K_M as the reliable Q4 variant for Gemma 4.
  • More aggressive quantizations (Q3, IQ3_XXS, Q2, IQ2) are not currently recommended for Gemma 4. Reports of degraded multimodal performance and chat-template misalignment exist, and the imatrix calibration data for Gemma 4 is still being refined community-wide.
  • We will publish additional quants only after they are validated end-to-end (text + vision + tool-use). We'd rather ship fewer reliable variants than chase smaller file sizes at the cost of quality.

If you produce a community quantization (imatrix, IQ-family, etc.) and have validated it across text generation, vision, and tool-use, please share results in the Community tab — we'll feature working community quants on this card.


🛠 Build llama.cpp with Multimodal Support

git clone https://github.com/ggerganov/llama.cpp.git
cd llama.cpp
cmake -B build -DLLAMA_BUILD_MTMD=ON
cmake --build build --target llama-mtmd-cli llama-server -j

The LLAMA_BUILD_MTMD=ON flag is required to enable multimodal support.


💬 Quick Start

Option 1: Ollama (easiest)

# Recommended for consumer hardware (16GB+ VRAM):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q4_K_M

# Higher quality if you have the VRAM (24GB+):
ollama run hf.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF:Q8_0

Option 2: llama-server (recommended for production)

./build/bin/llama-server \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --jinja \
  --host 0.0.0.0 \
  --port 8080

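This exposes an OpenAI-compatible HTTP API at http://localhost:8080/v1 that accepts chat completions with text, images, or both. Note that --host 0.0.0.0 listens on all network interfaces; use --host 127.0.0.1 to keep the server reachable from the local machine only.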
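
For a quick text-only check of the server started above, a standard-library client is enough. The sketch below assumes the server is reachable on port 8080 as in the command above and that the response follows the usual OpenAI chat-completions shape. The function names are illustrative, and the model name is copied from the curl example further down this card.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # matches --port 8080 in the command above


def build_chat_request(prompt: str, max_tokens: int = 512, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat-completions payload for a single user turn."""
    return {
        "model": "gemma4-31b-Fable-5",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt: str, base_url: str = BASE_URL) -> str:
    """POST the prompt to the chat-completions endpoint and return the reply text."""
    payload = json.dumps(build_chat_request(prompt)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=300) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example (requires a running server):
#   print(chat("Write a Python function that reverses a string."))
```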

Option 3: Text chat in terminal

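# llama-mtmd-cli expects --mmproj, so text-only chat uses llama-cli instead
# (build it with: cmake --build build --target llama-cli)
./build/bin/llama-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf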

Option 4: LM Studio / Jan

Search for autotrust/gemma4-31B-Fable-5-Distilled-GGUF in the app and download. Both apps handle the chat template automatically.


🖼 Multimodal (Image + Text) Inference

CLI example

./build/bin/llama-mtmd-cli --jinja \
  -m gemma4-31b-Fable-5-Q8_0.gguf \
  --mmproj mmproj-gemma4-31b-Fable-5-F16.gguf \
  --image ./screenshot.png \
  -p "What does this UI mockup show? Identify each component and suggest improvements." \
  --temp 0.7 \
  -n 512

llama-server HTTP API (OpenAI-compatible)

Once llama-server is running with --mmproj, send a vision request:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-31b-Fable-5",
    "messages": [
      {
        "role": "user",
        "content": [
          {"type": "text",      "text": "Describe this image in detail."},
          {"type": "image_url", "image_url": {"url": "data:image/png;base64,<BASE64_DATA>"}}
        ]
      }
    ],
    "max_tokens": 512
  }'
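
The `<BASE64_DATA>` placeholder in the request above has to be filled with the base64-encoded image bytes. One way to build it, sketched with the standard library only: the helper names are illustrative, and the MIME type is guessed from the file extension, which is an assumption about how your files are named.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode an image file as a data URL suitable for the image_url field."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_vision_request(image_path: str, prompt: str, max_tokens: int = 512) -> dict:
    """Build the same payload as the curl example, with the image inlined."""
    return {
        "model": "gemma4-31b-Fable-5",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }
```

The resulting dictionary can be serialized with `json.dumps` and sent to `/v1/chat/completions` exactly like the curl body.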

Tip: We recommend --jinja (load the built-in chat template from GGUF metadata) over hardcoding tokens. The model's correct chat template is embedded in the GGUF and applied automatically. If you need to inspect or override it, see tokenizer.chat_template in the GGUF metadata via llama-gguf-info.


🐍 Python (transformers — for reference; serving via llama.cpp is recommended for GGUF)

If you prefer Python and have GPU resources, use the full-precision sibling model directly with transformers:

from transformers import AutoModelForCausalLM, AutoProcessor
from PIL import Image

model_id = "autotrust/gemma4-31B-Fable-5-Distilled"   # BF16 sibling
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

image = Image.open("image.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

For GGUF + Python, use llama-cpp-python (currently limited multimodal support — track upstream for mtmd bindings).


⚙️ Recommended Generation Settings

| Use case | Temperature | Top-p | Notes |
| --- | --- | --- | --- |
| Code generation | 0.1–0.3 | 0.95 | Near-deterministic; follows function signatures |
| Tool-use / agentic | 0.3–0.5 | 0.95 | Balances creativity and structured output |
| Image description | 0.7 | 0.95 | Allows descriptive variation |
| General chat | 0.7 | 0.95 | Default |
| Thinking mode (on) | 0.7 | 0.95 | Allocate ≥ 1024 max_tokens to fit reasoning chains |
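
The settings above can be kept in one place and merged into request payloads. In this sketch the preset keys are illustrative names, and picking the midpoint of each temperature range is an assumption rather than a recommendation from this card.

```python
# Temperature ranges (low, high) and top-p per use case, copied from the table above.
PRESETS = {
    "code": {"temperature": (0.1, 0.3), "top_p": 0.95},
    "agentic": {"temperature": (0.3, 0.5), "top_p": 0.95},
    "image_description": {"temperature": (0.7, 0.7), "top_p": 0.95},
    "chat": {"temperature": (0.7, 0.7), "top_p": 0.95},
    "thinking": {"temperature": (0.7, 0.7), "top_p": 0.95, "min_max_tokens": 1024},
}


def sampling_params(use_case: str, max_tokens: int = 512) -> dict:
    """Return request-ready sampling parameters for a use case."""
    preset = PRESETS[use_case]
    low, high = preset["temperature"]
    return {
        # Assumption: use the midpoint when the table gives a range.
        "temperature": round((low + high) / 2, 2),
        "top_p": preset["top_p"],
        # Thinking mode raises max_tokens to at least the minimum from the table.
        "max_tokens": max(max_tokens, preset.get("min_max_tokens", 0)),
    }
```

For example, `{**payload, **sampling_params("code")}` overlays the code-generation settings on an existing chat-completions payload.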

🎯 Intended Use & Capabilities

  • Agentic code generation with chain-of-thought reasoning and structured tool-call outputs
  • Vision-grounded coding — describe a UI mockup, screenshot, or diagram and ask for code
  • Local-first deployment — no API keys, no telemetry, fully air-gapped capable
  • General multimodal chat with the base Gemma 4 vision quality fully preserved

⚠️ Notes & Limitations

  • The --jinja flag is required — the model uses a custom Jinja chat template embedded in the GGUF metadata.
  • For image inputs, both -m (text model) and --mmproj (vision encoder) must be loaded.
  • Q8_0 image-recognition quality is empirically indistinguishable from F16; Q4_K_M is the community-validated sweet spot for consumer hardware. More aggressive quants (Q3 and below) are not currently recommended for Gemma 4 — see the quantization maturity note above.
  • The model occasionally omits stdlib imports (re, math) in code completions — an artifact of the Fable-5 training distribution. A future revision will rebalance this.
  • Inherits Gemma's base-model limitations: factual recall errors are possible. Pair with retrieval for production knowledge work.
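
Until the missing-import issue noted above is addressed, generated code can be patched before it is run. The sketch below is a heuristic using Python's `ast` module, not part of this model or its tooling. It only looks for attribute access such as `math.sqrt` or `re.compile`, and the module list is limited to the two named above.

```python
import ast

# Modules this card says are sometimes omitted; extend as needed (assumption).
KNOWN_STDLIB = ("re", "math")


def add_missing_imports(source: str, modules=KNOWN_STDLIB) -> str:
    """Prepend `import <module>` for known modules that are used but never bound."""
    tree = ast.parse(source)
    bound = set()  # names introduced by imports, assignments, or arguments
    used = set()   # names accessed as `<name>.<attribute>`
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            bound.update((a.asname or a.name).split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            bound.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            bound.add(node.id)
        elif isinstance(node, ast.arg):
            bound.add(node.arg)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            used.add(node.value.id)
    missing = [m for m in modules if m in used and m not in bound]
    return "".join(f"import {m}\n" for m in missing) + source
```

Code that already imports the module, or that binds a local variable with the same name, is returned unchanged. The function raises `SyntaxError` if the completion does not parse.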

🔗 Related Models


📖 Citation

@misc{autotrust2026gemma4fable5gguf,
  title        = {Gemma-4-31B-Fable-5-Distilled GGUF: Quantized Multimodal Variants
                  for Local Inference},
  author       = {{AutoTrust AI Lab} and Yu, Cloud},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/autotrust/gemma4-31B-Fable-5-Distilled-GGUF}},
  note         = {Contact: cloud.yu@autotrust.ai}
}

🏛 About AutoTrust AI Lab

AutoTrust AI Lab builds open foundation models and agentic systems for scientific research and coding. Our flagship products are PaperGuru AI (agentic academic research) and the upcoming ScienceGuru.

We welcome community feedback, benchmarks, and quantization contributions — please open a thread in the Community tab.
