
Libraries

How to use autotrust/gemma4-31B-Fable-5-Distilled with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="autotrust/gemma4-31B-Fable-5-Distilled")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("autotrust/gemma4-31B-Fable-5-Distilled")
model = AutoModelForMultimodalLM.from_pretrained("autotrust/gemma4-31B-Fable-5-Distilled")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))


vLLM

How to use autotrust/gemma4-31B-Fable-5-Distilled with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "autotrust/gemma4-31B-Fable-5-Distilled"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "autotrust/gemma4-31B-Fable-5-Distilled",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker

docker model run hf.co/autotrust/gemma4-31B-Fable-5-Distilled

SGLang

How to use autotrust/gemma4-31B-Fable-5-Distilled with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "autotrust/gemma4-31B-Fable-5-Distilled" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "autotrust/gemma4-31B-Fable-5-Distilled",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "autotrust/gemma4-31B-Fable-5-Distilled" \
        --host 0.0.0.0 \
        --port 30000


Gemma-4-31B-Fable-5-Distilled

Released by AutoTrust AI Lab · Trained by Hai Yu · Base model: google/gemma-4-31B-it · Method: LoRA (r=16) · License: Gemma

A parameter-efficient fine-tune of google/gemma-4-31B-it on agentic coding traces from Fable 5, designed to lift coding and tool-use performance without sacrificing the base model's vision capabilities — a common failure mode of coding fine-tunes.

🤗 GGUF variant available: autotrust/gemma4-31B-Fable-5-Distilled-GGUF — F16 + Q8_0 with multimodal projector, runs on llama.cpp / Ollama / LM Studio / Jan.

🏆 Benchmark Results

Model	HumanEval pass@1	Δ vs Base
gemma4-31B-Fable-5-Distilled (ours)	92.7% (152/164)	+15.9 pts
`google/gemma-4-31B-it` (official)	76.8%	baseline

Evaluation: HumanEval (164 Python problems), vLLM 0.22, T=0.1, thinking=off, batch generate. Identical result (92.7%) reproduced via vLLM server API with --reasoning-parser gemma4 at T=0.2.

Why it matters: We achieve this lift with only 0.20% of parameters trainable (61.2M / 31.27B) and without degrading multimodal vision — see Layer-Freezing Strategy below.

🔬 Layer-Freezing Strategy: Preserving Multimodal Vision

Coding fine-tunes of multimodal models often degrade the vision-language fusion learned during base pretraining. We avoid this by applying LoRA adapters only to the upper half of the language model's transformer stack:

┌─────────────────────────────────────────────────────────────┐
│  Layers 30–59  │  🟢 LoRA-adapted (language head)           │  ← coding & tool-use uplift
│   (30 layers)  │     Q/K/V/O + gate/up/down projections     │
├─────────────────────────────────────────────────────────────┤
│  Layers 0–29   │  🔒 FROZEN (multimodal fusion)             │  ← vision preserved exactly
│   (30 layers)  │     Visual feature processing untouched    │
└─────────────────────────────────────────────────────────────┘
            ▲
            │
    Vision encoder (mmproj) — fully frozen

Layers	State	Role
0–29	🔒 Frozen	Low-level multimodal fusion, visual features
30–59	🟢 LoRA	Higher-level language & generation

Result: image description quality on held-out samples matches the base model bit-for-bit, while coding pass@1 rises by 15.9 points. Trainable parameters are roughly halved relative to naive full-layer LoRA (~122M → 61.2M).
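The parameter reduction follows from how LoRA weights are counted: each adapted linear layer with shape d_in × d_out gains r·(d_in + d_out) trainable weights, so the total scales linearly with the number of adapted layers. The sketch below illustrates this with hypothetical projection shapes. `HIDDEN`, `INTERMEDIATE`, and `SHAPES` are placeholders chosen for illustration, not Gemma 4's actual dimensions, so the absolute counts will not match the figures above.

```python
def lora_params_per_layer(shapes, rank):
    """LoRA adds A (rank x d_in) and B (d_out x rank) per adapted linear layer."""
    return sum(rank * (d_in + d_out) for d_in, d_out in shapes)

# Hypothetical (d_in, d_out) shapes for the seven adapted projections in one
# decoder layer. Illustrative placeholders only, NOT Gemma 4's real sizes.
HIDDEN, INTERMEDIATE = 4096, 16384
SHAPES = [
    (HIDDEN, HIDDEN),        # q_proj
    (HIDDEN, HIDDEN),        # k_proj
    (HIDDEN, HIDDEN),        # v_proj
    (HIDDEN, HIDDEN),        # o_proj
    (HIDDEN, INTERMEDIATE),  # gate_proj
    (HIDDEN, INTERMEDIATE),  # up_proj
    (INTERMEDIATE, HIDDEN),  # down_proj
]

def total_lora_params(num_layers, rank=16):
    """Trainable LoRA weights when `num_layers` identical layers are adapted."""
    return num_layers * lora_params_per_layer(SHAPES, rank)

full_stack = total_lora_params(60)  # adapters on layers 0-59
upper_half = total_lora_params(30)  # adapters on layers 30-59 only
print(upper_half / full_stack)      # 0.5
```

Whatever the real shapes are, adapting 30 of 60 identical layers gives half the trainable weights, which is consistent with the ~122M → 61.2M figure.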

LoRA target modules (regex-matched; {30..59} is shorthand for layer indices 30 through 59):

language_model.layers.{30..59}.(self_attn|mlp).(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)
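The pattern above uses shell-style shorthand for the layer range rather than regex syntax. One way to write it as a real Python regular expression is sketched below. The exact pattern used in training is not given, so `TARGET_PATTERN` and the example module names are assumptions.

```python
import re

# Assumed equivalent of the shorthand above: layer indices 30-59 only,
# attention projections under self_attn and MLP projections under mlp.
TARGET_PATTERN = re.compile(
    r".*language_model\.layers\.[3-5]\d\."
    r"(self_attn\.(q_proj|k_proj|v_proj|o_proj)"
    r"|mlp\.(gate_proj|up_proj|down_proj))"
)

def is_lora_target(module_name: str) -> bool:
    """True if a fully qualified module name falls in the adapted upper half."""
    return TARGET_PATTERN.fullmatch(module_name) is not None

# Hypothetical module names, for illustration only.
print(is_lora_target("model.language_model.layers.30.self_attn.q_proj"))  # True
print(is_lora_target("model.language_model.layers.29.self_attn.q_proj"))  # False
print(is_lora_target("model.language_model.layers.59.mlp.down_proj"))     # True
print(is_lora_target("model.vision_tower.layers.40.self_attn.q_proj"))    # False
```

PEFT's `LoraConfig` accepts a regex string for `target_modules`, so a pattern of this kind could be passed directly. Whether it matches the configuration used for this model is an assumption.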

📊 Model Details

Property	Value
Base Model	google/gemma-4-31B-it
Architecture	`Gemma4ForConditionalGeneration` (text decoder + vision encoder)
Parameters	31.27B (bfloat16)
Fine-tuning Method	LoRA
LoRA Rank	r=16, α=32, dropout=0.05
Trainable Parameters	61.2M (0.20% of total)
Sequence Length	2048 tokens
Thinking Mode	Enabled (native Gemma 4 multi-channel format)
License	Gemma Terms of Use

📚 Dataset: Quality-First Curation

Source: Glint-Research/Fable-5-traces — 23,325 raw interaction records from Fable 5, an agentic coding assistant.

Final training set: 308 conversation pairs after rigorous quality filtering.

Why so few? Quality-first curation. Each retained example is a complete tool-use conversation with verified outputs: full thinking traces, valid tool calls, and successful resolutions. In our ablations, this small high-signal set outperformed larger but noisier datasets (10K+ raw pairs) on both HumanEval and tool-use evaluations. The +15.9 point HumanEval lift was achieved with 308 examples, which suggests that for post-training of strong base models, example quality matters more than example count.

Preprocessing pipeline:

Filter to type == "message" records only
Group user–assistant message pairs by parentId
Apply Gemma 4 chat template with full thinking + tool-call structure
Completion-only loss masking: prompt → -100, only assistant response contributes to loss
Drop samples > 2048 tokens
Final: 308 high-quality conversation pairs
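The filtering, pairing, and length-cutoff steps above can be sketched as follows. This is a minimal illustration, not the actual pipeline. The `type` and `parentId` fields come from the description above, while the `id`, `role`, and `content` fields and the `count_tokens` callable are hypothetical. It also assumes that an assistant message's `parentId` points at the user message it answers. The chat-template and loss-masking steps are omitted.

```python
def build_pairs(records, count_tokens, max_tokens=2048):
    """Return (user, assistant) message pairs that fit within max_tokens."""
    # Step 1: keep only records whose type is "message".
    messages = [r for r in records if r.get("type") == "message"]
    by_id = {m.get("id"): m for m in messages}

    pairs = []
    for msg in messages:
        if msg.get("role") != "assistant":
            continue
        # Step 2: pair each assistant message with its parent user message.
        parent = by_id.get(msg.get("parentId"))
        if parent is None or parent.get("role") != "user":
            continue
        # Step 3: drop pairs that exceed the sequence-length budget.
        total = count_tokens(parent["content"]) + count_tokens(msg["content"])
        if total > max_tokens:
            continue
        pairs.append((parent, msg))
    return pairs

# Toy records with a whitespace "tokenizer" standing in for the real one.
records = [
    {"id": "u1", "type": "message", "role": "user", "content": "fix the bug", "parentId": None},
    {"id": "a1", "type": "message", "role": "assistant", "content": "done", "parentId": "u1"},
    {"id": "e1", "type": "event", "role": "assistant", "content": "noise", "parentId": "u1"},
    {"id": "u2", "type": "message", "role": "user", "content": "word " * 3000, "parentId": None},
    {"id": "a2", "type": "message", "role": "assistant", "content": "ok", "parentId": "u2"},
]
pairs = build_pairs(records, count_tokens=lambda text: len(text.split()))
print([(u["id"], a["id"]) for u, a in pairs])  # [('u1', 'a1')]
```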

Each training example contains:

Thinking blocks (type: "thinking") — chain-of-thought reasoning
Tool calls (type: "toolCall") — structured invocations with name + arguments
Text blocks (type: "text") — final response

⚙️ Prompt Loss Masking

Loss is computed only on assistant response tokens (thinking + tool calls + final text). Prompt tokens (system + user) are labeled -100, so the model is never penalized for failing to predict user input.

input_ids = prompt_ids + completion_ids
labels    = [-100] * len(prompt_ids) + completion_ids
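As a runnable version of the two lines above, the sketch below builds one training example and reports how many positions are supervised. The token IDs are made up for illustration. The value -100 is the default `ignore_index` of `torch.nn.CrossEntropyLoss`, which is why positions labeled this way contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids, completion_ids):
    """Concatenate prompt and completion; supervise only the completion."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}

def supervised_fraction(labels):
    """Share of positions that contribute to the loss."""
    kept = sum(1 for token in labels if token != IGNORE_INDEX)
    return kept / len(labels)

# Made-up token IDs: three prompt tokens followed by three completion tokens.
example = build_example(prompt_ids=[2, 105, 9731], completion_ids=[1596, 603, 106])
print(example["labels"])                       # [-100, -100, -100, 1596, 603, 106]
print(supervised_fraction(example["labels"]))  # 0.5
```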

🛠 Training Hyperparameters

Hyperparameter	Value
Optimizer	AdamW (default)
Learning Rate	2e-4
LR Scheduler	Cosine
Warmup Steps	50
Batch Size (per device)	1
Gradient Accumulation	16 (effective batch = 16)
Precision	bfloat16
Gradient Checkpointing	Enabled
Epochs	1
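For orientation, the number of optimizer steps implied by these settings can be worked out directly. The sketch assumes a single device, that the last partial batch is kept, and that warmup is linear and counted in optimizer steps. None of these details are stated above.

```python
import math

NUM_EXAMPLES = 308     # final training set size
EFFECTIVE_BATCH = 16   # 1 per device x 16 accumulation steps (single device assumed)
EPOCHS = 1
WARMUP_STEPS = 50
PEAK_LR = 2e-4

# Assumes the last partial batch still triggers an optimizer step.
optimizer_steps = math.ceil(NUM_EXAMPLES / EFFECTIVE_BATCH) * EPOCHS

def warmup_lr(step):
    """Learning rate under linear warmup (assumed shape), capped at the peak."""
    return PEAK_LR * min(1.0, step / WARMUP_STEPS)

print(optimizer_steps)  # 20
print(warmup_lr(optimizer_steps) / PEAK_LR)
```

Under these assumptions the run takes 20 optimizer steps, fewer than the 50 warmup steps, so the learning rate would reach about 40% of its peak by the end of the epoch.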

🚀 Usage

Text reasoning with thinking

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "autotrust/gemma4-31B-Fable-5-Distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user",   "content": "Write a Python function to reverse a linked list."},
]
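# Render the chat template as text with thinking mode enabled
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
# Thinking traces can be long; raise max_new_tokens if the answer is cut off
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the new tokens; special tokens are kept so the thinking markers stay visible
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))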

Multimodal (image + text)

# Reuses `processor` and `model` loaded in the text example above
from PIL import Image

image = Image.open("path/to/image.png").convert("RGB")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Serving with vLLM

pip install vllm
vllm serve autotrust/gemma4-31B-Fable-5-Distilled --reasoning-parser gemma4

🎯 Intended Use

Agentic code generation & explanation with chain-of-thought reasoning
Tool-use planning with structured JSON tool-call outputs
Image description & visual reasoning (multimodal capability fully preserved)
General-purpose chat with thinking mode

⚠️ Known Limitations

Small fine-tuning set: 308 examples. May not generalize to all coding domains; consider further fine-tuning on your domain.
Thinking-mode dependency: The model was trained with enable_thinking=True. Responses without thinking may be suboptimal — keep thinking on for production use.
Tool calls are JSON-serialized (not bound to a runtime). You provide the execution layer.
Inherits Gemma base limitations: factual recall errors, occasional hallucination — pair with retrieval for production knowledge tasks.
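Because tool calls arrive as JSON and are not bound to a runtime, the caller needs a small dispatch layer. The sketch below shows one possible shape for it. The `{"name": ..., "arguments": {...}}` layout is assumed from the "name + arguments" description of tool calls, and `read_file`, `TOOLS`, and `execute_tool_call` are hypothetical names, not part of this model or any library.

```python
import json

def read_file(path: str) -> str:
    """Hypothetical tool: stands in for whatever your runtime exposes."""
    return f"<contents of {path}>"

# Registry mapping tool names to callables. Only listed tools can run.
TOOLS = {"read_file": read_file}

def execute_tool_call(raw: str) -> dict:
    """Parse one JSON tool call and dispatch it; errors are returned, not raised."""
    try:
        call = json.loads(raw)
        name, arguments = call["name"], call["arguments"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        return {"ok": False, "error": f"malformed tool call: {exc!r}"}
    tool = TOOLS.get(name) if isinstance(name, str) else None
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": tool(**arguments)}
    except Exception as exc:  # report tool failures so they can be fed back to the model
        return {"ok": False, "error": repr(exc)}

print(execute_tool_call('{"name": "read_file", "arguments": {"path": "main.py"}}'))
# {'ok': True, 'result': '<contents of main.py>'}
```

Returning errors as data rather than raising makes it straightforward to pass a failed call back to the model as the next turn.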

📈 Evaluation: HumanEval Details

Configurations tested:

Configuration	Pass@1	Engine	Settings
vLLM offline batch	92.7% (152/164)	vLLM 0.22	T=0.1, thinking=off, batch generate
vLLM server API	92.7% (152/164)	vLLM 0.22	T=0.2, thinking=off, `--reasoning-parser gemma4`
Google Official (base)	76.8%	(internal)	T=0.1, thinking=on, base `gemma-4-31B-it`

Failure analysis (12 / 164 failed)

Type	Count	Detail
Missing imports	8	`re` (4), `math` (3), `decimal` (1), `hashlib` (1) — model omits stdlib imports
Logical errors	4	Code compiles but fails test assertions

The missing-import failures suggest a remediable distillation artifact (Fable 5 traces often elide stdlib imports). A future revision will rebalance the dataset to retain explicit imports.

Methodology

Dataset: openai/openai_humaneval (164 problems)
System prompt: "You are a Python coding assistant. Return ONLY the completed function inside ```python ... ```."
User prompt: "Complete this Python function: ```python\n{prompt}\n```"
Extraction: Parse markdown code blocks → strip signature/imports → normalize indentation
Verification: Standard prompt + body + test + check(entry_point) harness, 5s timeout
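The extraction and verification steps can be sketched roughly as below. This is an illustrative harness under stated assumptions, not the evaluation code used for the reported numbers. It omits the signature/import stripping and indentation normalization, and `extract_code`, `build_program`, and `passes` are hypothetical helper names. The `prompt`, `test`, and `entry_point` fields follow the HumanEval dataset layout, where `test` defines a `check(candidate)` function.

```python
import re
import subprocess
import sys

# The fence marker is built indirectly so this example can sit inside a fenced block.
FENCE = "`" * 3
CODE_BLOCK = re.compile(FENCE + r"(?:python)?[ \t]*\n(.*?)" + FENCE, re.DOTALL)

def extract_code(reply: str) -> str:
    """Return the first fenced code block in a reply, or the raw reply if none."""
    match = CODE_BLOCK.search(reply)
    return match.group(1) if match else reply

def build_program(prompt: str, body: str, test: str, entry_point: str) -> str:
    """Assemble prompt + body + test + check(entry_point) into one script."""
    return f"{prompt}{body}\n\n{test}\n\ncheck({entry_point})\n"

def passes(program: str, timeout_s: float = 5.0) -> bool:
    """Run the script in a fresh interpreter; pass means exit code 0 in time."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", program],
            capture_output=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0

# Toy problem in the HumanEval layout.
prompt = 'def add(a, b):\n    """Return a + b."""\n'
test = "def check(candidate):\n    assert candidate(1, 2) == 3\n"
reply = "Here you go:\n" + FENCE + "python\n    return a + b\n" + FENCE
body = extract_code(reply)
print(passes(build_program(prompt, body, test, "add")))  # True
```

Running each candidate in a separate interpreter keeps a crash or infinite loop in generated code from taking down the evaluation loop.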

Thinking mode note

On vLLM, enable_thinking=True with --reasoning-parser gemma4 produces verbose thinking traces that can exceed the token budget, resulting in finish_reason=length and empty content. Google AI Studio API handles this correctly by separating thinking from the final answer. Benchmarks above use enable_thinking=False for reliable extraction. For interactive use, keep thinking on with a higher max_tokens budget (we recommend ≥ 1024).
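The truncation case described above can be detected on the client side. The sketch below works on a parsed chat-completion response in the OpenAI-compatible format (`choices[0].finish_reason` and `choices[0].message.content`). The helper names and the sample responses are hypothetical, and the retry policy is left to the caller.

```python
def extract_answer(response: dict):
    """Return (answer, truncated) from an OpenAI-style chat completion dict."""
    choice = response["choices"][0]
    content = (choice.get("message") or {}).get("content") or ""
    truncated = choice.get("finish_reason") == "length"
    return content.strip(), truncated

def needs_retry(response: dict) -> bool:
    """True when the token budget ran out before any final answer appeared."""
    answer, truncated = extract_answer(response)
    return truncated and not answer

# Hand-written sample responses, for illustration only.
cut_off = {"choices": [{"finish_reason": "length", "message": {"content": None}}]}
complete = {"choices": [{"finish_reason": "stop", "message": {"content": "def f(): ..."}}]}

print(needs_retry(cut_off))   # True
print(needs_retry(complete))  # False
```

When `needs_retry` returns true, one option is to resend the request with a larger `max_tokens` value.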

🔗 Related Models

🤗 autotrust/gemma4-31B-Fable-5-Distilled-GGUF — quantized GGUF variants (F16 / Q8_0) + multimodal projector for local llama.cpp / Ollama / LM Studio inference
🤗 autotrust/gpt-oss-120b-Fable-5-Distilled-GGUF — our first open release: 120B-parameter MoE reasoning model

📖 Citation

@misc{autotrust2026gemma4fable5,
  title        = {Gemma-4-31B-Fable-5-Distilled: Layer-Frozen LoRA Distillation
                  Preserving Multimodal Vision},
  author       = {{AutoTrust AI Lab} and Yu, Cloud},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/autotrust/gemma4-31B-Fable-5-Distilled}},
  note         = {Contact: cloud.yu@autotrust.ai}
}

🏛 About AutoTrust AI Lab

AutoTrust AI Lab builds open foundation models and agentic systems for scientific research and coding. Our flagship products are PaperGuru AI (agentic academic research) and the upcoming ScienceGuru.