Gemma-4-31B-Fable-5-Distilled

Released by AutoTrust AI Lab · Trained by Hai Yu · Base model: google/gemma-4-31B-it · Method: LoRA (r=16) · License: Gemma

A parameter-efficient fine-tune of google/gemma-4-31B-it on agentic coding traces from Fable 5, designed to lift coding and tool-use performance without sacrificing the base model's vision capabilities — a common failure mode of coding fine-tunes.

🤗 GGUF variant available: autotrust/gemma4-31B-Fable-5-Distilled-GGUF — F16 + Q8_0 with multimodal projector, runs on llama.cpp / Ollama / LM Studio / Jan.


🏆 Benchmark Results

| Model | HumanEval pass@1 | Δ vs Base |
|---|---|---|
| gemma4-31B-Fable-5-Distilled (ours) | 92.7% (152/164) | +15.9 pts |
| google/gemma-4-31B-it (official) | 76.8% | baseline |

Evaluation: HumanEval (164 Python problems), vLLM 0.22, T=0.1, thinking=off, batch generate. Identical result (92.7%) reproduced via vLLM server API with --reasoning-parser gemma4 at T=0.2.

Why it matters: We achieve this lift with only 0.20% of parameters trainable (61.2M / 31.27B) and without degrading multimodal vision — see Layer-Freezing Strategy below.
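The headline figures can be re-derived from the counts quoted in this card. The check below is a small sketch that uses only numbers stated here; the variable names are illustrative.

```python
# Re-derive the headline numbers from the raw counts quoted in this card.
passed, total = 152, 164          # HumanEval problems solved / attempted
base_pass_rate = 76.8             # reported base-model pass@1, in percent
trainable, all_params = 61.2e6, 31.27e9

pass_rate = 100 * passed / total
delta = round(pass_rate, 1) - base_pass_rate
trainable_pct = 100 * trainable / all_params

print(f"pass@1: {pass_rate:.1f}%")                 # 92.7%
print(f"delta vs base: {delta:+.1f} pts")          # +15.9 pts
print(f"trainable fraction: {trainable_pct:.2f}%") # 0.20%
```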


🔬 Layer-Freezing Strategy: Preserving Multimodal Vision

Most coding fine-tunes of multimodal models destroy the vision-language fusion learned during base pretraining. We avoid this by applying LoRA adapters only to the upper half of the transformer stack:

┌─────────────────────────────────────────────────────────────┐
│  Layers 30–59  │  🟢 LoRA-adapted (language head)           │  ← coding & tool-use uplift
│   (30 layers)  │     Q/K/V/O + gate/up/down projections     │
├─────────────────────────────────────────────────────────────┤
│  Layers 0–29   │  🔒 FROZEN (multimodal fusion)             │  ← vision preserved exactly
│   (30 layers)  │     Visual feature processing untouched    │
└─────────────────────────────────────────────────────────────┘
            ▲
            │
    Vision encoder (mmproj) — fully frozen

| Layers | State | Role |
|---|---|---|
| 0–29 | 🔒 Frozen | Low-level multimodal fusion, visual features |
| 30–59 | 🟢 LoRA | Higher-level language & generation |

Result: image description quality on held-out samples matches the base model bit-for-bit, while HumanEval pass@1 rises by 15.9 points. Trainable parameters are roughly halved relative to naive full-layer LoRA (~122M → 61.2M).

LoRA target modules (regex-matched):

language_model.layers.{30..59}.(self_attn|mlp).(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)
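The pattern above is shorthand rather than literal regex syntax (`{30..59}` is not a regex construct). A minimal sketch of an equivalent Python regex follows, assuming module names look like `model.language_model.layers.<idx>.<block>.<proj>`; that prefix and the example names are assumptions, not taken from the released checkpoint. PEFT's `LoraConfig(target_modules=...)` is documented to treat a string value as a regex, but the card does not state how the authors configured it.

```python
import re

# Layers 30-59 only: a two-digit index whose first digit is 3, 4, or 5.
# Attention projections live under self_attn, MLP projections under mlp.
TARGET_PATTERN = re.compile(
    r"(?:.*\.)?language_model\.layers\.[3-5]\d\."
    r"(?:self_attn\.(?:q_proj|k_proj|v_proj|o_proj)"
    r"|mlp\.(?:gate_proj|up_proj|down_proj))"
)

def is_lora_target(module_name: str) -> bool:
    """Return True if the full module name matches the upper-half pattern."""
    return TARGET_PATTERN.fullmatch(module_name) is not None

# Hypothetical module names, for illustration only.
print(is_lora_target("model.language_model.layers.30.self_attn.q_proj"))  # True
print(is_lora_target("model.language_model.layers.29.mlp.up_proj"))       # False
```

Using `fullmatch` keeps a three-digit index or a vision-tower layer with the same suffix from matching by accident.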

📊 Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-31B-it |
| Architecture | Gemma4ForConditionalGeneration (text decoder + vision encoder) |
| Parameters | 31.27B (bfloat16) |
| Fine-tuning Method | LoRA |
| LoRA Rank | r=16, α=32, dropout=0.05 |
| Trainable Parameters | 61.2M (0.20% of total) |
| Sequence Length | 2048 tokens |
| Thinking Mode | Enabled (native Gemma 4 multi-channel format) |
| License | Gemma Terms of Use |

📚 Dataset: Quality-First Curation

Source: Glint-Research/Fable-5-traces — 23,325 raw interaction records from Fable 5, an agentic coding assistant.

Final training set: 308 conversation pairs after rigorous quality filtering.

Why so few? Quality-first curation. Each retained example is a complete tool-use conversation with verified outputs — full thinking traces, valid tool calls, and successful resolutions. In our ablations, this small high-signal set outperformed larger but noisier datasets (10K+ raw pairs) on both HumanEval and tool-use evaluations. The +15.9 point HumanEval lift is achieved on 308 examples, demonstrating that for post-training of strong base models, example quality dominates example count.

Preprocessing pipeline:

  1. Filter to type == "message" records only
  2. Group user–assistant message pairs by parentId
  3. Apply Gemma 4 chat template with full thinking + tool-call structure
  4. Completion-only loss masking: prompt → -100, only assistant response contributes to loss
  5. Drop samples > 2048 tokens
  6. Final: 308 high-quality conversation pairs
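Steps 1, 2, and 5 of the pipeline can be sketched roughly as below. This is an illustration under stated assumptions, not the authors' script: the record fields `id`, `role`, and `text` are hypothetical (only `type` and `parentId` appear in this card), "grouping by parentId" is read as "an assistant message whose parent is a user message", and `count_tokens` stands in for the real Gemma 4 tokenizer applied after the chat template.

```python
from typing import Callable

MAX_TOKENS = 2048  # sequence-length cap stated in this card

def build_pairs(records: list[dict], count_tokens: Callable[[str], int]) -> list[tuple[dict, dict]]:
    """Return (user, assistant) message pairs that fit within MAX_TOKENS."""
    # Step 1: keep only records of type "message".
    messages = [r for r in records if r.get("type") == "message"]
    by_id = {m["id"]: m for m in messages}  # hypothetical `id` field

    pairs = []
    for msg in messages:
        # Step 2: pair each assistant message with the user message it replies to.
        parent = by_id.get(msg.get("parentId"))
        if msg.get("role") != "assistant" or parent is None or parent.get("role") != "user":
            continue
        # Step 5: drop pairs that exceed the token budget.
        if count_tokens(parent["text"]) + count_tokens(msg["text"]) <= MAX_TOKENS:
            pairs.append((parent, msg))
    return pairs

# Toy data with a whitespace "tokenizer" (a stand-in, not the real tokenizer).
records = [
    {"type": "message", "id": "u1", "parentId": None, "role": "user", "text": "fix the bug"},
    {"type": "message", "id": "a1", "parentId": "u1", "role": "assistant", "text": "done"},
    {"type": "session", "id": "s1"},
    {"type": "message", "id": "u2", "parentId": "a1", "role": "user", "text": "x " * 3000},
    {"type": "message", "id": "a2", "parentId": "u2", "role": "assistant", "text": "ok"},
]
pairs = build_pairs(records, lambda s: len(s.split()))
print([(u["id"], a["id"]) for u, a in pairs])  # [('u1', 'a1')]
```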

Each training example contains:

  • Thinking blocks (type: "thinking") — chain-of-thought reasoning
  • Tool calls (type: "toolCall") — structured invocations with name + arguments
  • Text blocks (type: "text") — final response

⚙️ Prompt Loss Masking

Loss is computed only on assistant response tokens (thinking + tool calls + final text). Prompt tokens (system + user) are labeled -100, so the model is never penalized for failing to predict user input.

input_ids = prompt_ids + completion_ids
labels    = [-100] * len(prompt_ids) + completion_ids
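As a runnable illustration of the two lines above, the sketch below builds one masked example from made-up token ids. The helper name `build_example` and the ids are hypothetical; `-100` is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, which is why those positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss

def build_example(prompt_ids: list[int], completion_ids: list[int]) -> dict:
    """Concatenate prompt and completion; mask the prompt positions in labels."""
    input_ids = list(prompt_ids) + list(completion_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return {"input_ids": input_ids, "labels": labels}

# Made-up token ids: 4 prompt tokens, 3 completion tokens.
example = build_example([101, 7, 8, 9], [42, 43, 2])
supervised = sum(1 for t in example["labels"] if t != IGNORE_INDEX)

print(example["labels"])  # [-100, -100, -100, -100, 42, 43, 2]
print(supervised)         # 3
```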

🛠 Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW (default) |
| Learning Rate | 2e-4 |
| LR Scheduler | Cosine |
| Warmup Steps | 50 |
| Batch Size (per device) | 1 |
| Gradient Accumulation | 16 (effective batch = 16) |
| Precision | bfloat16 |
| Gradient Checkpointing | Enabled |
| Epochs | 1 |
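For a sense of scale, the number of optimizer updates implied by these settings can be estimated as below. The sketch assumes a single device and that a final partial accumulation window still triggers an update; neither detail is stated in this card, and frameworks differ on the rounding.

```python
import math

num_examples = 308        # final training set size stated in this card
per_device_batch = 1
grad_accumulation = 16
num_devices = 1           # assumption: device count is not stated
epochs = 1

effective_batch = per_device_batch * grad_accumulation * num_devices
optimizer_steps = math.ceil(num_examples / effective_batch) * epochs

print(effective_batch)   # 16
print(optimizer_steps)   # 20
```

Under those assumptions the run is shorter than the 50 listed warmup steps, so the learning-rate curve actually followed depends on details the card does not give, such as the device count and how steps are counted.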

🚀 Usage

Text reasoning with thinking

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "autotrust/gemma4-31B-Fable-5-Distilled"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user",   "content": "Write a Python function to reverse a linked list."},
]
prompt = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True, enable_thinking=True
)
inputs = processor(text=prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)  # thinking traces are verbose; leave headroom for the final answer
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=False))

Multimodal (image + text)

# Reuses `processor` and `model` loaded in the text example above.
from PIL import Image

image = Image.open("path/to/image.png").convert("RGB")

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ],
}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))

Serving with vLLM

pip install vllm
vllm serve autotrust/gemma4-31B-Fable-5-Distilled --reasoning-parser gemma4
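Once the server is running, it can be queried over vLLM's OpenAI-compatible chat endpoint. The client below is a minimal standard-library sketch: the host and port (`localhost:8000`, vLLM's default) and the use of `chat_template_kwargs` to toggle thinking are assumptions about your deployment, and the helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: default vLLM host and port
MODEL = "autotrust/gemma4-31B-Fable-5-Distilled"

def build_chat_request(user_text: str, *, enable_thinking: bool = True,
                       max_tokens: int = 1024, temperature: float = 0.2) -> dict:
    """Build the JSON body for POST /v1/chat/completions."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": user_text}],
        "max_tokens": max_tokens,
        "temperature": temperature,
        # Assumption: the server forwards these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }

def send(payload: dict) -> dict:
    """POST the payload to the running server and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)

payload = build_chat_request("Write a Python function to reverse a linked list.")
# reply = send(payload)  # requires the server started with `vllm serve` above
# print(reply["choices"][0]["message"]["content"])
```

If a reply comes back with `finish_reason` set to `length` and empty content, raising `max_tokens` is the first thing to try.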

🎯 Intended Use

  • Agentic code generation & explanation with chain-of-thought reasoning
  • Tool-use planning with structured JSON tool-call outputs
  • Image description & visual reasoning (multimodal capability fully preserved)
  • General-purpose chat with thinking mode

⚠️ Known Limitations

  • Small fine-tuning set: 308 examples. May not generalize to all coding domains; consider further fine-tuning on your domain.
  • Thinking-mode dependency: The model was trained with enable_thinking=True. Responses without thinking may be suboptimal — keep thinking on for production use.
  • Tool calls are JSON-serialized (not bound to a runtime). You provide the execution layer.
  • Inherits Gemma base limitations: factual recall errors, occasional hallucination — pair with retrieval for production knowledge tasks.

📈 Evaluation: HumanEval Details

Configurations tested:

| Configuration | Pass@1 | Engine | Settings |
|---|---|---|---|
| vLLM offline batch | 92.7% (152/164) | vLLM 0.22 | T=0.1, thinking=off, batch generate |
| vLLM server API | 92.7% (152/164) | vLLM 0.22 | T=0.2, thinking=off, --reasoning-parser gemma4 |
| Google Official (base) | 76.8% | (internal) | T=0.1, thinking=on, base gemma-4-31B-it |

Failure analysis (12 / 164 failed)

| Type | Count | Detail |
|---|---|---|
| Missing imports | 8 | re (4), math (3), decimal (1), hashlib (1); model omits stdlib imports |
| Logical errors | 4 | Code compiles but fails test assertions |

The missing-import failures suggest a remediable distillation artifact (Fable 5 traces often elide stdlib imports). A future revision will rebalance the dataset to retain explicit imports.
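Until such a revision exists, missing stdlib imports can be patched at inference time. The sketch below is one possible heuristic, not part of the authors' evaluation: it parses generated code with `ast`, looks for `module.attr` uses of a few candidate modules that were never imported, and prepends the imports. The candidate list simply mirrors the modules named in the failure table.

```python
import ast

STDLIB_CANDIDATES = ("re", "math", "decimal", "hashlib")

def missing_stdlib_imports(source: str, candidates=STDLIB_CANDIDATES) -> list[str]:
    """List candidate modules used as `module.attr` but never imported."""
    imported, used = set(), set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            # `import a.b` binds `a`; `import a as x` binds `x`.
            imported.update(a.asname or a.name.split(".")[0] for a in node.names)
        elif isinstance(node, ast.ImportFrom):
            imported.update(a.asname or a.name for a in node.names)
        elif isinstance(node, ast.Attribute) and isinstance(node.value, ast.Name):
            used.add(node.value.id)
    return [m for m in candidates if m in used and m not in imported]

def add_missing_imports(source: str) -> str:
    """Prepend `import X` lines for every missing candidate module."""
    header = "".join(f"import {m}\n" for m in missing_stdlib_imports(source))
    return header + source

snippet = "def f(s):\n    return re.sub('a', 'b', s)\n"
print(missing_stdlib_imports(snippet))  # ['re']
```

The heuristic produces a false positive if a local variable happens to share a module's name, so it is best treated as a fallback rather than a fix.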

Methodology

  • Dataset: openai/openai_humaneval (164 problems)
  • System prompt: "You are a Python coding assistant. Return ONLY the completed function inside ```python ... ```."
  • User prompt: "Complete this Python function: ```python\n{prompt}\n```"
  • Extraction: Parse markdown code blocks → strip signature/imports → normalize indentation
  • Verification: Standard prompt + body + test + check(entry_point) harness, 5s timeout
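The verification step can be sketched as follows. This is an assumed reconstruction of a standard HumanEval-style harness rather than the authors' code: it concatenates prompt, generated body, test, and the `check(entry_point)` call, then runs the result in a fresh interpreter with a 5-second timeout. The function name and the toy problem are illustrative.

```python
import subprocess
import sys
import tempfile
from pathlib import Path

def run_case(prompt: str, body: str, test: str, entry_point: str,
             timeout_s: float = 5.0) -> bool:
    """Return True if prompt + body passes the test within the timeout."""
    program = f"{prompt}{body}\n\n{test}\n\ncheck({entry_point})\n"
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / "candidate.py"
        path.write_text(program, encoding="utf-8")
        try:
            # A separate process isolates crashes and lets the timeout kill hangs.
            result = subprocess.run([sys.executable, str(path)],
                                    capture_output=True, timeout=timeout_s)
        except subprocess.TimeoutExpired:
            return False
    return result.returncode == 0

# Toy problem in HumanEval's shape (not a real benchmark item).
prompt = 'def add(a, b):\n    """Return a + b."""\n'
test = "def check(candidate):\n    assert candidate(1, 2) == 3\n"

print(run_case(prompt, "    return a + b\n", test, "add"))  # True
print(run_case(prompt, "    return a - b\n", test, "add"))  # False
```

Running untrusted model output this way still executes arbitrary code, so a sandbox or container is advisable for anything beyond local experiments.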

Thinking mode note

On vLLM, enable_thinking=True with --reasoning-parser gemma4 produces verbose thinking traces that can exceed the token budget, resulting in finish_reason=length and empty content. Google AI Studio API handles this correctly by separating thinking from the final answer. Benchmarks above use enable_thinking=False for reliable extraction. For interactive use, keep thinking on with a higher max_tokens budget (we recommend ≥ 1024).


📖 Citation

@misc{autotrust2026gemma4fable5,
  title        = {Gemma-4-31B-Fable-5-Distilled: Layer-Frozen LoRA Distillation
                  Preserving Multimodal Vision},
  author       = {{AutoTrust AI Lab} and Yu, Cloud},
  year         = {2026},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/autotrust/gemma4-31B-Fable-5-Distilled}},
  note         = {Contact: cloud.yu@autotrust.ai}
}

🏛 About AutoTrust AI Lab

AutoTrust AI Lab builds open foundation models and agentic systems for scientific research and coding. Our flagship products are PaperGuru AI (agentic academic research) and the upcoming ScienceGuru.

We welcome community feedback, benchmarks, and quantization contributions — open an issue in the Community tab.
