ono-gemma-4-12b-fable5-agent

This model is not for production use. It is an experimental research checkpoint for exploration and evaluation only. Do not deploy it in live agent systems without additional training, guardrails, and validation.

Gemma 4 12B IT, fully fine-tuned on Fable-5 agent traces for chain-of-thought reasoning and tool calling. The model emits a thought block of reasoning followed by a structured call block containing the tool name and its JSON arguments, matching the Fable-5 trace format used by coding agents.

| | |
|---|---|
| Base | google/gemma-4-12B-it |
| Method | Full fine-tune (text LM weights, not LoRA) |
| Visibility | Private |

Training

| Item | Value |
|---|---|
| Dataset | tool_use rows only (~3,600), CoT capped at 1,200 chars |
| Train / val split | 95% / 5% (seed=42) |
| Epochs | 3 |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Max sequence length | 3,072 tokens |
| Loss masking | User + CoT masked → train only on call JSON |
| Optimizer | AdamW 8-bit |
| GPU | NVIDIA H200 on Modal |
| Train loss | 0.937 |
| Eval loss | 0.400 |
| Training time | ~3h 48m |
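
The loss-masking row can be illustrated with a minimal sketch. It assumes the common convention of setting masked label positions to -100, which is the default ignore_index of PyTorch's cross-entropy loss. The training script is not part of this card, so the helper name mask_labels, the call_start index, and the token ids are all hypothetical.

```python
IGNORE_INDEX = -100  # assumption: PyTorch's default ignore_index for cross-entropy


def mask_labels(input_ids, call_start):
    """Return labels that supervise only the tokens from call_start onward.

    input_ids  : token ids of the full example (user turn + thought + call)
    call_start : index of the first call-payload token (hypothetical; a real
                 pipeline would locate it from the tokenized markers)
    """
    if not 0 <= call_start <= len(input_ids):
        raise ValueError("call_start is outside the sequence")
    return [IGNORE_INDEX] * call_start + list(input_ids[call_start:])


# Toy example with made-up ids: 4 user tokens, 3 thought tokens, 3 call tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23, 31, 32, 33]
labels = mask_labels(input_ids, call_start=7)
print(labels)
```

Under a scheme like this the loss is averaged over call tokens only, which is worth keeping in mind when comparing the reported losses with runs that supervise the full turn.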

Vision and audio towers are present in the unified Gemma 4 checkpoint but were frozen during text-only training.
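
As a rough cross-check of the schedule rows in the training table, the sketch below derives the optimizer step count and a cosine learning-rate curve. Several details are assumptions rather than facts from this card: the row count is taken as exactly 3,600, the last partial batch is kept, warmup steps are rounded up, warmup is linear, and the cosine decays to zero.

```python
import math

ROWS = 3600              # the card says "~3,600"; the exact count is an assumption
TRAIN_FRACTION = 0.95    # 95% / 5% split
EFFECTIVE_BATCH = 16     # batch 1 x grad accum 16
EPOCHS = 3
BASE_LR = 1e-5
WARMUP_RATIO = 0.03

train_rows = round(ROWS * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_rows / EFFECTIVE_BATCH)  # assumes no drop_last
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step):
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"train rows: {train_rows}, steps/epoch: {steps_per_epoch}")
print(f"total steps: {total_steps}, warmup steps: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

With these assumptions the run comes to a few hundred optimizer steps, so the 3% warmup covers only a couple of dozen of them.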

Evaluation

Batch evaluation on 50 held-out Fable-5 samples (seed=42, max_new_tokens=1024, temperature=0.2):

| Metric | Result |
|---|---|
| Tool name accuracy | 56% |
| call block emitted | 96% |
| Parseable tool JSON | 94% |

These numbers are indicative only and do not meet production reliability thresholds.
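
The three metrics above could be computed along the following lines. This is a sketch, not the evaluation script used for the card: the function names, the toy responses, and the choice to divide every metric by the total sample count are assumptions. Because the format example in this card shows a single-quoted payload, the sketch tries strict JSON first and then falls back to a Python literal.

```python
import ast
import json


def extract_call(response):
    """Return (has_call_block, parsed_call_or_None) for one generated response."""
    body = response.split("<end_of_turn>", 1)[0]
    _, marker, payload = body.rpartition("\ncall\n")
    if not marker:
        return False, None
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(payload.strip())
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(value, dict):
            return True, value
    return True, None


def score(responses, expected_tools):
    """Compute the three rates over all samples (the denominator is an assumption)."""
    n = len(responses)
    emitted = parseable = correct = 0
    for response, expected in zip(responses, expected_tools):
        has_block, call = extract_call(response)
        emitted += has_block
        parseable += call is not None
        correct += call is not None and call.get("tool") == expected
    return {
        "tool_name_accuracy": correct / n,
        "call_block_emitted": emitted / n,
        "parseable_tool_json": parseable / n,
    }


# Made-up responses covering the four outcomes.
responses = [
    "List the files.\ncall\n{'tool': 'Bash', 'input': {'command': 'ls'}}<end_of_turn>",
    'Open it.\ncall\n{"tool": "Read", "input": {"file_path": "a.py"}}<end_of_turn>',
    "Patch it.\ncall\n{'tool': 'Edit', 'input': <end_of_turn>",
    "I am not sure what to do.<end_of_turn>",
]
expected_tools = ["Bash", "Grep", "Edit", "Bash"]
results = score(responses, expected_tools)
print(results)
```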

Recommended inference settings:

| Parameter | Value |
|---|---|
| max_new_tokens | 1024 |
| temperature | 0.2 |
| do_sample | true (or greedy for max consistency) |
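
A small helper can keep these settings in one place. The sketch below is an illustration under one assumption about transformers behaviour: temperature only applies when sampling, and recent versions warn if it is passed together with do_sample=False, so the greedy variant leaves it out. The helper name generation_kwargs is hypothetical.

```python
def generation_kwargs(greedy=False):
    """Build keyword arguments for model.generate() from the recommended settings."""
    kwargs = {"max_new_tokens": 1024}
    if greedy:
        # Greedy decoding: deterministic, no sampling parameters needed.
        kwargs["do_sample"] = False
    else:
        # Low-temperature sampling, as recommended in the table above.
        kwargs["do_sample"] = True
        kwargs["temperature"] = 0.2
    return kwargs


print(generation_kwargs())
print(generation_kwargs(greedy=True))
```

The result would be passed as model.generate(**inputs, **generation_kwargs()).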

Prompt format

Each turn uses the Gemma chat-turn tokens, with an explicit thought → call structure inside the model turn:

<start_of_turn>user
{agent context: tool defs, history, task}<end_of_turn>
<start_of_turn>model
thought
{chain-of-thought reasoning}
call
{'tool': 'Edit', 'input': {'file_path': '...', 'old_string': '...', 'new_string': '...'}}<end_of_turn>
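
A generated turn in this format can be split back into its reasoning and its tool call. The following is a minimal sketch, not an official parser: the function name parse_model_turn is hypothetical, and because the example payload above uses single quotes (a Python-style literal rather than strict JSON), it tries json.loads first and falls back to ast.literal_eval.

```python
import ast
import json

END_OF_TURN = "<end_of_turn>"
CALL_MARKER = "\ncall\n"


def parse_model_turn(text):
    """Split a generated model turn into (thought, call).

    call is a dict when the payload parses, otherwise None.
    """
    body = text.split(END_OF_TURN, 1)[0]
    # Tolerate both a full turn and a continuation that starts after "thought\n".
    if body.startswith("thought\n"):
        body = body[len("thought\n"):]
    thought, marker, payload = body.rpartition(CALL_MARKER)
    if not marker:
        return body.strip(), None
    call = None
    for loader in (json.loads, ast.literal_eval):
        try:
            candidate = loader(payload.strip())
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(candidate, dict):
            call = candidate
            break
    return thought.strip(), call


sample = (
    "The task asks for Python files, so a shell listing is enough.\n"
    "call\n"
    "{'tool': 'Bash', 'input': {'command': 'ls *.py'}}<end_of_turn>"
)
thought, call = parse_model_turn(sample)
print(thought)
print(call["tool"], call["input"])
```

ast.literal_eval only evaluates Python literals, so it does not execute code from the model output; anything that fails both loaders is reported as None and should be treated as an unparseable call.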

At inference, open the model turn with the thought marker and let the model generate from there:

prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)
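
The prompt construction above can be wrapped in a small helper. This card does not specify how the agent context (tool definitions, history, task) is serialized, so the layout produced by build_context below is purely illustrative, and both function names are hypothetical. Only the turn tokens and the trailing thought marker come from the format described here.

```python
import json


def build_context(tools, history, task):
    """Serialize agent state into one user-turn string (illustrative layout only)."""
    history_text = "\n".join(history) if history else "(none)"
    return "\n\n".join([
        "Tools:\n" + json.dumps(tools, indent=2),
        "History:\n" + history_text,
        "Task:\n" + task,
    ])


def build_prompt(context):
    """Wrap the context in Gemma turn tokens and open the model turn at thought."""
    return (
        f"<start_of_turn>user\n{context}<end_of_turn>\n"
        f"<start_of_turn>model\nthought\n"
    )


tools = [{"name": "Bash", "description": "Run a shell command"}]  # made-up tool definition
prompt = build_prompt(
    build_context(tools, history=[], task="List all Python files in the current directory.")
)
print(prompt)
```

For best results the context layout should match whatever the Fable-5 traces actually use; the dataset itself is the reference for that.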

Quick start

import torch
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "junwatu/ono-gemma-4-12b-fable5-agent"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

context = "You are a coding agent. List all Python files in the current directory."
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

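inputs = tokenizer(prompt, return_tensors="pt")
# Text-only input: both type-id tensors are all zeros, with the same shape as input_ids.
inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs["mm_token_type_ids"] = torch.zeros_like(inputs["input_ids"])
# Move every tensor to the device the model was loaded on.
inputs = {k: v.to(model.device) for k, v in inputs.items()}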

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

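# Decode only the newly generated tokens (everything after the prompt).
# Special tokens are kept so the closing <end_of_turn> stays visible.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(response)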

Important: Gemma 4 unified models require token_type_ids and mm_token_type_ids (all zeros for text-only) even when not using vision or audio.

Supported tools (from training data)

Common tool names seen in Fable-5 traces include Bash, Edit, Read, Write, Grep, WebSearch, TaskUpdate, PowerShell, and MCP-prefixed tools. Accuracy varies by tool type.
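
Since tool name accuracy is limited, a caller may want to reject unknown tool names before executing anything. The sketch below is one way to do that. The tool list is copied from the sentence above and is not exhaustive, the "mcp__" prefix is an assumption (the card only says "MCP-prefixed"), and the function name is hypothetical.

```python
KNOWN_TOOLS = frozenset({
    "Bash", "Edit", "Read", "Write", "Grep",
    "WebSearch", "TaskUpdate", "PowerShell",
})
MCP_PREFIX = "mcp__"  # assumption: the exact prefix is not stated in this card


def is_known_tool(name, available=KNOWN_TOOLS):
    """Return True if name is an available tool or looks like an MCP tool."""
    if not isinstance(name, str):
        return False
    return name in available or name.startswith(MCP_PREFIX)


for candidate in ("Bash", "mcp__github__search", "Delete", None):
    print(candidate, is_known_tool(candidate))
```

In a real agent loop, available would be the set of tools actually offered in the current context rather than this fixed list.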

Limitations

  • Not for production: experimental checkpoint with ~56% tool name accuracy on a 50-sample eval set; unsuitable for live agent deployment without further work.
  • Training sequences were truncated to 3,072 tokens, so prompts longer than that fall outside the context lengths seen during fine-tuning.
  • Sampling matters: low temperature (0.2) and sufficient max_new_tokens (1024) are important for reliable call block generation.
  • Multimodal weights are included but unused; only text LM weights were fine-tuned.
  • Trained on a single agent trace style (Fable-5); may not generalize to other tool schemas without further fine-tuning.
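
The caveat about the small eval set can be made concrete. With 50 held-out samples, 56% tool name accuracy corresponds to 28 correct calls, and the sketch below computes a 95% Wilson score interval for that proportion. The interval method is a choice made for this illustration; it is not something the evaluation above reports.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


low, high = wilson_interval(28, 50)  # 56% of 50 samples
print(f"{low:.2f} to {high:.2f}")
```

The interval spans roughly 0.42 to 0.69, so the true accuracy on similar data could plausibly be anywhere from the low 40s to the high 60s in percentage terms.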

License

Built on google/gemma-4-12B-it. Use is subject to the Gemma license terms. Fable-5 dataset: Glint-Research/Fable-5-traces.
