Instructions for using junwatu/ono-gemma-4-12b-fable5-agent with libraries and local apps.

Libraries

How to use junwatu/ono-gemma-4-12b-fable5-agent with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

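# The messages below include an image and are passed as text=..., which matches the image-text-to-text task
pipe = pipeline("image-text-to-text", model="junwatu/ono-gemma-4-12b-fable5-agent")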
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("junwatu/ono-gemma-4-12b-fable5-agent")
model = AutoModelForMultimodalLM.from_pretrained("junwatu/ono-gemma-4-12b-fable5-agent")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use junwatu/ono-gemma-4-12b-fable5-agent with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "junwatu/ono-gemma-4-12b-fable5-agent"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "junwatu/ono-gemma-4-12b-fable5-agent",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
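
The same request can be issued from Python. The sketch below uses only the standard library and assumes the vLLM server started above is listening on localhost:8000. `build_chat_request` and `chat` are illustrative helper names, not part of vLLM, and the response is read using the standard OpenAI chat completions shape.

```python
import json
import urllib.request

MODEL_ID = "junwatu/ono-gemma-4-12b-fable5-agent"


def build_chat_request(prompt, base_url="http://localhost:8000"):
    """Build a POST request for the OpenAI-compatible chat endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, base_url="http://localhost:8000"):
    """Send the request and return the text of the first choice."""
    with urllib.request.urlopen(build_chat_request(prompt, base_url)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the assistant message as a string. Passing `base_url="http://localhost:30000"` should target an SGLang server started on that port in the same way.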

Use Docker

docker model run hf.co/junwatu/ono-gemma-4-12b-fable5-agent

SGLang

How to use junwatu/ono-gemma-4-12b-fable5-agent with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "junwatu/ono-gemma-4-12b-fable5-agent" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "junwatu/ono-gemma-4-12b-fable5-agent",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "junwatu/ono-gemma-4-12b-fable5-agent" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "junwatu/ono-gemma-4-12b-fable5-agent",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use junwatu/ono-gemma-4-12b-fable5-agent with Docker Model Runner:
```
docker model run hf.co/junwatu/ono-gemma-4-12b-fable5-agent
```

ono-gemma-4-12b-fable5-agent

This model is not for production use. It is an experimental research checkpoint for exploration and evaluation only. Do not deploy it in live agent systems without additional training, guardrails, and validation.

A full fine-tune of Gemma 4 12B IT on Fable-5 agent traces for chain-of-thought reasoning and tool calling. The model emits a `thought` block of reasoning followed by a structured `call` block containing the tool name and JSON arguments, matching the Fable-5 trace format used by coding agents.


Base	`google/gemma-4-12B-it`
Method	Full fine-tune (text LM weights, not LoRA)
Visibility	Private

Training

Item	Value
Dataset	`tool_use` rows only (~3,600), CoT capped at 1,200 chars
Train / val split	95% / 5% (seed=42)
Epochs	3
Learning rate	1e-5 (cosine, 3% warmup)
Effective batch size	16 (batch 1 × grad accum 16)
Max sequence length	3,072 tokens
Loss masking	User + CoT masked → train only on `call` JSON
Optimizer	AdamW 8-bit
GPU	NVIDIA H200 on Modal
Train loss	0.937
Eval loss	0.400
Training time	~3h 48m
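
As a rough worked example, the step counts and learning-rate schedule implied by the table can be reconstructed as follows. This is a sketch under assumptions the table does not state: the dataset is taken as exactly 3,600 rows, the last partial batch of each epoch still counts as a step, warmup is linear, and the cosine decay ends at zero. The actual trainer may differ on each of these.

```python
import math

ROWS = 3600             # approximate dataset size from the table
TRAIN_FRACTION = 0.95   # 95% / 5% split
EPOCHS = 3
EFFECTIVE_BATCH = 16    # batch 1 x gradient accumulation 16
PEAK_LR = 1e-5
WARMUP_FRACTION = 0.03

train_rows = round(ROWS * TRAIN_FRACTION)
steps_per_epoch = math.ceil(train_rows / EFFECTIVE_BATCH)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_FRACTION)


def lr_at(step):
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(train_rows, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run has about 3,420 training rows, 214 optimizer steps per epoch, 642 steps in total, and 20 warmup steps.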

Vision and audio towers are present in the unified Gemma 4 checkpoint but were frozen during text-only training.
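
The loss masking row in the training table (user and CoT tokens masked, loss only on the `call` JSON) can be illustrated with a minimal sketch. It assumes the common convention of setting ignored label positions to -100, and it assumes the `call` marker itself is masked along with everything before it; the card states neither detail. `find_subsequence` and `mask_before_call` are hypothetical helper names, and the token ids are toy values rather than real tokenizer output.

```python
IGNORE_INDEX = -100  # label value that PyTorch cross-entropy skips by default


def find_subsequence(sequence, pattern):
    """Return the start index of the last occurrence of pattern, or -1."""
    for start in range(len(sequence) - len(pattern), -1, -1):
        if sequence[start:start + len(pattern)] == pattern:
            return start
    return -1


def mask_before_call(input_ids, call_marker_ids):
    """Build labels that ignore every token up to and including the call marker."""
    start = find_subsequence(input_ids, call_marker_ids)
    if start < 0:
        # No call marker found: this example contributes nothing to the loss.
        return [IGNORE_INDEX] * len(input_ids)
    cut = start + len(call_marker_ids)
    return [IGNORE_INDEX] * cut + list(input_ids[cut:])


# Toy ids: 7 stands in for the tokenized "call" marker.
ids = [1, 2, 3, 7, 10, 11, 12]
labels = mask_before_call(ids, [7])
print(labels)
```

Searching for the last occurrence is a design choice here: the user context may itself contain the word "call", and only the final marker starts the block to train on.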

Evaluation

Batch evaluation on 50 held-out Fable-5 samples (seed=42, max_new_tokens=1024, temperature=0.2):

Metric	Result
Tool name accuracy	56%
`call` block emitted	96%
Parseable tool JSON	94%

These numbers are indicative only and do not meet production reliability thresholds.
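
To make "indicative only" concrete, the sketch below computes a Wilson score interval for the tool name accuracy, treating the 50 samples as independent trials (an assumption). With 50 samples, 56% corresponds to 28 correct tool names. `wilson_interval` is an illustrative helper, not part of the evaluation code behind the table.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 56% of 50 samples is 28 correct tool names.
low, high = wilson_interval(28, 50)
print(f"tool name accuracy: 56% (95% interval {low:.0%} to {high:.0%})")
```

The interval spans roughly 42% to 69%, so the point estimate alone says little about how the model would behave on a larger or different sample.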

Recommended inference settings:

Parameter	Value
`max_new_tokens`	1024
`temperature`	0.2
`do_sample`	`True` (or `False` for greedy decoding when maximum consistency matters)
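
One way to keep these settings in a single place is a pair of keyword-argument dictionaries that can be unpacked into `model.generate(**kwargs)`. This is a minimal sketch: the names `SAMPLING_KWARGS`, `GREEDY_KWARGS`, and `generation_kwargs` are illustrative, and leaving `temperature` out of the greedy variant is a choice made here because it has no effect when sampling is off (some transformers versions may warn if it is set).

```python
# Sampling settings from the table above.
SAMPLING_KWARGS = {
    "max_new_tokens": 1024,
    "do_sample": True,
    "temperature": 0.2,
}

# Greedy variant: sampling is off, so temperature is omitted.
GREEDY_KWARGS = {
    "max_new_tokens": 1024,
    "do_sample": False,
}


def generation_kwargs(greedy=False):
    """Return a fresh copy of the chosen settings for model.generate(**kwargs)."""
    return dict(GREEDY_KWARGS if greedy else SAMPLING_KWARGS)
```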

Prompt format

Each turn is wrapped in Gemma chat tokens, and the model turn follows an explicit `thought` → `call` structure:

<start_of_turn>user
{agent context: tool defs, history, task}<end_of_turn>
<start_of_turn>model
thought
{chain-of-thought reasoning}
call
{'tool': 'Edit', 'input': {'file_path': '...', 'old_string': '...', 'new_string': '...'}}<end_of_turn>
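
A model turn in this format can be split back into its reasoning and its tool call with a small parser. The sketch below is an assumption-laden illustration rather than the parser used for the evaluation: `parse_model_turn` is a hypothetical helper, the last line consisting only of `call` is taken as the separator, and because the example above uses Python-style single quotes, the payload is tried first as JSON and then as a Python literal.

```python
import ast
import json

END_OF_TURN = "<end_of_turn>"


def _parse_payload(payload):
    """Parse the call payload as JSON, falling back to a Python literal."""
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(payload)
        except (ValueError, SyntaxError, TypeError, RecursionError):
            continue
        if isinstance(value, dict):
            return value
    return None


def parse_model_turn(text):
    """Split a model turn into (thought, call_dict).

    call_dict is None when no call block is found or it cannot be parsed.
    """
    lines = text.split(END_OF_TURN, 1)[0].splitlines()
    # The last line that is exactly "call" separates reasoning from payload.
    call_line = None
    for index, line in enumerate(lines):
        if line.strip() == "call":
            call_line = index
    thought_lines = lines if call_line is None else lines[:call_line]
    # Drop the "thought" header if the text includes it.
    if thought_lines and thought_lines[0].strip() == "thought":
        thought_lines = thought_lines[1:]
    thought = "\n".join(thought_lines).strip()
    if call_line is None:
        return thought, None
    payload = "\n".join(lines[call_line + 1:]).strip()
    return thought, _parse_payload(payload)


sample = (
    "thought\nI should edit the file.\ncall\n"
    "{'tool': 'Edit', 'input': {'file_path': 'a.py'}}<end_of_turn>"
)
thought, call = parse_model_turn(sample)
print(thought)
print(call)
```

The `thought` header is optional in this parser, so it also accepts text that begins partway through the reasoning, for example a continuation generated from a prompt that already ends in `thought`.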

At inference, open the model turn with the `thought` marker and let the model continue from there:

# context: the agent context string (tool definitions, history, task)
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    "<start_of_turn>model\nthought\n"
)

Quick start

import torch
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "junwatu/ono-gemma-4-12b-fable5-agent"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

context = "You are a coding agent. List all Python files in the current directory."
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

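inputs = tokenizer(prompt, return_tensors="pt")
# Text-only input: both type-id tensors are all zeros (no image or audio tokens).
inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs["mm_token_type_ids"] = torch.zeros_like(inputs["input_ids"])
# Move every tensor to the device the model was placed on.
inputs = {k: v.to(model.device) for k, v in inputs.items()}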

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens; special tokens are kept so <end_of_turn> stays visible.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(response)

Important: Gemma 4 unified models require `token_type_ids` and `mm_token_type_ids` (all zeros for text-only input), even when vision and audio are not used.

Supported tools (from training data)

Common tool names seen in Fable-5 traces include Bash, Edit, Read, Write, Grep, WebSearch, TaskUpdate, PowerShell, and MCP-prefixed tools. Accuracy varies by tool type.
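
Because accuracy varies by tool, it can help to check a parsed call against the tool names listed above before acting on it. The sketch below is illustrative only: `is_known_tool` and `validate_call` are hypothetical helpers, the call shape follows the `{'tool': ..., 'input': {...}}` example in the prompt format, and the `mcp__` prefix is an assumption about how MCP-prefixed tools are named in the traces.

```python
KNOWN_TOOLS = {
    "Bash", "Edit", "Read", "Write", "Grep",
    "WebSearch", "TaskUpdate", "PowerShell",
}
# Assumption: MCP tools carry an "mcp__" prefix; adjust to match your traces.
MCP_PREFIX = "mcp__"


def is_known_tool(name):
    """True if name is a listed tool or carries the assumed MCP prefix."""
    return isinstance(name, str) and (
        name in KNOWN_TOOLS or name.startswith(MCP_PREFIX)
    )


def validate_call(call):
    """Return (ok, reason) for a parsed call dict."""
    if not isinstance(call, dict):
        return False, "call is not a dict"
    if not is_known_tool(call.get("tool")):
        return False, f"unknown tool: {call.get('tool')!r}"
    if not isinstance(call.get("input"), dict):
        return False, "input is not a dict"
    return True, "ok"


print(validate_call({"tool": "Bash", "input": {"command": "ls"}}))
print(validate_call({"tool": "Delete", "input": {}}))
```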

Limitations

Not for production: this is an experimental checkpoint with ~56% tool name accuracy on a 50-sample eval set, unsuitable for live agent deployment without further work.
Context length: training sequences were truncated to 3,072 tokens, so the model was not fine-tuned on longer contexts.
Sampling matters: a low temperature (0.2) and a sufficient `max_new_tokens` (1024) are important for reliable `call` block generation.
Multimodal weights are included but unused: only the text LM weights were fine-tuned.
Single trace style: the model was trained only on Fable-5 agent traces and may not generalize to other tool schemas without further fine-tuning.

License

Built on google/gemma-4-12B-it. Use is subject to the Gemma license terms. Fable-5 dataset: Glint-Research/Fable-5-traces.

Safetensors

Model size	12B params
Tensor type	BF16

Model tree for junwatu/ono-gemma-4-12b-fable5-agent

Base model	google/gemma-4-12B
Finetuned	google/gemma-4-12B-it