ono-gemma-4-12b-fable5-agent
This model is not for production use. It is an experimental research checkpoint for exploration and evaluation only. Do not deploy it in live agent systems without additional training, guardrails, and validation.
Gemma 4 12B IT, fully fine-tuned on Fable-5 agent traces for chain-of-thought reasoning and tool calling. The model emits a `thought` block of reasoning followed by a structured `call` block containing the tool name and JSON arguments, matching the Fable-5 trace format used by coding agents.
| Item | Value |
|---|---|
| Base | google/gemma-4-12B-it |
| Method | Full fine-tune (text LM weights, not LoRA) |
| Visibility | Private |
Training
| Item | Value |
|---|---|
| Dataset | tool_use rows only (~3,600), CoT capped at 1,200 chars |
| Train / val split | 95% / 5% (seed=42) |
| Epochs | 3 |
| Learning rate | 1e-5 (cosine, 3% warmup) |
| Effective batch size | 16 (batch 1 × grad accum 16) |
| Max sequence length | 3,072 tokens |
| Loss masking | User + CoT masked → train only on call JSON |
| Optimizer | AdamW 8-bit |
| GPU | NVIDIA H200 on Modal |
| Train loss | 0.937 |
| Eval loss | 0.400 |
| Training time | ~3h 48m |
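The schedule implied by the table can be estimated with a little arithmetic. This is a rough sketch, not the author's training script: it assumes the ~3,600 rows are the total before the 95% / 5% split, that a partial final batch still counts as an optimizer step, and that warmup steps are rounded down.

```python
import math

rows = 3600                  # approximate dataset size from the table
train_frac = 0.95            # 95% / 5% train / val split
effective_batch = 1 * 16     # per-device batch x gradient accumulation
epochs = 3
warmup_ratio = 0.03          # 3% warmup

train_rows = round(rows * train_frac)
steps_per_epoch = math.ceil(train_rows / effective_batch)  # assumption: partial batch kept
total_steps = steps_per_epoch * epochs
warmup_steps = int(total_steps * warmup_ratio)             # assumption: rounded down

print(train_rows, steps_per_epoch, total_steps, warmup_steps)
```

Under these assumptions the run is on the order of 640 optimizer steps, with about 19 of them spent in warmup.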
Vision and audio towers are present in the unified Gemma 4 checkpoint but were frozen during text-only training.
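The loss-masking row in the training table can be illustrated with a small sketch. It assumes the common convention that label `-100` is ignored by the loss (the default `ignore_index` of PyTorch cross-entropy). The helper `mask_labels` and the token ids are hypothetical, and whether the `call` marker itself was supervised is not stated in the card.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch cross-entropy by default


def mask_labels(input_ids, call_start):
    """Return labels that supervise only the tokens from call_start onward."""
    if not 0 <= call_start <= len(input_ids):
        raise ValueError("call_start out of range")
    return [IGNORE_INDEX] * call_start + list(input_ids[call_start:])


# Made-up token ids: 4 user tokens, 3 thought tokens, 3 call tokens.
input_ids = [11, 12, 13, 14, 21, 22, 23, 31, 32, 33]
labels = mask_labels(input_ids, call_start=7)
print(labels)
```

With labels built this way, the user context and the reasoning still condition the prediction, but only the call tokens contribute to the gradient.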
Evaluation
Batch evaluation on 50 held-out Fable-5 samples (seed=42, max_new_tokens=1024, temperature=0.2):
| Metric | Result |
|---|---|
| Tool name accuracy | 56% |
| `call` block emitted | 96% |
| Parseable tool JSON | 94% |
These numbers are indicative only and do not meet production reliability thresholds.
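As a rough illustration of how the three metrics relate, the sketch below scores a handful of made-up records. It is not the author's evaluation script. The helpers `parse_call` and `score` are hypothetical, and treating a Python-literal dict (single quotes, as in the prompt format example) as "parseable" alongside strict JSON is an assumption.

```python
import ast
import json


def parse_call(call_text):
    """Parse a call payload as JSON, falling back to a Python literal."""
    for loader in (json.loads, ast.literal_eval):
        try:
            value = loader(call_text)
        except (ValueError, SyntaxError, TypeError):
            continue
        if isinstance(value, dict):
            return value
    return None


def score(records):
    """records: list of (call_text or None, expected_tool) pairs."""
    n = len(records)
    parsed = [(parse_call(c) if c is not None else None, t) for c, t in records]
    emitted = sum(1 for c, _ in records if c is not None)
    parseable = sum(1 for call, _ in parsed if call is not None)
    correct = sum(1 for call, t in parsed if call is not None and call.get("tool") == t)
    return {
        "call_block_emitted": emitted / n,
        "parseable_tool_json": parseable / n,
        "tool_name_accuracy": correct / n,
    }


# Made-up records, not real evaluation data.
records = [
    ("{'tool': 'Bash', 'input': {'command': 'ls'}}", "Bash"),       # correct tool
    ('{"tool": "Read", "input": {"file_path": "a.py"}}', "Edit"),   # wrong tool
    ("{'tool': 'Grep', 'input': ", "Grep"),                         # truncated payload
    (None, "Write"),                                                # no call block
]
print(score(records))
```

Each metric is computed over all samples, so tool name accuracy can never exceed the parseable rate, which in turn can never exceed the emission rate.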
Recommended inference settings:
| Parameter | Value |
|---|---|
| `max_new_tokens` | 1024 |
| `temperature` | 0.2 |
| `do_sample` | `true` (or greedy decoding for maximum consistency) |
Prompt format
Each turn follows Gemma chat tokens with an explicit thought → call structure:
<start_of_turn>user
{agent context: tool defs, history, task}<end_of_turn>
<start_of_turn>model
thought
{chain-of-thought reasoning}
call
{'tool': 'Edit', 'input': {'file_path': '...', 'old_string': '...', 'new_string': '...'}}<end_of_turn>
At inference, start the model turn and let it generate from thought:
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)
Quick start
import torch
from transformers import AutoModelForMultimodalLM, AutoTokenizer

model_id = "junwatu/ono-gemma-4-12b-fable5-agent"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMultimodalLM.from_pretrained(
    model_id,
    dtype=torch.bfloat16,
    device_map="auto",
)
model.eval()

context = "You are a coding agent. List all Python files in the current directory."
# Open the model turn at "thought" so generation continues with the reasoning.
prompt = (
    f"<start_of_turn>user\n{context}<end_of_turn>\n"
    f"<start_of_turn>model\nthought\n"
)

inputs = tokenizer(prompt, return_tensors="pt")
# Text-only input: both token-type tensors are all zeros.
inputs["token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs["mm_token_type_ids"] = torch.zeros_like(inputs["input_ids"])
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    output_ids = model.generate(
        **inputs,
        max_new_tokens=1024,
        temperature=0.2,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens, keeping special tokens visible.
response = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=False,
)
print(response)
Important: Gemma 4 unified models require `token_type_ids` and `mm_token_type_ids` (all zeros for text-only input) even when not using vision or audio.
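Because the prompt ends at `thought`, the decoded response starts with the reasoning and, if all goes well, continues with a `call` line and the payload. The sketch below separates the two parts. The helper `split_response` is hypothetical, the sample response is made up, and it assumes the marker appears as a line containing only `call`, as in the prompt format.

```python
import re

# A line that contains only the word "call" (optionally followed by spaces or tabs).
CALL_MARKER = re.compile(r"^call[ \t]*$", re.MULTILINE)


def split_response(response):
    """Split generated text into (thought, call_text); call_text is None if absent."""
    text = response.split("<end_of_turn>", 1)[0]  # drop the end-of-turn token and anything after
    match = CALL_MARKER.search(text)
    if match is None:
        return text.strip(), None
    return text[:match.start()].strip(), text[match.end():].strip()


# Made-up response in the documented format.
sample = (
    "I should list the files with a shell command.\n"
    "call\n"
    "{'tool': 'Bash', 'input': {'command': 'ls *.py'}}<end_of_turn>"
)
thought, call_text = split_response(sample)
print(thought)
print(call_text)
```

The first matching line is used, so a reasoning trace that happens to contain a bare `call` line would split early; a stricter parser may be needed in practice.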
Supported tools (from training data)
Common tool names seen in Fable-5 traces include Bash, Edit, Read, Write, Grep, WebSearch, TaskUpdate, PowerShell, and MCP-prefixed tools. Accuracy varies by tool type.
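Given the modest tool name accuracy, it may help to check a parsed call against an allowlist before acting on it. The following is a minimal sketch: `check_call` is a hypothetical helper, the `tool` and `input` keys follow the prompt format example, and MCP-prefixed tools are left out because the card does not specify their exact naming.

```python
# Tool names listed in this card; extend the set for MCP tools as needed.
KNOWN_TOOLS = frozenset(
    {"Bash", "Edit", "Read", "Write", "Grep", "WebSearch", "TaskUpdate", "PowerShell"}
)


def check_call(call, allowed=KNOWN_TOOLS):
    """Return None if the call looks usable, otherwise a short reason string."""
    if not isinstance(call, dict):
        return "call is not a dict"
    tool = call.get("tool")
    if not isinstance(tool, str) or tool not in allowed:
        return f"unknown tool: {tool!r}"
    if not isinstance(call.get("input"), dict):
        return "input is not a dict"
    return None


print(check_call({"tool": "Bash", "input": {"command": "ls"}}))
print(check_call({"tool": "Delete", "input": {}}))
```

A caller would execute the tool only when `check_call` returns `None`, and otherwise log the reason or re-prompt the model.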
Limitations
- Not for production — experimental checkpoint with ~56% tool accuracy on a small eval set; unsuitable for live agent deployment without further work.
- Long contexts are truncated to 3,072 tokens during training.
- Sampling matters: low temperature (0.2) and a sufficient `max_new_tokens` (1024) are important for reliable `call` block generation.
- Multimodal weights are included but unused; only text LM weights were fine-tuned.
- Trained on a single agent trace style (Fable-5); may not generalize to other tool schemas without further fine-tuning.
License
Built on google/gemma-4-12B-it. Use is subject to the Gemma license terms. Fable-5 dataset: Glint-Research/Fable-5-traces.