ToolGrad 12B

ToolGrad 12B is a fine-tuned version of google/gemma-3-12b-it optimized for function calling and tool-use tasks. It is trained on the dataset generated using the method described in our paper ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients" (ACL 2026 Finding). The codebase is available at our GitHub Repository.

Model Details

Developed by: Zhongyi Zhou
Model Type: Causal Language Model
Base Model: google/gemma-3-12b-it
License: gemma-terms-of-use

Intended Use

Single-turn tool-use tasks.

Evaluation Results

Evaluated on the Berkeley Function Calling Leaderboard (BFCL) v1 & v2:

Overall & Hallucination Scores

Model	Non-live	Live	Halluc.
Model	Overall	Overall	Rel.	Irrel.
Gemma-3 12B	79.44%	74.24%	70.29%	93.75%
ToolGrad 12B	87.81% ↑	78.46% ↑	93.75%	59.27%

Detailed Category Scores

Model	Non-live				Live
Model	Simple	Multi	Par	MultiPar	Simple	Multi	Par	MultiPar
Gemma-3 12B	76.25%	94.00%	91.00%	56.50%	85.66%	71.89%	87.50%	45.83%
ToolGrad 12B	75.25% ↓	94.00% ↑	93.50% ↑	88.50% ↑	85.66% ↑	77.11% ↑	75.00% ↓	62.50% ↑

How to Get Started

You can load this model using the transformers library:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "zhongyi-zhou/toolgrad-12b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Citation

If you find this work helpful, please cite our paper:

@misc{zhou2026toolgradefficienttoolusedataset,
      title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"}, 
      author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
      year={2026},
      eprint={2508.04086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04086}, 
}