ToolGrad
Collection
[ACL 2026 Finding] ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients". GitHub Repo: https://github.com/zhongyi-zhou/toolgrad • 4 items • Updated • 1
ToolGrad 4B is a fine-tuned version of google/gemma-3-4b-it optimized for function calling and tool-use tasks. It is trained on the dataset generated using the method described in our paper ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients" (ACL 2026 Finding). The codebase is available at our GitHub Repository.
Single-turn tool-use tasks.
Evaluated on the Berkeley Function Calling Leaderboard (BFCL) v1 & v2:
| Model | Non-live | Live | Halluc. | |
|---|---|---|---|---|
| Overall | Overall | Rel. | Irrel. | |
| Gemma-3 4B | 61.12% | 60.84% | 53.94% | 100.00% |
| ToolGrad 4B | 72.46% ↑ | 65.58% ↑ | 93.75% | 59.27% |
| Model | Non-live | Live | ||||||
|---|---|---|---|---|---|---|---|---|
| Simple | Multi | Par | MultiPar | Simple | Multi | Par | MultiPar | |
| Gemma-3 4B | 64.50% | 88.00% | 56.00% | 36.00% | 70.93% | 59.35% | 25.00% | 41.67% |
| ToolGrad 4B | 65.33% ↑ | 86.50% ↓ | 73.00% ↑ | 65.00% ↑ | 71.32% ↑ | 64.86% ↑ | 43.75% ↑ | 50.00% ↑ |
You can load this model using the transformers library:
import torch
from transformers import AutoModelForCausalLM, AutoProcessor
model_id = "zhongyi-zhou/toolgrad-4b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16,
device_map="auto"
)
If you find this work helpful, please cite our paper:
@misc{zhou2026toolgradefficienttoolusedataset,
title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"},
author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
year={2026},
eprint={2508.04086},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.04086},
}