ToolGrad 4B

ToolGrad 4B is a fine-tuned version of google/gemma-3-4b-it optimized for function calling and tool-use tasks. It is trained on the dataset generated using the method described in our paper ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients" (ACL 2026 Finding). The codebase is available at our GitHub Repository.

Model Details

Developed by: Zhongyi Zhou
Model Type: Causal Language Model
Base Model: google/gemma-3-4b-it
License: gemma-terms-of-use

Intended Use

Single-turn tool-use tasks.

Evaluation Results

Evaluated on the Berkeley Function Calling Leaderboard (BFCL) v1 & v2:

Overall & Hallucination Scores

Model	Non-live	Live	Halluc.
Model	Overall	Overall	Rel.	Irrel.
Gemma-3 4B	61.12%	60.84%	53.94%	100.00%
ToolGrad 4B	72.46% ↑	65.58% ↑	93.75%	59.27%

Detailed Category Scores

Model	Non-live				Live
Model	Simple	Multi	Par	MultiPar	Simple	Multi	Par	MultiPar
Gemma-3 4B	64.50%	88.00%	56.00%	36.00%	70.93%	59.35%	25.00%	41.67%
ToolGrad 4B	65.33% ↑	86.50% ↓	73.00% ↑	65.00% ↑	71.32% ↑	64.86% ↑	43.75% ↑	50.00% ↑

How to Get Started

You can load this model using the transformers library:

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "zhongyi-zhou/toolgrad-4b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

Citation

If you find this work helpful, please cite our paper:

@misc{zhou2026toolgradefficienttoolusedataset,
      title={ToolGrad: Efficient Tool-use Dataset Generation with Textual "Gradients"}, 
      author={Zhongyi Zhou and Kohei Uehara and Haoyu Zhang and Jingtao Zhou and Lin Gu and Ruofei Du and Zheng Xu and Tatsuya Harada},
      year={2026},
      eprint={2508.04086},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2508.04086}, 
}