Gemma 4 E4B Fine-Tuned for Tool Calling — 95% accuracy, runs anywhere

#1
by roshangrewal - opened

Released gemma4-e4b-toolcall-v02 — a production-grade tool-calling model built on Gemma 4 E4B-it (4B params).

Highlights

  • 95% on multi-tool selection (BFCL benchmark)
  • 90% on parallel function calling
  • 88.5% on simple function calling (BFCL official)
  • Works with vLLM, Ollama, transformers, llama.cpp
  • OpenAI-compatible API out of the box
  • Apache 2.0 — fully commercial use

Quick Start

(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339]
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] █ █ █▄ ▄█
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.23.0
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] █▄█▀ █ █ █ █ model roshangrewal/gemma4-e4b-toolcall-v02
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339]
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:273] non-default args: {'model_tag': 'roshangrewal/gemma4-e4b-toolcall-v02', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'roshangrewal/gemma4-e4b-toolcall-v02'}

What it does

Given available tools and a user query, the model:

  • Selects the correct tool from 12+ options (93%)
  • Extracts complex parameters from natural language (100%)
  • Knows when NOT to call a tool and responds directly (87.5%)
  • Handles multi-turn tool chains
  • Retains full conversational ability — tool-calling added on top, nothing removed

Available formats

Format Link Use case
Full model gemma4-e4b-toolcall-v02 vLLM, transformers
LoRA adapter gemma4-e4b-toolcall-v02-lora Lightweight, further fine-tuning
GGUF Q8 gemma4-e4b-toolcall-v02-gguf Ollama, llama.cpp, LM Studio

Training

  • Method: QLoRA (r=64) with Unsloth, 5000 steps
  • Data: 78K examples from NVIDIA Nemotron-SFT-Agentic-v2 + Glaive function-calling
  • Hardware: Single NVIDIA A100 80GB GPU, ~35 hours
  • Evaluation: 1000-query test dataset included in the repo for reproducibility

BFCL Submission

PR submitted to Berkeley Function Calling Leaderboard: gorilla#1344


Feedback welcome! The model card has full details including what didn't work (DPO, gradient issues with PEFT) and how we solved them.

Sign up or log in to comment