Gemma 4 E4B Fine-Tuned for Tool Calling — 95% accuracy, runs anywhere

by roshangrewal - opened 4 days ago

Discussion

roshangrewal

Owner 4 days ago

Released gemma4-e4b-toolcall-v02 — a production-grade tool-calling model built on Gemma 4 E4B-it (4B params).

Highlights

95% on multi-tool selection (BFCL benchmark)
90% on parallel function calling
88.5% on simple function calling (BFCL official)
Works with vLLM, Ollama, transformers, llama.cpp
OpenAI-compatible API out of the box
Apache 2.0 — fully commercial use

Quick Start

(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339]
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] █ █ █▄ ▄█
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] ▄▄ ▄█ █ █ █ ▀▄▀ █ version 0.23.0
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] █▄█▀ █ █ █ █ model roshangrewal/gemma4-e4b-toolcall-v02
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339] ▀▀ ▀▀▀▀▀ ▀▀▀▀▀ ▀ ▀
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:339]
(APIServer pid=542852) INFO 06-16 21:56:39 [api_utils.py:273] non-default args: {'model_tag': 'roshangrewal/gemma4-e4b-toolcall-v02', 'enable_auto_tool_choice': True, 'tool_call_parser': 'gemma4', 'model': 'roshangrewal/gemma4-e4b-toolcall-v02'}

What it does

Given available tools and a user query, the model:

Selects the correct tool from 12+ options (93%)
Extracts complex parameters from natural language (100%)
Knows when NOT to call a tool and responds directly (87.5%)
Handles multi-turn tool chains
Retains full conversational ability — tool-calling added on top, nothing removed

Available formats

Format	Link	Use case
Full model	gemma4-e4b-toolcall-v02	vLLM, transformers
LoRA adapter	gemma4-e4b-toolcall-v02-lora	Lightweight, further fine-tuning
GGUF Q8	gemma4-e4b-toolcall-v02-gguf	Ollama, llama.cpp, LM Studio

Training

Method: QLoRA (r=64) with Unsloth, 5000 steps
Data: 78K examples from NVIDIA Nemotron-SFT-Agentic-v2 + Glaive function-calling
Hardware: Single NVIDIA A100 80GB GPU, ~35 hours
Evaluation: 1000-query test dataset included in the repo for reproducibility

BFCL Submission

PR submitted to Berkeley Function Calling Leaderboard: gorilla#1344

Feedback welcome! The model card has full details including what didn't work (DPO, gradient issues with PEFT) and how we solved them.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment