ToolForge: Qwen2.5-7B Tool Router (QLoRA r=64)

A QLoRA adapter that turns Qwen2.5-7B-Instruct into a fast, specialized tool-routing model: given a user query, it decides which of 9 tools to call (or to answer directly) and emits a structured tool call.

It replaces brittle regex/heuristic routers in agent pipelines with a small, self-hostable learned router.


What it does

Routes a query to one (or several) of these tools, or to a direct answer:

web_search, calculator, weather, wikipedia, datetime, dictionary, translate, unit_converter, web_reader, plus no_tool (answer directly) and multi_tool (chained calls).

Output format:

<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>
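A consumer of this format needs to pull the JSON list out of the wrapper. The following is a minimal sketch of one way to do that, not code shipped with the adapter; the function name `parse_tool_calls` and the choice to treat malformed output as "no call" are assumptions made for illustration.

```python
import json
import re

# Matches the wrapper shown above; DOTALL lets the JSON span several lines.
TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def parse_tool_calls(text):
    """Return the list of tool calls found in `text`, or [] for a direct answer."""
    match = TOOL_CALLS_RE.search(text)
    if match is None:
        return []  # no wrapper: treat the output as a direct (no_tool) answer
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # assumption: malformed JSON is treated as "no call"
    if not isinstance(calls, list):
        return []
    # Keep only well-formed entries: a string name and a dict of arguments.
    return [
        c for c in calls
        if isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and isinstance(c.get("arguments", {}), dict)
    ]


sample = '<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>'
print(parse_tool_calls(sample))
# -> [{'name': 'weather', 'arguments': {'location': 'Tokyo'}}]
```

Returning an empty list for both "no wrapper" and "unparseable wrapper" keeps the caller simple; a stricter pipeline might prefer to raise on malformed JSON so that formatting failures are visible.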

Evaluation (honest, non-circular)

Measured on a hand-written, non-circular test set (36 realistic, indirectly phrased queries, hand-labeled, with no teacher model involved), comparing the base model against this adapter on identical inputs. Grading is format-agnostic: a prediction counts if the correct tool is identified in any recognizable format, so the base model isn't penalized for not using the trained format.

| Model | Routing accuracy | Strict-format accuracy |
|---|---|---|
| Base Qwen2.5-7B-Instruct | 75.0% | 75.0% |
| ToolForge (this adapter) | 83.3% | 83.3% |
| Gain from fine-tuning | +8.3 pp | +8.3 pp |

Key point: strict and lenient scores are identical for both models. Base Qwen already emits parseable tool-call formats, so the improvement comes from better routing decisions, not output formatting. Gains concentrate on disambiguating web_search vs wikipedia, unit_converter vs calculator, and multi-tool selection.

A separate ablation on a held-out split of the (teacher-labeled) synthetic data reports ~86%, but that number is partly circular and is best read as an internal hyperparameter comparison. The table above is the unbiased estimate, though with only 36 queries the +8.3 pp gain corresponds to three additional correct predictions (30 vs. 27).
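The difference between format-agnostic ("routing") and strict-format grading can be sketched as below. This is an illustration of the idea, not the evaluation script behind the table: the helper names, the regular expressions, and the three toy outputs are all assumptions. The second toy output deliberately uses a different wrapper so the two scores diverge, unlike the reported results where they coincide.

```python
import json
import re

TOOLS = {"web_search", "calculator", "weather", "wikipedia", "datetime",
         "dictionary", "translate", "unit_converter", "web_reader"}


def strict_tools(output):
    """Tool names found inside the trained <tool_calls>[...]</tool_calls> wrapper."""
    m = re.search(r"<tool_calls>(.*?)</tool_calls>", output, re.DOTALL)
    if not m:
        return set()
    try:
        calls = json.loads(m.group(1))
    except json.JSONDecodeError:
        return set()
    if not isinstance(calls, list):
        return set()
    names = (c.get("name") for c in calls if isinstance(c, dict))
    return {n for n in names if isinstance(n, str) and n in TOOLS}


def lenient_tools(output):
    """Known tool names given as a "name"/"tool"/"function" value anywhere in the output."""
    found = re.findall(r'"(?:name|tool|function)"\s*:\s*"([a-z_]+)"', output)
    return {name for name in found if name in TOOLS}


def accuracy(outputs, expected, extract):
    """Fraction of outputs whose extracted tool set equals the labeled set."""
    hits = sum(extract(o) == set(e) for o, e in zip(outputs, expected))
    return hits / len(expected)


# Toy data for illustration only; an empty label means "answer directly".
outputs = [
    '<tool_calls>[{"name": "weather", "arguments": {"location": "Oslo"}}]</tool_calls>',
    '<tool_call>\n{"name": "calculator", "arguments": {"expression": "17*23"}}\n</tool_call>',
    "Hello! How can I help?",
]
expected = [["weather"], ["calculator"], []]

strict = accuracy(outputs, expected, strict_tools)
lenient = accuracy(outputs, expected, lenient_tools)
print(f"strict={strict:.2f} lenient={lenient:.2f}")
# -> strict=0.67 lenient=1.00
```

Comparing sets rather than single names lets the same function grade multi-tool queries and no-tool queries without special cases.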


Limitations

  • Fixed tool set. This is a specialist router for the 9 tools above. It does not generalize to arbitrary, prompt-supplied function schemas the way a general function-calling model does. Adding a tool requires retraining. The tradeoff is intentional: a small, cheap, self-hostable router for a known tool set, instead of a large general model on every call.
  • Over-triggering on chit-chat. Fine-tuning slightly increases the tendency to call a tool on no-tool conversational queries (e.g. "what is 2 plus 2"); this is a precision/recall tradeoff.
  • Trained on synthetic data (template-generated + Gemini-distilled), English only.

How to use

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "Ayush0110/toolforge-qwen7b-r64")
model.eval()

SYS = ("You are a tool-routing assistant. Given a user query, decide which tool(s) "
       "to call and with what arguments. If no tool is needed, respond directly. "
       "You have access to: web_search, calculator, weather, wikipedia, datetime, "
       "dictionary, translate, unit_converter, web_reader. "
       'Output tool calls as: <tool_calls>[{"name": "tool", "arguments": {...}}]</tool_calls>')

msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "is it jacket weather in Copenhagen right now"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# -> <tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>
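Once the generated text has been parsed into a list of call dictionaries, an agent pipeline still has to execute them. The sketch below shows one possible dispatcher that also enforces the fixed tool set. The `dispatch` function, the stub handlers, and the `expression` argument name are hypothetical; the model card only documents the `location` argument for `weather`.

```python
KNOWN_TOOLS = {"web_search", "calculator", "weather", "wikipedia", "datetime",
               "dictionary", "translate", "unit_converter", "web_reader"}


def dispatch(calls, handlers):
    """Run each parsed call in order and return a list of (name, result) pairs."""
    results = []
    for call in calls:
        name = call["name"]
        if name not in KNOWN_TOOLS:
            # The router is trained on a fixed tool set, so anything else is an error.
            raise ValueError(f"router emitted unknown tool: {name!r}")
        if name not in handlers:
            raise KeyError(f"no handler registered for {name!r}")
        results.append((name, handlers[name](**call.get("arguments", {}))))
    return results


# Hypothetical stand-in handlers; real ones would call actual services.
handlers = {
    "weather": lambda location: f"(stub) forecast for {location}",
    "calculator": lambda expression: f"(stub) result of {expression}",
}

calls = [{"name": "weather", "arguments": {"location": "Copenhagen"}}]
print(dispatch(calls, handlers))
# -> [('weather', '(stub) forecast for Copenhagen')]
```

An empty `calls` list yields an empty result, which a caller can treat as the no_tool case and pass the model's text through as the answer.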

Training details

| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-7B-Instruct |
| Quantization | 4-bit NF4 + double quant |
| LoRA | r=64, α=128, dropout=0.05, targets: q,k,v,o,gate,up,down |
| Optimizer / LR | AdamW, 2e-4 cosine, 10% warmup |
| Batch | 4 × 4 grad-accum = 16 effective |
| Epochs | 3 (best at eval_loss ≈ 0.14) |
| Data | 1,173 examples (template-generated + Gemini-2.5-flash distilled) |
| Hardware | single T4 (16GB), ~2.4 h |
| Tracking | Weights & Biases |
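For readers who want to reproduce a similar setup, the quantization and LoRA settings above might be expressed with `transformers` and `peft` roughly as follows. This is a configuration sketch under assumptions, not the author's training script: the float16 compute dtype is a guess based on the T4 hardware, and the `*_proj` module names are the usual Qwen2 layer names corresponding to the listed targets.

```python
import torch
from peft import LoraConfig
from transformers import BitsAndBytesConfig

# 4-bit NF4 with double quantization, as listed under "Quantization".
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.float16,  # assumption: T4 lacks native bfloat16
)

# r, alpha, dropout, and targets as listed under "LoRA".
lora_config = LoraConfig(
    r=64,
    lora_alpha=128,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)
```

`bnb_config` would be passed as `quantization_config` to `AutoModelForCausalLM.from_pretrained`, and `lora_config` to `peft.get_peft_model`; optimizer, schedule, and batch settings belong to the trainer and are not shown here.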

License

Apache-2.0 (inherits from the Qwen2.5 base model).
