ToolForge: Qwen2.5-7B Tool Router (QLoRA r=64)
A QLoRA adapter that turns Qwen2.5-7B-Instruct into a fast, specialized tool-routing model: given a user query, it decides which of 9 tools to call (or to answer directly) and emits a structured tool call.
It replaces brittle regex/heuristic routers in agent pipelines with a small, self-hostable learned router.
- Base model: Qwen/Qwen2.5-7B-Instruct
- Method: QLoRA (4-bit NF4, double quant), LoRA r=64 / α=128, all attention and MLP projections
- Code & full write-up: github.com/ayushh0110/toolforge
- Blog: From Heuristics to Fine-Tuning
What it does
Routes a query to one (or several) of these tools, or to a direct answer:
web_search, calculator, weather, wikipedia, datetime,
dictionary, translate, unit_converter, web_reader
plus no_tool (answer directly) and multi_tool (chained calls).
Output format:
<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>
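A caller has to pull the JSON payload back out of the tags before it can act on it. The following is a minimal sketch of one way to do that with the standard library; `parse_tool_calls` is a hypothetical helper name, and treating untagged or malformed output as "no call" is an assumption rather than documented behavior of this adapter.

```python
import json
import re

# Matches the payload between the <tool_calls> tags shown above.
_TOOL_CALLS_RE = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)


def parse_tool_calls(text):
    """Return the list of tool calls found in `text`, or [] for a direct answer."""
    match = _TOOL_CALLS_RE.search(text)
    if match is None:
        return []  # no tags: treat the output as a direct (no_tool) answer
    try:
        calls = json.loads(match.group(1))
    except json.JSONDecodeError:
        return []  # assumption: malformed JSON is handled as "no call"
    if not isinstance(calls, list):
        return []
    # Keep only well-formed entries: a string name and a dict of arguments.
    return [
        c for c in calls
        if isinstance(c, dict)
        and isinstance(c.get("name"), str)
        and isinstance(c.get("arguments", {}), dict)
    ]


example = '<tool_calls>[{"name": "weather", "arguments": {"location": "Tokyo"}}]</tool_calls>'
print(parse_tool_calls(example))
# -> [{'name': 'weather', 'arguments': {'location': 'Tokyo'}}]
```

A stricter caller may prefer to raise on malformed JSON instead of silently returning an empty list.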
Evaluation (honest, non-circular)
Measured on a hand-written, non-circular test set (36 realistic, indirectly phrased queries, hand-labeled, with no teacher model involved), comparing the base model against this adapter on identical inputs. Grading is format-agnostic: a prediction counts if the correct tool is identified in any recognizable format, so the base model isn't penalized for not using the trained format.
| Model | Routing accuracy | Strict-format accuracy |
|---|---|---|
| Base Qwen2.5-7B-Instruct | 75.0% | 75.0% |
| ToolForge (this adapter) | 83.3% | 83.3% |
| Gain from fine-tuning | +8.3 pp | +8.3 pp |
Key point: the lenient (format-agnostic) routing scores and the strict-format scores are identical for both models. Base Qwen already emits parseable tool-call formats, so the improvement comes from better routing decisions, not output formatting. Gains concentrate on disambiguating web_search vs wikipedia, unit_converter vs calculator, and multi-tool selection.
A separate ablation on a held-out split of the (teacher-labeled) synthetic data reports ~86%, but that number is partly circular and is best read as an internal hyperparameter comparison. The table above is the unbiased estimate.
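To make the strict versus format-agnostic distinction concrete, here is a rough sketch of how two such graders could differ. It is not the project's evaluation code (that lives in the linked repository); `strict_label`, `lenient_label`, the label scheme, and the `expression` argument name are all illustrative assumptions. The last line only checks that the reported percentages are consistent with 27 and 30 correct answers out of 36, counts that are inferred here, not reported in the card.

```python
import json
import re

TOOLS = ["web_search", "calculator", "weather", "wikipedia", "datetime",
         "dictionary", "translate", "unit_converter", "web_reader"]

TAGGED = re.compile(r"<tool_calls>(.*?)</tool_calls>", re.DOTALL)
NAME_FIELD = re.compile(r'"name"\s*:\s*"([A-Za-z_]+)"')


def strict_label(output):
    """Label from the trained <tool_calls> format only; None if the payload is invalid."""
    m = TAGGED.search(output)
    if m is None:
        return "no_tool"  # untagged text counts as a direct answer
    try:
        calls = json.loads(m.group(1))
    except json.JSONDecodeError:
        return None
    if not isinstance(calls, list) or not calls:
        return None
    names = [c.get("name") if isinstance(c, dict) else None for c in calls]
    if any(n not in TOOLS for n in names):
        return None
    return names[0] if len(names) == 1 else "multi_tool"


def lenient_label(output):
    """Label from any "name": "<tool>" field, with or without the tags."""
    names = [n for n in NAME_FIELD.findall(output) if n in TOOLS]
    if not names:
        return "no_tool"
    return names[0] if len(names) == 1 else "multi_tool"


def accuracy(predicted, gold):
    """Fraction of predictions that equal the gold label."""
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


# Toy outputs (made up for illustration, not taken from the test set).
outputs = [
    '<tool_calls>[{"name": "weather", "arguments": {"location": "Oslo"}}]</tool_calls>',
    '{"name": "calculator", "arguments": {"expression": "17 * 23"}}',  # untagged JSON
    "Sure, happy to chat!",
]
gold = ["weather", "calculator", "no_tool"]

print(accuracy([lenient_label(o) for o in outputs], gold))  # 1.0
print(accuracy([strict_label(o) for o in outputs], gold))   # 2 of 3 correct
print(round(27 / 36 * 100, 1), round(30 / 36 * 100, 1))     # 75.0 83.3
```

In this toy set the two graders disagree on the untagged JSON output. The table reports no such gap for either model, which is what motivates the "better routing, not better formatting" reading.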
Limitations
- Fixed tool set. This is a specialist router for the 9 tools above. It does not generalize to arbitrary, prompt-supplied function schemas the way a general function-calling model does. Adding a tool requires retraining. The tradeoff is intentional: a small, cheap, self-hostable router for a known tool set, instead of a large general model on every call.
- Over-triggering on chit-chat. Fine-tuning slightly increases the tendency to call a tool on no-tool conversational queries (e.g. "what is 2 plus 2"), a precision/recall tradeoff.
- Trained on synthetic data (template-generated + Gemini-distilled), English only.
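Because the tool set is fixed, a pipeline can check every emitted tool name against the nine known tools before executing anything. The sketch below shows one possible shape for that check plus a small dispatcher. `validate_calls`, `dispatch`, the handler registry, and the `expression` argument are hypothetical names for illustration, not part of this project.

```python
ALLOWED_TOOLS = frozenset({
    "web_search", "calculator", "weather", "wikipedia", "datetime",
    "dictionary", "translate", "unit_converter", "web_reader",
})


def validate_calls(calls):
    """Split parsed calls into (accepted, rejected) by tool name."""
    accepted, rejected = [], []
    for call in calls:
        bucket = accepted if call.get("name") in ALLOWED_TOOLS else rejected
        bucket.append(call)
    return accepted, rejected


def dispatch(calls, handlers):
    """Run each call through its handler, in order; refuse unknown tools."""
    accepted, rejected = validate_calls(calls)
    if rejected:
        raise ValueError(f"unknown tool(s): {[c.get('name') for c in rejected]}")
    return [handlers[c["name"]](**c.get("arguments", {})) for c in accepted]


# Hypothetical stand-in handler; a real pipeline would call an actual tool here.
handlers = {"calculator": lambda expression: f"would evaluate {expression!r}"}
calls = [{"name": "calculator", "arguments": {"expression": "2 + 2"}}]
print(dispatch(calls, handlers))
```

Rejecting unknown names outright is one design choice; logging them and falling back to a direct answer is another reasonable option.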
How to use
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
base_id = "Qwen/Qwen2.5-7B-Instruct"
tok = AutoTokenizer.from_pretrained(base_id)
base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16, device_map="auto")
model = PeftModel.from_pretrained(base, "Ayush0110/toolforge-qwen7b-r64")
model.eval()
SYS = ("You are a tool-routing assistant. Given a user query, decide which tool(s) "
       "to call and with what arguments. If no tool is needed, respond directly. "
       "You have access to: web_search, calculator, weather, wikipedia, datetime, "
       "dictionary, translate, unit_converter, web_reader. "
       'Output tool calls as: <tool_calls>[{"name": "tool", "arguments": {...}}]</tool_calls>')
msgs = [{"role": "system", "content": SYS},
        {"role": "user", "content": "is it jacket weather in Copenhagen right now"}]
text = tok.apply_chat_template(msgs, tokenize=False, add_generation_prompt=True)
inputs = tok(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
# -> <tool_calls>[{"name": "weather", "arguments": {"location": "Copenhagen"}}]</tool_calls>
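The snippet above loads the base model in bfloat16. On a smaller GPU, loading the base in 4-bit NF4 (mirroring the quantization used for training) may be an option. The following is a sketch under assumptions: it needs `bitsandbytes` and a CUDA GPU, `nf4_kwargs` and `load_router_4bit` are hypothetical helper names, float16 compute is chosen here for GPUs without bfloat16 support, and the accuracy figures in this card were not necessarily measured in this configuration.

```python
def nf4_kwargs():
    """Keyword arguments for BitsAndBytesConfig: 4-bit NF4 with double quantization."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
    }


def load_router_4bit(base_id="Qwen/Qwen2.5-7B-Instruct",
                     adapter_id="Ayush0110/toolforge-qwen7b-r64"):
    """Load the 4-bit base model and attach the adapter; returns (tokenizer, model)."""
    # Heavy imports stay inside the function so this module imports without a GPU stack.
    import torch
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

    # Assumption: float16 compute, for GPUs that lack bfloat16 support.
    bnb = BitsAndBytesConfig(bnb_4bit_compute_dtype=torch.float16, **nf4_kwargs())
    tok = AutoTokenizer.from_pretrained(base_id)
    base = AutoModelForCausalLM.from_pretrained(
        base_id, quantization_config=bnb, device_map="auto"
    )
    model = PeftModel.from_pretrained(base, adapter_id)
    model.eval()
    return tok, model
```

Calling `tok, model = load_router_4bit()` would then replace the tokenizer and model loading lines in the example above; the prompt construction and `generate` call stay the same.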
Training details
| Setting | Value |
|---|---|
| Base | Qwen/Qwen2.5-7B-Instruct |
| Quantization | 4-bit NF4 + double quant |
| LoRA | r=64, α=128, dropout=0.05, targets: q,k,v,o,gate,up,down |
| Optimizer / LR | AdamW, 2e-4 cosine, 10% warmup |
| Batch | 4 × 4 grad-accum = 16 effective |
| Epochs | 3 (best at eval_loss ≈ 0.14) |
| Data | 1,173 examples (template-generated + Gemini-2.5-flash distilled) |
| Hardware | single T4 (16GB), ~2.4 h |
| Tracking | Weights & Biases |
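The batch and schedule rows imply a fairly short run. As a back-of-the-envelope sketch, the arithmetic below derives the step counts and a warmup-plus-cosine learning-rate curve from the table. It assumes all 1,173 examples are used for training (some may be held out for evaluation) and that the cosine decays to zero; the trainer's exact step count and schedule may differ slightly, and `lr_at` is a hypothetical helper.

```python
import math


def lr_at(step, total_steps, peak_lr=2e-4, warmup_frac=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed schedule shape)."""
    warmup_steps = int(total_steps * warmup_frac)
    if warmup_steps and step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


examples, per_device_batch, grad_accum, epochs = 1173, 4, 4, 3
effective_batch = per_device_batch * grad_accum          # 4 x 4 = 16
steps_per_epoch = math.ceil(examples / effective_batch)  # 74 if every example is trained on
total_steps = steps_per_epoch * epochs                   # 222 optimizer steps
lora_scale = 128 / 64                                    # LoRA update scaling, alpha / r = 2.0

print(effective_batch, steps_per_epoch, total_steps, lora_scale)
for step in (0, 22, 122, 222):
    print(step, f"{lr_at(step, total_steps):.2e}")
```

Under these assumptions warmup covers roughly the first 22 steps, the rate peaks at 2e-4, and it has halved by about step 122.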
License
Apache-2.0 (inherits from the Qwen2.5 base model).