granite-4.1-3b-tool-selector
A LoRA adapter for ibm-granite/granite-4.1-3b that learns to pick the right function(s) to call from a list of available tools, given a natural-language query.
The adapter does selection only: it answers which tools to call, not what arguments to pass. Output is always a JSON object with a selected_tools list.
Inputs and outputs
Input (user turn, JSON-encoded):
{
  "query": "What's the weather in Paris and the time in Tokyo?",
  "tools": [
    {"name": "get_weather", "description": "Get the current weather for a city."},
    {"name": "get_time", "description": "Get the current local time for a city."},
    {"name": "send_email", "description": "Send an email to a recipient."}
  ]
}
Output (assistant turn, JSON):
{"selected_tools": ["get_weather", "get_time"]}
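Because the output is a single JSON object, downstream code can parse it defensively. The following is a minimal sketch, not part of the adapter's own tooling: the helper name `parse_selected_tools` is hypothetical, and dropping names that were not in the offered tool list is an assumption about how a caller might want to treat hallucinated tools.

```python
import json
from typing import List, Optional


def parse_selected_tools(raw: str, available: List[str]) -> Optional[List[str]]:
    """Return the selected tool names, or None if the output cannot be used."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None  # not valid JSON: count it as a parse failure
    if not isinstance(obj, dict):
        return None
    names = obj.get("selected_tools")
    if not isinstance(names, list) or not all(isinstance(n, str) for n in names):
        return None
    # Assumption: keep only names that were actually offered,
    # preserving order and dropping duplicates.
    allowed = set(available)
    seen = set()
    result = []
    for name in names:
        if name in allowed and name not in seen:
            seen.add(name)
            result.append(name)
    return result


tools = ["get_weather", "get_time", "send_email"]
print(parse_selected_tools('{"selected_tools": ["get_weather", "get_time"]}', tools))
```

Returning `None` rather than an empty list keeps "the model produced garbage" distinguishable from "the model selected nothing".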
The full prompt uses Granite's role-tagged tokens directly (the base tokenizer has no chat_template):
<|start_of_role|>system<|end_of_role|>You are a tool-selection assistant. Given a user query and a list of available tools, return the names of the tools that should be called.<|end_of_text|>
<|start_of_role|>user<|end_of_role|>{...JSON above...}<|end_of_text|>
<|start_of_role|>assistant<|end_of_role|>
How to use
import json

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained("ibm-granite/granite-4.1-3b", torch_dtype="bfloat16", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained("barha/granite-4.1-3b-tool-selector")
model = PeftModel.from_pretrained(base, "barha/granite-4.1-3b-tool-selector")
model.eval()

SYSTEM = "You are a tool-selection assistant. Given a user query and a list of available tools, return the names of the tools that should be called."
user_payload = {
    "query": "What's the weather in Paris?",
    "tools": [{"name": "get_weather", "description": "Get the current weather for a city."}],
}

# Build the prompt by hand with Granite's role tokens (the tokenizer has no chat_template).
prompt = (
    f"<|start_of_role|>system<|end_of_role|>{SYSTEM}<|end_of_text|>\n"
    f"<|start_of_role|>user<|end_of_role|>{json.dumps(user_payload)}<|end_of_text|>\n"
    f"<|start_of_role|>assistant<|end_of_role|>"
)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; the expected answer is a short JSON object.
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens and cut at the end-of-text marker.
gen = tokenizer.decode(out[0, inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(gen.split("<|end_of_text|>", 1)[0].strip())
Training data
Salesforce/xlam-function-calling-60k (gated dataset on Hugging Face), filtered to keep only rows whose answers list is non-empty (no-call abstentions are excluded). 90/10 train/val split, deterministic (no shuffle), seed 42.
For each row, the JSON-encoded tools list is reduced to {name, description} pairs (parameters dropped; the adapter routes by name only) and concatenated with the query in the user turn.
After filtering: 54,000 train / 6,000 validation examples.
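The preprocessing described above could look roughly like the sketch below. It is not the actual training script. It assumes each dataset row carries `query`, `tools`, and `answers` fields with the latter two stored as JSON strings, that the target lists each called tool name once in order of first appearance, and that the deterministic split takes the first 90% of rows for training. The helper names `build_example` and `split_90_10` are hypothetical.

```python
import json
from typing import Optional


def build_example(row: dict) -> Optional[dict]:
    """Turn one raw row into a (user, assistant) text pair, or None if filtered out."""
    tools = json.loads(row["tools"])
    answers = json.loads(row["answers"])
    if not answers:
        return None  # no-call abstentions are excluded
    # Keep only name and description; parameter schemas are dropped.
    slim_tools = [{"name": t["name"], "description": t["description"]} for t in tools]
    # Assumption: the target lists each called tool once, in first-call order.
    selected = []
    for call in answers:
        if call["name"] not in selected:
            selected.append(call["name"])
    return {
        "user": json.dumps({"query": row["query"], "tools": slim_tools}),
        "assistant": json.dumps({"selected_tools": selected}),
    }


def split_90_10(examples: list) -> tuple:
    """Deterministic split without shuffling (assumed: first 90% is train)."""
    cut = len(examples) * 9 // 10
    return examples[:cut], examples[cut:]


train, val = split_90_10(list(range(60_000)))
print(len(train), len(val))  # 54000 6000
```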
Training procedure
- Base model: ibm-granite/granite-4.1-3b (bf16, ungated, Apache 2.0)
- Adapter: LoRA, r=8, alpha=16, dropout=0.05, bias="none"
- Target modules: q_proj, k_proj, v_proj, o_proj
- Trainable parameters: 5.2 M (0.17 % of base)
- Epochs: 3
- Optimizer: AdamW, lr=2e-4, cosine schedule, warmup ratio 0.03
- Effective batch size: 16 (4 GPUs × per-device 4 × accum 1)
- Max sequence length: 2048
- Mixed precision: bf16
- Gradient checkpointing: non-reentrant
- Hardware: 4× NVIDIA A100 80GB (single node)
- Wall-clock: ~78 min training (post-training eval was abandoned)
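For readers who want to set up a comparable run, the adapter settings above map onto PEFT's `LoraConfig` arguments roughly as follows. This is a sketch of the mapping, not the original training configuration: the `task_type` value is an assumption (the card does not state it), and the dictionary is kept free of imports so it runs anywhere.

```python
# Keyword arguments named after peft.LoraConfig fields; values come from the list above.
lora_kwargs = {
    "r": 8,
    "lora_alpha": 16,
    "lora_dropout": 0.05,
    "bias": "none",
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",  # assumption: not stated in the card
}

# Effective batch size = GPUs x per-device batch x gradient accumulation steps.
num_gpus, per_device_batch, grad_accum = 4, 4, 1
effective_batch = num_gpus * per_device_batch * grad_accum

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_kwargs["lora_alpha"] / lora_kwargs["r"]

print(effective_batch, lora_scale)  # 16 2.0
```

With PEFT installed, `LoraConfig(**lora_kwargs)` followed by `get_peft_model(base_model, config)` would wrap a loaded base model with these settings.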
Loss curve
| epoch | step | train loss | eval loss |
|---|---|---|---|
| 0.41 | 1400 | 0.55 | 0.554 |
| 0.95 | 3200 | 0.39 | - |
| 1.41 | 4760 | 0.38 | - |
| 2.03 | 6850 | 0.32 | - |
| 2.84 | 9580 | 0.33 | 0.383 |
| 2.96 | 10000 | 0.31 | 0.383 |
Eval loss dropped from 0.554 to 0.383 with no signs of overfitting at the end of epoch 3.
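The epoch column in the table can be cross-checked against the step column with a little arithmetic. The sketch below assumes one training example per sequence (no packing) and uses the 54,000 training examples and effective batch size of 16 reported in this card.

```python
train_examples = 54_000
effective_batch = 16
epochs = 3

steps_per_epoch = train_examples // effective_batch  # 3375, divides evenly
total_steps = epochs * steps_per_epoch  # 10125


def epoch_at(step: int) -> float:
    """Fractional epoch reached after a given optimizer step."""
    return round(step / steps_per_epoch, 2)


for step in (1400, 3200, 4760, 6850, 9580, 10000):
    print(step, epoch_at(step))
```

Under these assumptions the computed epochs match the table (0.41, 0.95, 1.41, 2.03, 2.84, 2.96), and the last logged step, 10,000, falls just short of the 10,125 total.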
Evaluation
Greedy generation on the held-out 6,000-example val split (same 90/10 deterministic slice of xlam-function-calling-60k used during training; the adapter never saw these examples). Predictions parsed from the JSON selected_tools field, scored as sets of tool names.
| metric | value |
|---|---|
| exact set match | 0.9930 (5,958 / 6,000) |
| macro F1 | 0.9960 |
| precision | 0.9963 |
| recall | 0.9960 |
| parse failure rate | 0.0007 (4 / 6,000) |
Generation wall-time: ~15.4 min on 1× A100 (bs=8, max_new_tokens=128, greedy).
Reproduce with train/tool_selector/eval_adapter.py and the train/jobs/tool-selector-eval.yaml AppWrapper.
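For a quick check on your own data, set-based scoring can be sketched as below. This is not the eval_adapter.py script referenced above. It assumes that precision, recall, and F1 are computed per example and then averaged, and that a parse failure is scored as an empty prediction; the card does not spell out either detail, and the function name `score` is hypothetical.

```python
from typing import List, Optional, Set


def score(preds: List[Optional[Set[str]]], golds: List[Set[str]]) -> dict:
    """Score predicted tool-name sets against gold sets.

    A prediction of None marks a parse failure and is scored as an empty set.
    """
    n = len(golds)
    exact = parse_fail = 0
    p_sum = r_sum = f_sum = 0.0
    for pred, gold in zip(preds, golds):
        if pred is None:
            parse_fail += 1
            pred = set()
        tp = len(pred & gold)  # tools that were both predicted and expected
        p = tp / len(pred) if pred else 0.0
        r = tp / len(gold) if gold else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        exact += int(pred == gold)
        p_sum += p
        r_sum += r
        f_sum += f
    return {
        "exact_set_match": exact / n,
        "precision": p_sum / n,
        "recall": r_sum / n,
        "macro_f1": f_sum / n,
        "parse_failure_rate": parse_fail / n,
    }


preds = [{"a", "b"}, {"a"}, None]
golds = [{"a", "b"}, {"a", "c"}, {"x"}]
print(score(preds, golds))
```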
Caveat: in-distribution only. Train and val are both deterministic slices of the same dataset, so these numbers are an upper bound under matched distribution. They do not speak to out-of-distribution (OOD) generalization (different tool taxonomies, ambiguous queries, adversarial tool lists). Run your own behavioral eval against your real tool set before relying on this adapter in production.