---
base_model: Qwen/Qwen2.5-0.5B-Instruct
tags:
- text-generation-inference
- transformers
- unsloth
- qwen2
license: apache-2.0
language:
- en
datasets:
- quotientai/limbic-eval-tool-use-mcp
---

# Limbic-Tool-Use MCP Function Call Evaluator

This model is a fine-tuned version of Qwen2.5-0.5B-Instruct specifically designed for evaluating function calls in the context of Model Context Protocol (MCP) tools. It can assess whether a function call is correct, uses the wrong tool, has incorrect parameter names, or has incorrect parameter values.

## Model Details

- **Base Model**: [Qwen/Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct)
- **Fine-tuning Method**: LoRA (Low-Rank Adaptation)
- **Task**: Function Call Evaluation for MCP (Model Context Protocol)
- **Training Data**: MCP Server Tools data from public MCP servers, with augmentation / synthetic data generation
- **Model Size**: ~40MB (LoRA adapters only)
- **Context Length**: 32,768 tokens

# Model Usage

## Model Prompts

The prompt for the model takes two inputs: 
- `available_tools` - a list of the tool schemas
- `message_history` - the user request and the model's tool-call response, as a list of JSON objects

```python
EVALUATOR_PROMPT = """\
# TOOL CALL EVALUATION RUBRIC

## EVALUATION CRITERIA

### 1. TOOL SELECTION
- [ ] Function name exists in available tools
- [ ] Function purpose matches user intent

### 2. PARAMETER STRUCTURE  
- [ ] All required and relevant parameters are present
- [ ] No hallucinated parameter names
- [ ] Parameter names match tool schema exactly

### 3. PARAMETER VALUES
- [ ] Data types match expected types
- [ ] Values align with user request
- [ ] No fabricated or incorrect values

## CLASSIFICATION RULES
- All criteria passed → `correct`
- Failed criteria 1 → `incorrect_tool`
- Failed criteria 2 → `incorrect_parameter_names`  
- Failed criteria 3 → `incorrect_parameter_values`

---
### AVAILABLE TOOLS
{available_tools}

---
### MESSAGE HISTORY
{message_history}

---
## OUTPUT REQUIREMENT
{{
    "score": < correct | incorrect_tool | incorrect_parameter_names | incorrect_parameter_values >,
    "reason": < [if incorrect, provide a brief list of reasons] >
}}

### EVALUATION:
"""
```
```python
SYSTEM_PROMPT = "You are an expert evaluator of function calls. You will be given a function call and a list of available tools. You will need to evaluate the function call and return a score and a reason for the score."
```

### Example Inputs
```python
available_tools = [
    {
        "name": "google-play-developer",
        "description": "Get apps by a developer on Google Play",
        "input_schema": {
            "type": "object",
            "properties": {
                "devId": {"type": "string", "description": "Developer ID"},
                "num": {"type": "number", "default": 60, "description": "Number of results"},
                "lang": {"type": "string", "default": "en", "description": "Language code"},
                "country": {"type": "string", "default": "us", "description": "Country code"}
            },
            "required": ["devId"]
        }
    }
]

message_history = [
    {"role": "user", "content": "I'm looking to evaluate the performance of all the apps developed by 'Example Developer' on the Google Play Store. Could you provide me with a list of their recent applications, specifically in English and focused on the US market? Please limit the results to 50 apps for a quicker review."},
    {"role": "assistant", "content": {"function": {"name": "google-play-developer", "arguments": {"devId": "com.example.developer", "num": 50, "lang": "en", "country": "us"}}}}
]
```
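
The formatted user prompt is built by serializing both inputs to JSON and filling the template's placeholders. A minimal sketch (the stand-in template below is abbreviated for illustration; in practice use the full `EVALUATOR_PROMPT`, `available_tools`, and `message_history` defined above):

```python
import json

# Abbreviated stand-ins for illustration only; substitute the full
# EVALUATOR_PROMPT template and example inputs shown above.
EVALUATOR_PROMPT = (
    "### AVAILABLE TOOLS\n{available_tools}\n\n"
    "### MESSAGE HISTORY\n{message_history}\n\n"
    "### EVALUATION:\n"
)
available_tools = [{"name": "google-play-developer"}]
message_history = [{"role": "user", "content": "List apps by Example Developer"}]

# JSON-serialize both inputs so the model sees well-formed schemas
user_prompt = EVALUATOR_PROMPT.format(
    available_tools=json.dumps(available_tools, indent=2),
    message_history=json.dumps(message_history, indent=2),
)
```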

## Output Format
The model outputs evaluations in JSON format:

```json
{
    "score": "correct|incorrect_tool|incorrect_parameter_names|incorrect_parameter_values",
    "reason": ["reasons for failure if incorrect"]
}
```

### Score Categories

- **correct**: Function call matches available tools and parameters exactly
- **incorrect_tool**: Function name doesn't exist in available tools
- **incorrect_parameter_names**: Function exists but parameter names are wrong
- **incorrect_parameter_values**: Function and parameters exist but values are inappropriate
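
Because the generation is free-form text that should contain a JSON object, it is worth parsing it defensively. A hypothetical `parse_evaluation` helper (not part of the model's tooling, just a sketch) might look like:

```python
import json

def parse_evaluation(raw: str) -> dict:
    """Parse the model's JSON evaluation, tolerating surrounding text."""
    # The generation may include text around the JSON object, so extract
    # the first-to-last brace span before decoding.
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found in model output")
    evaluation = json.loads(raw[start:end + 1])
    if evaluation.get("score") not in {
        "correct", "incorrect_tool",
        "incorrect_parameter_names", "incorrect_parameter_values",
    }:
        raise ValueError(f"unexpected score: {evaluation.get('score')!r}")
    return evaluation
```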


## Load the Model
```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
model = AutoModelForCausalLM.from_pretrained("quotientai/limbic-tool-use-0.5B-32K")
```

## Generate a Prediction
To make a prediction, format the evaluator prompt with your inputs and wrap it in the model's chat template.
```python
chat_template = [
  {"role": "system", "content": SYSTEM_PROMPT},
  {"role": "user", "content": "<your-formatted-user-prompt>"}
]
# Apply the chat template
text = tokenizer.apply_chat_template(chat_template, tokenize=False, add_generation_prompt=True)

# Tokenize with truncation; keep inputs on the same device as the model
inputs = tokenizer(text, return_tensors="pt", truncation=True).to(model.device)

# Generate your prediction
result = model.generate(**inputs, max_new_tokens=128, use_cache=True)

# Decode only the newly generated tokens
prediction = tokenizer.decode(result[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

## Citation
```bibtex
@misc{limbic-tool-use-0.5B-32K,
  title={Limbic Tool Use Evaluator},
  author={QuotientAI},
  year={2025},
  url={https://huggingface.co/quotientai/limbic-tool-use-0.5B-32K}
}
```