Commit 4966301 by ztwang (verified) · 1 Parent(s): 7b6b43e

Upload 22 files

.gitattributes CHANGED
@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
33
  *.zip filter=lfs diff=lfs merge=lfs -text
34
  *.zst filter=lfs diff=lfs merge=lfs -text
35
  *tfevents* filter=lfs diff=lfs merge=lfs -text
36
+ mcp-bench.png filter=lfs diff=lfs merge=lfs -text
37
+ ranking.png filter=lfs diff=lfs merge=lfs -text
README.md CHANGED
@@ -1,12 +1,159 @@
1
- ---
2
- title: Mcp Bench
3
- emoji: 🌍
4
- colorFrom: red
5
- colorTo: gray
6
- sdk: static
7
- pinned: false
8
- license: mit
9
- short_description: Benchmarking Tool-Using LLM Agents with Real-World Tasks via
10
  ---
11
 
12
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
 
1
+ # MCP Benchmark Leaderboard
2
+
3
+ A modern, interactive web application that displays performance metrics for Large Language Models (LLMs) on MCP-Bench, a benchmark built on the Model Context Protocol (MCP).
4
+
5
+ ## 🏆 Features
6
+
7
+ - **Interactive Leaderboard**: Sort by any metric column
8
+ - **Real-time Search**: Filter models by name
9
+ - **Responsive Design**: Optimized for desktop and mobile
10
+ - **Visual Indicators**: Color-coded performance levels and progress bars
11
+ - **Modern UI**: Clean, professional Material Design interface
12
+ - **Dark Mode Support**: Automatic dark/light theme detection
13
+
14
+ ## 📊 Metrics Displayed
15
+
16
+ The leaderboard shows comprehensive performance metrics:
17
+
18
+ - **Overall Score**: Combined performance metric
19
+ - **Valid Tool Schema**: Percentage of valid tool schemas
20
+ - **Compliance**: Rule compliance percentage
21
+ - **Task Success**: Task completion success rate
22
+ - **Schema Understanding**: Understanding of tool schemas
23
+ - **Task Completion**: Task completion effectiveness
24
+ - **Tool Usage**: Tool utilization efficiency
25
+ - **Planning Effectiveness**: Planning and execution quality
26
+
27
+ ## 🚀 Quick Start
28
+
29
+ ### Local Development
30
+
31
+ 1. Clone this repository
32
+ 2. Open `index.html` in your web browser
33
+ 3. Or serve using a local HTTP server:
34
+
35
+ ```bash
36
+ # Using Python
37
+ python -m http.server 8000
38
+
39
+ # Using Node.js
40
+ npx serve .
41
+
42
+ # Using PHP
43
+ php -S localhost:8000
44
+ ```
45
+
46
+ ### Hugging Face Spaces Deployment
47
+
48
+ This project is optimized for deployment on Hugging Face Spaces:
49
+
50
+ 1. Create a new Space on [Hugging Face](https://huggingface.co/spaces)
51
+ 2. Choose **Gradio** as the SDK
52
+ 3. Upload all files to your Space
53
+ 4. Rename `requirements-hf.txt` to `requirements.txt`
54
+ 5. Your Space will automatically build and deploy
55
+
56
+ The `app.py` file provides Gradio integration for Hugging Face Spaces compatibility.
57
+
58
+ ## 📁 Project Structure
59
+
60
+ ```
61
+ mcp-bench-leaderboard/
62
+ ├── index.html # Main HTML page
63
+ ├── style.css # Responsive CSS styling
64
+ ├── script.js # Interactive JavaScript functionality
65
+ ├── data.json # Leaderboard data
66
+ ├── app.py # Gradio app for HF Spaces
67
+ ├── requirements-hf.txt # Dependencies for HF deployment
68
+ └── README.md # Documentation
69
+ ```
70
+
71
+ ## 🎨 Customization
72
+
73
+ ### Update Data
74
+
75
+ Modify `data.json` to add new models or update scores:
76
+
77
+ ```json
78
+ {
79
+ "lastUpdated": "2025-09-05",
80
+ "models": [
81
+ {
82
+ "name": "your-model-name",
83
+ "overall_score": 0.750,
84
+ "valid_tool_schema": 99.5,
85
+ "compliance": 98.2,
86
+ // ... other metrics
87
+ }
88
+ ]
89
+ }
90
+ ```
91
+
92
+ ### Styling
93
+
94
+ Edit `style.css` to customize:
95
+ - Colors and themes
96
+ - Layout and spacing
97
+ - Responsive breakpoints
98
+ - Animation effects
99
+
100
+ ### Functionality
101
+
102
+ Extend `script.js` to add:
103
+ - New sorting algorithms
104
+ - Additional filtering options
105
+ - Export functionality
106
+ - Chart visualizations
107
+
108
+ ## 🌐 Browser Support
109
+
110
+ - Chrome 60+
111
+ - Firefox 55+
112
+ - Safari 12+
113
+ - Edge 79+
114
+
115
+ ## 📱 Mobile Compatibility
116
+
117
+ The application is fully responsive and optimized for:
118
+ - Tablets (768px - 1024px)
119
+ - Mobile phones (320px - 767px)
120
+ - Large screens (1200px+)
121
+
122
+ ## 🔧 Technical Details
123
+
124
+ - **Pure Frontend**: No backend dependencies
125
+ - **Vanilla JavaScript**: No frameworks required
126
+ - **Modern CSS**: Flexbox, Grid, CSS Variables
127
+ - **Progressive Enhancement**: Static sections (paper info, figures, citation) render without JavaScript; the results table is populated by `script.js`
128
+ - **SEO Friendly**: Semantic HTML structure
129
+
130
+ ## 📈 Performance
131
+
132
+ - Lightweight (~50KB total, not counting the `mcp-bench.png` and `ranking.png` figures)
133
+ - Fast loading times
134
+ - Optimized images and assets
135
+ - Efficient DOM updates
136
+ - Smooth animations
137
+
138
+ ## 🤝 Contributing
139
+
140
+ 1. Fork the repository
141
+ 2. Create a feature branch
142
+ 3. Make your changes
143
+ 4. Test across browsers
144
+ 5. Submit a pull request
145
+
146
+ ## 📄 License
147
+
148
+ This project is open source and available under the MIT License.
149
+
150
+ ## 🙏 Acknowledgments
151
+
152
+ - Data sourced from MCP Benchmark Results
153
+ - Icons from Font Awesome
154
+ - Fonts from Google Fonts
155
+ - Hosted on Hugging Face Spaces
156
+
157
  ---
158
 
159
+ *Last updated: September 2025*
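
The README above describes editing `data.json` by hand to add models. A minimal sketch of a sanity check for such edits follows. The split into percentage-scale and 0-to-1-scale metrics is an assumption inferred from the values in `data.json`, and the names `validate_entry` and `validate_leaderboard` are hypothetical. Note that the README's `// ... other metrics` placeholder is not valid JSON, so a real file has to list the metrics explicitly.

```python
import json

# Assumed value ranges, inferred from the numbers in data.json (not documented).
BOUNDS = {
    "valid_tool_schema": 100.0,      # percentage
    "compliance": 100.0,             # percentage
    "task_success": 100.0,           # percentage
    "overall_score": 1.0,            # score in [0, 1]
    "schema_understanding": 1.0,
    "task_completion": 1.0,
    "tool_usage": 1.0,
    "planning_effectiveness": 1.0,
}


def validate_entry(entry):
    """Return a list of problems found in one model entry (empty means OK)."""
    problems = []
    name = entry.get("name")
    if not isinstance(name, str) or not name:
        problems.append("missing or empty 'name'")
    for key, upper in BOUNDS.items():
        value = entry.get(key)
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"'{key}' is missing or not a number")
        elif not 0.0 <= value <= upper:
            problems.append(f"'{key}'={value} outside [0, {upper:g}]")
    return problems


def validate_leaderboard(text):
    """Parse the JSON text and map each problematic model name to its problems."""
    data = json.loads(text)
    report = {}
    for entry in data.get("models", []):
        problems = validate_entry(entry)
        if problems:
            report[str(entry.get("name", "<unnamed>"))] = problems
    return report
```

Running `validate_leaderboard(open("data.json", encoding="utf-8").read())` before committing would surface entries that mix the two scales, such as a score entered as `75.0` instead of `0.750`.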
app.py ADDED
@@ -0,0 +1,45 @@
1
+ import gradio as gr
2
+ import os
3
+
4
+ def create_gradio_app():
5
+     """
6
+     Simple Gradio app to serve the static HTML leaderboard
7
+     This is required for Hugging Face Spaces deployment
8
+     """
9
+
10
+     # Read the HTML content
11
+     with open('index.html', 'r', encoding='utf-8') as f:
12
+         html_content = f.read()
13
+
14
+     # Read the CSS content
15
+     with open('style.css', 'r', encoding='utf-8') as f:
16
+         css_content = f.read()
17
+
18
+     # Read the JavaScript content
19
+     with open('script.js', 'r', encoding='utf-8') as f:
20
+         js_content = f.read()
21
+
22
+     # Combine everything into a single HTML page
23
+     combined_html = html_content.replace(
24
+         '<link rel="stylesheet" href="style.css">',
25
+         f'<style>{css_content}</style>'
26
+     ).replace(
27
+         '<script src="script.js"></script>',
28
+         f'<script>{js_content}</script>'
29
+     )
30
+
31
+     # Create the Gradio interface
32
+     with gr.Blocks(
33
+         title="MCP Benchmark Leaderboard",
34
+         theme=gr.themes.Soft(),
35
+     ) as demo:
36
+         gr.HTML(
37
+             combined_html,
38
+             elem_id="leaderboard-container"
39
+         )
40
+
41
+     return demo
42
+
43
+ if __name__ == "__main__":
44
+     demo = create_gradio_app()
45
+     demo.launch()
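
`app.py` inlines the stylesheet and script with exact-match `str.replace` calls, which silently do nothing if the tag text in `index.html` ever changes (for example, an added attribute or different quoting). A minimal sketch of a stricter variant is below. The helper names `inline_asset` and `build_page` are hypothetical and not part of the commit; the two tag strings are the ones `app.py` searches for.

```python
def inline_asset(html, tag, replacement):
    """Replace exactly one occurrence of `tag`, failing loudly otherwise."""
    count = html.count(tag)
    if count != 1:
        raise ValueError(f"expected exactly one {tag!r} in the page, found {count}")
    return html.replace(tag, replacement)


def build_page(html, css, js):
    """Inline the CSS and JS into the HTML, mirroring the replacements in app.py."""
    html = inline_asset(
        html,
        '<link rel="stylesheet" href="style.css">',
        f"<style>{css}</style>",
    )
    html = inline_asset(
        html,
        '<script src="script.js"></script>',
        f"<script>{js}</script>",
    )
    return html
```

With this variant, a mismatch between `index.html` and the expected tags raises a `ValueError` at startup instead of producing a page with no styling or behavior.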
data.json ADDED
@@ -0,0 +1,285 @@
1
+ {
2
+ "lastUpdated": "2025-09-05",
3
+ "models": [
4
+ {
5
+ "name": "llama-3-1-8b-instruct",
6
+ "overall_score": 0.428,
7
+ "valid_tool_schema": 96.1,
8
+ "compliance": 89.4,
9
+ "task_success": 90.9,
10
+ "schema_understanding": 0.261,
11
+ "task_completion": 0.295,
12
+ "tool_usage": 0.352,
13
+ "planning_effectiveness": 0.310,
14
+ "task_information": 0.221,
15
+ "tool_parameter": 0.141,
16
+ "dependency": 0.428
17
+ },
18
+ {
19
+ "name": "llama-3-2-90b-vision-instruct",
20
+ "overall_score": 0.495,
21
+ "valid_tool_schema": 99.6,
22
+ "compliance": 85.0,
23
+ "task_success": 90.9,
24
+ "schema_understanding": 0.293,
25
+ "task_completion": 0.444,
26
+ "tool_usage": 0.515,
27
+ "planning_effectiveness": 0.427,
28
+ "task_information": 0.267,
29
+ "tool_parameter": 0.173,
30
+ "dependency": 0.495
31
+ },
32
+ {
33
+ "name": "nova-micro-v1",
34
+ "overall_score": 0.508,
35
+ "valid_tool_schema": 96.0,
36
+ "compliance": 93.1,
37
+ "task_success": 87.8,
38
+ "schema_understanding": 0.339,
39
+ "task_completion": 0.419,
40
+ "tool_usage": 0.504,
41
+ "planning_effectiveness": 0.428,
42
+ "task_information": 0.315,
43
+ "tool_parameter": 0.212,
44
+ "dependency": 0.508
45
+ },
46
+ {
47
+ "name": "llama-3-1-70b-instruct",
48
+ "overall_score": 0.510,
49
+ "valid_tool_schema": 99.2,
50
+ "compliance": 90.5,
51
+ "task_success": 92.5,
52
+ "schema_understanding": 0.314,
53
+ "task_completion": 0.432,
54
+ "tool_usage": 0.523,
55
+ "planning_effectiveness": 0.451,
56
+ "task_information": 0.287,
57
+ "tool_parameter": 0.191,
58
+ "dependency": 0.510
59
+ },
60
+ {
61
+ "name": "mistral-small-2503",
62
+ "overall_score": 0.530,
63
+ "valid_tool_schema": 96.4,
64
+ "compliance": 95.6,
65
+ "task_success": 86.2,
66
+ "schema_understanding": 0.373,
67
+ "task_completion": 0.445,
68
+ "tool_usage": 0.537,
69
+ "planning_effectiveness": 0.446,
70
+ "task_information": 0.349,
71
+ "tool_parameter": 0.232,
72
+ "dependency": 0.530
73
+ },
74
+ {
75
+ "name": "gpt-4o-mini",
76
+ "overall_score": 0.557,
77
+ "valid_tool_schema": 97.5,
78
+ "compliance": 98.1,
79
+ "task_success": 93.9,
80
+ "schema_understanding": 0.374,
81
+ "task_completion": 0.500,
82
+ "tool_usage": 0.555,
83
+ "planning_effectiveness": 0.544,
84
+ "task_information": 0.352,
85
+ "tool_parameter": 0.201,
86
+ "dependency": 0.557
87
+ },
88
+ {
89
+ "name": "llama-3-3-70b-instruct",
90
+ "overall_score": 0.558,
91
+ "valid_tool_schema": 99.5,
92
+ "compliance": 93.8,
93
+ "task_success": 91.6,
94
+ "schema_understanding": 0.349,
95
+ "task_completion": 0.493,
96
+ "tool_usage": 0.583,
97
+ "planning_effectiveness": 0.525,
98
+ "task_information": 0.355,
99
+ "tool_parameter": 0.262,
100
+ "dependency": 0.558
101
+ },
102
+ {
103
+ "name": "gemma-3-27b-it",
104
+ "overall_score": 0.582,
105
+ "valid_tool_schema": 98.8,
106
+ "compliance": 97.6,
107
+ "task_success": 94.4,
108
+ "schema_understanding": 0.378,
109
+ "task_completion": 0.530,
110
+ "tool_usage": 0.608,
111
+ "planning_effectiveness": 0.572,
112
+ "task_information": 0.383,
113
+ "tool_parameter": 0.249,
114
+ "dependency": 0.582
115
+ },
116
+ {
117
+ "name": "gpt-4o",
118
+ "overall_score": 0.595,
119
+ "valid_tool_schema": 98.9,
120
+ "compliance": 98.3,
121
+ "task_success": 92.8,
122
+ "schema_understanding": 0.394,
123
+ "task_completion": 0.542,
124
+ "tool_usage": 0.627,
125
+ "planning_effectiveness": 0.587,
126
+ "task_information": 0.405,
127
+ "tool_parameter": 0.272,
128
+ "dependency": 0.595
129
+ },
130
+ {
131
+ "name": "gemini-2.5-flash-lite",
132
+ "overall_score": 0.598,
133
+ "valid_tool_schema": 99.4,
134
+ "compliance": 97.8,
135
+ "task_success": 94.3,
136
+ "schema_understanding": 0.412,
137
+ "task_completion": 0.577,
138
+ "tool_usage": 0.627,
139
+ "planning_effectiveness": 0.597,
140
+ "task_information": 0.404,
141
+ "tool_parameter": 0.226,
142
+ "dependency": 0.598
143
+ },
144
+ {
145
+ "name": "qwen3-30b-a3b-instruct-2507",
146
+ "overall_score": 0.627,
147
+ "valid_tool_schema": 99.0,
148
+ "compliance": 98.4,
149
+ "task_success": 92.3,
150
+ "schema_understanding": 0.481,
151
+ "task_completion": 0.530,
152
+ "tool_usage": 0.658,
153
+ "planning_effectiveness": 0.638,
154
+ "task_information": 0.473,
155
+ "tool_parameter": 0.303,
156
+ "dependency": 0.627
157
+ },
158
+ {
159
+ "name": "kimi-k2",
160
+ "overall_score": 0.629,
161
+ "valid_tool_schema": 98.8,
162
+ "compliance": 98.1,
163
+ "task_success": 94.5,
164
+ "schema_understanding": 0.502,
165
+ "task_completion": 0.577,
166
+ "tool_usage": 0.631,
167
+ "planning_effectiveness": 0.623,
168
+ "task_information": 0.448,
169
+ "tool_parameter": 0.307,
170
+ "dependency": 0.629
171
+ },
172
+ {
173
+ "name": "gpt-oss-20b",
174
+ "overall_score": 0.654,
175
+ "valid_tool_schema": 98.8,
176
+ "compliance": 99.1,
177
+ "task_success": 93.6,
178
+ "schema_understanding": 0.547,
179
+ "task_completion": 0.623,
180
+ "tool_usage": 0.661,
181
+ "planning_effectiveness": 0.638,
182
+ "task_information": 0.509,
183
+ "tool_parameter": 0.309,
184
+ "dependency": 0.654
185
+ },
186
+ {
187
+ "name": "glm-4.5",
188
+ "overall_score": 0.668,
189
+ "valid_tool_schema": 99.7,
190
+ "compliance": 99.7,
191
+ "task_success": 97.4,
192
+ "schema_understanding": 0.525,
193
+ "task_completion": 0.682,
194
+ "tool_usage": 0.680,
195
+ "planning_effectiveness": 0.661,
196
+ "task_information": 0.523,
197
+ "tool_parameter": 0.297,
198
+ "dependency": 0.668
199
+ },
200
+ {
201
+ "name": "qwen3-235b-a22b-2507",
202
+ "overall_score": 0.678,
203
+ "valid_tool_schema": 99.1,
204
+ "compliance": 99.3,
205
+ "task_success": 94.8,
206
+ "schema_understanding": 0.549,
207
+ "task_completion": 0.625,
208
+ "tool_usage": 0.688,
209
+ "planning_effectiveness": 0.712,
210
+ "task_information": 0.542,
211
+ "tool_parameter": 0.355,
212
+ "dependency": 0.678
213
+ },
214
+ {
215
+ "name": "claude-sonnet-4",
216
+ "overall_score": 0.681,
217
+ "valid_tool_schema": 100.0,
218
+ "compliance": 99.8,
219
+ "task_success": 98.8,
220
+ "schema_understanding": 0.554,
221
+ "task_completion": 0.676,
222
+ "tool_usage": 0.689,
223
+ "planning_effectiveness": 0.671,
224
+ "task_information": 0.541,
225
+ "tool_parameter": 0.328,
226
+ "dependency": 0.681
227
+ },
228
+ {
229
+ "name": "gemini-2.5-pro",
230
+ "overall_score": 0.690,
231
+ "valid_tool_schema": 99.4,
232
+ "compliance": 99.6,
233
+ "task_success": 96.9,
234
+ "schema_understanding": 0.562,
235
+ "task_completion": 0.725,
236
+ "tool_usage": 0.717,
237
+ "planning_effectiveness": 0.670,
238
+ "task_information": 0.541,
239
+ "tool_parameter": 0.329,
240
+ "dependency": 0.690
241
+ },
242
+ {
243
+ "name": "gpt-oss-120b",
244
+ "overall_score": 0.692,
245
+ "valid_tool_schema": 97.7,
246
+ "compliance": 98.8,
247
+ "task_success": 94.0,
248
+ "schema_understanding": 0.636,
249
+ "task_completion": 0.705,
250
+ "tool_usage": 0.691,
251
+ "planning_effectiveness": 0.661,
252
+ "task_information": 0.576,
253
+ "tool_parameter": 0.329,
254
+ "dependency": 0.692
255
+ },
256
+ {
257
+ "name": "o3",
258
+ "overall_score": 0.715,
259
+ "valid_tool_schema": 99.3,
260
+ "compliance": 99.9,
261
+ "task_success": 97.1,
262
+ "schema_understanding": 0.641,
263
+ "task_completion": 0.706,
264
+ "tool_usage": 0.724,
265
+ "planning_effectiveness": 0.726,
266
+ "task_information": 0.592,
267
+ "tool_parameter": 0.359,
268
+ "dependency": 0.715
269
+ },
270
+ {
271
+ "name": "gpt-5",
272
+ "overall_score": 0.749,
273
+ "valid_tool_schema": 100.0,
274
+ "compliance": 99.3,
275
+ "task_success": 99.1,
276
+ "schema_understanding": 0.677,
277
+ "task_completion": 0.828,
278
+ "tool_usage": 0.767,
279
+ "planning_effectiveness": 0.749,
280
+ "task_information": 0.649,
281
+ "tool_parameter": 0.339,
282
+ "dependency": 0.749
283
+ }
284
+ ]
285
+ }
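
The entries in `data.json` are stored in ascending order of `overall_score`, while a leaderboard is normally read from best to worst. The sketch below shows one way to derive ranks from the loaded entries; `rank_models` is a hypothetical helper, and breaking ties by model name is an assumption rather than something the commit specifies.

```python
def rank_models(models, key="overall_score", descending=True):
    """Return (rank, name, value) tuples ordered by `key`, ties broken by name."""
    sign = -1.0 if descending else 1.0
    ordered = sorted(models, key=lambda m: (sign * m[key], m["name"]))
    return [(rank, m["name"], m[key]) for rank, m in enumerate(ordered, start=1)]


# Three entries copied from data.json above (only the fields used here).
sample = [
    {"name": "gpt-oss-120b", "overall_score": 0.692},
    {"name": "o3", "overall_score": 0.715},
    {"name": "gpt-5", "overall_score": 0.749},
]

for rank, name, score in rank_models(sample):
    print(f"{rank}. {name}: {score:.3f}")
```

Passing a different `key`, such as `"task_success"`, ranks by that metric instead, which is the same operation the page's "Sort by" control exposes.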
index.html CHANGED
@@ -3,176 +3,161 @@
3
  <head>
4
  <meta charset="UTF-8">
5
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>MCP-BENCH: Benchmarking Tool-Using LLM Agents</title>
7
- <script src="https://cdn.tailwindcss.com"></script>
8
- <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
9
- <style>
10
- body {
11
- font-family: 'Inter', sans-serif;
12
- }
13
- .gradient-text {
14
- background: linear-gradient(to right, #4f46e5, #ec4899);
15
- -webkit-background-clip: text;
16
- -webkit-text-fill-color: transparent;
17
- }
18
- .table-hover tr:hover {
19
- background-color: #f9fafb;
20
- }
21
- </style>
22
  </head>
23
- <body class="bg-gray-50 text-gray-800">
24
-
25
- <div class="container mx-auto px-4 py-8 md:py-16 max-w-5xl">
26
-
27
- <!-- Header Section -->
28
- <header class="text-center mb-12">
29
- <h1 class="text-4xl md:text-5xl font-bold text-gray-900 mb-2">
30
- MCP-BENCH
31
- </h1>
32
- <h2 class="text-lg md:text-xl text-gray-600">
33
- Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers
34
- </h2>
 
35
  </header>
36
 
37
- <!-- Leaderboard Section -->
38
- <section id="leaderboard" class="mb-16">
39
- <h3 class="text-2xl md:text-3xl font-bold text-center mb-8 gradient-text">Leaderboard</h3>
40
- <div class="overflow-x-auto bg-white rounded-lg shadow-lg">
41
- <table class="min-w-full text-sm text-left text-gray-600">
42
- <thead class="bg-gray-100 text-xs text-gray-700 uppercase tracking-wider">
43
- <tr>
44
- <th scope="col" class="px-6 py-3 font-semibold">Rank</th>
45
- <th scope="col" class="px-6 py-3 font-semibold">Model</th>
46
- <th scope="col" class="px-6 py-3 font-semibold text-center">Overall Score</th>
47
- <th scope="col" class="px-6 py-3 font-semibold text-center">Task Fulfillment</th>
48
- <th scope="col" class="px-6 py-3 font-semibold text-center">Graph Exact Match</th>
49
- </tr>
50
- </thead>
51
- <!--
52
- LEADERBOARD DATA
53
- To update the leaderboard, edit the rows (<tr>...</tr>) below.
54
- Each row represents a model. The columns are:
55
- 1. Rank (#)
56
- 2. Model Name
57
- 3. Overall Score
58
- 4. Task Fulfillment (LLM Judge)
59
- 5. Graph Exact Match
60
- Make sure the data is sorted by the Overall Score in descending order.
61
- -->
62
- <tbody class="divide-y divide-gray-200 table-hover">
63
- <!-- Rank 1 -->
64
- <tr class="border-b border-gray-200">
65
- <td class="px-6 py-4 font-bold text-lg text-gray-900">1</td>
66
- <td class="px-6 py-4 font-semibold text-gray-900">GPT-4o-mini</td>
67
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.691</td>
68
- <td class="px-6 py-4 text-center">0.77</td>
69
- <td class="px-6 py-4 text-center">52.4%</td>
70
- </tr>
71
- <!-- Rank 2 -->
72
- <tr class="border-b border-gray-200">
73
- <td class="px-6 py-4 font-bold text-lg text-gray-900">2</td>
74
- <td class="px-6 py-4 font-semibold text-gray-900">Qwen-3-32b</td>
75
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.631</td>
76
- <td class="px-6 py-4 text-center">0.57</td>
77
- <td class="px-6 py-4 text-center">47.8%</td>
78
- </tr>
79
- <!-- Rank 3 -->
80
- <tr class="border-b border-gray-200">
81
- <td class="px-6 py-4 font-bold text-lg text-gray-900">3</td>
82
- <td class="px-6 py-4 font-semibold text-gray-900">DeepSeek-R1-Qwen-32b</td>
83
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.587</td>
84
- <td class="px-6 py-4 text-center">0.52</td>
85
- <td class="px-6 py-4 text-center">43.5%</td>
86
- </tr>
87
- <!-- Rank 4 -->
88
- <tr class="border-b border-gray-200">
89
- <td class="px-6 py-4 font-bold text-lg text-gray-900">4</td>
90
- <td class="px-6 py-4 font-semibold text-gray-900">Mistral-small-2403</td>
91
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.552</td>
92
- <td class="px-6 py-4 text-center">0.49</td>
93
- <td class="px-6 py-4 text-center">30.4%</td>
94
- </tr>
95
- <!-- Rank 5 -->
96
- <tr class="border-b border-gray-200">
97
- <td class="px-6 py-4 font-bold text-lg text-gray-900">5</td>
98
- <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-70b</td>
99
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.542</td>
100
- <td class="px-6 py-4 text-center">0.50</td>
101
- <td class="px-6 py-4 text-center">21.7%</td>
102
- </tr>
103
- <!-- Rank 6 -->
104
- <tr class="border-b border-gray-200">
105
- <td class="px-6 py-4 font-bold text-lg text-gray-900">6</td>
106
- <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-8b</td>
107
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.483</td>
108
- <td class="px-6 py-4 text-center">0.43</td>
109
- <td class="px-6 py-4 text-center">26.1%</td>
110
- </tr>
111
- <!-- Rank 7 -->
112
- <tr class="border-b border-gray-200">
113
- <td class="px-6 py-4 font-bold text-lg text-gray-900">7</td>
114
- <td class="px-6 py-4 font-semibold text-gray-900">Mistral-7b-v0.3</td>
115
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.423</td>
116
- <td class="px-6 py-4 text-center">0.50</td>
117
- <td class="px-6 py-4 text-center">0.0%</td>
118
- </tr>
119
- <!-- Rank 8 -->
120
- <tr class="border-b border-gray-200">
121
- <td class="px-6 py-4 font-bold text-lg text-gray-900">8</td>
122
- <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3-8b</td>
123
- <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.395</td>
124
- <td class="px-6 py-4 text-center">0.51</td>
125
- <td class="px-6 py-4 text-center">4.5%</td>
126
- </tr>
127
- </tbody>
128
- </table>
129
- </div>
130
- <p class="text-xs text-gray-500 text-center mt-4">Leaderboard data from Table 1 of the MCP-BENCH paper. Last updated: August 7, 2025.</p>
131
  </section>
132
 
133
- <!-- Abstract Section -->
134
- <section id="abstract" class="mb-16">
135
- <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Abstract</h3>
136
- <div class="bg-white p-8 rounded-lg shadow-md">
137
- <p class="text-gray-700 leading-relaxed">
138
- We introduce MCP-Bench, a new benchmark designed to evaluate large language models (LLMs) on realistic, multi-step tasks that require tool use, cross-tool coordination, and precise parameter control. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 31 live MCP servers spanning diverse real-world domains such as weather forecasting, stock analysis, scientific computing, and academic search. Tasks are structured as layered dependency graphs involving tools from one or more servers, testing an agent's ability to interpret tool schemas, plan coherent execution traces, retrieve relevant tools, and fill parameters with high structural and semantic fidelity. Unlike existing benchmarks, MCP-Bench targets real-world tool-use scenarios with complex input-output dependencies, diverse tool schemas, and multi-step reasoning requirements. We develop a multi-faceted evaluation framework that measures task success, tool-level execution accuracy, and alignment with ground-truth execution graphs. This includes metrics for tool name validity, schema compliance, graph exact match, structure-aware move distance, and semantic quality assessed by LLM-as-a-Judge. Experiments across 13 advanced LLMs—including GPT-4o, Claude 3, and LLaMA 3.1—reveal persistent challenges in long-horizon planning, tool reuse, and multi-server coordination. We release MCP-Bench, along with its evaluation toolkit, baseline results, and data synthesis pipeline, to enable robust and reproducible evaluation of agentic LLMs and to support future research on structured and scalable tool-based reasoning.
139
- </p>
140
- </div>
141
  </section>
142
-
143
- <!-- Links Section -->
144
- <section id="links" class="text-center mb-16">
145
- <div class="flex justify-center items-center space-x-4">
146
- <a href="#" onclick="alert('Paper download link not available yet.'); return false;" class="inline-block bg-indigo-600 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-indigo-700 transition-colors duration-300">
147
- Download Paper (PDF)
148
- </a>
149
- <a href="#" onclick="alert('Code repository link not available yet.'); return false;" class="inline-block bg-gray-700 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-gray-800 transition-colors duration-300">
150
- View Code on GitHub
151
- </a>
152
- </div>
 
153
  </section>
154
 
155
  <!-- Citation Section -->
156
- <section id="citation">
157
- <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Citation</h3>
158
- <div class="bg-gray-200 p-6 rounded-lg shadow-inner">
159
- <pre class="text-sm text-gray-800 whitespace-pre-wrap break-words"><code>@misc{mcpbench2025,
160
- title={{MCP-BENCH: Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers}},
161
- author={Your Name and Co-authors},
162
- year={2025},
163
- eprint={},
164
- archivePrefix={arXiv},
165
- primaryClass={cs.CL}
166
- }</code></pre>
 
167
  </div>
168
  </section>
169
 
 
 
 
 
170
  </div>
171
 
172
- <!-- Footer -->
173
- <footer class="text-center py-6 bg-gray-100 border-t border-gray-200">
174
- <p class="text-sm text-gray-500">&copy; 2025 MCP-BENCH Project. All Rights Reserved.</p>
175
- </footer>
176
-
177
  </body>
178
- </html>
 
3
  <head>
4
  <meta charset="UTF-8">
5
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
+ <title>MCP Benchmark Leaderboard</title>
7
+ <link rel="stylesheet" href="style.css">
8
+ <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
9
+ <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
10
  </head>
11
+ <body>
12
+ <div class="container">
13
+ <!-- Paper Information -->
14
+ <header class="paper-header">
15
+ <h1 class="paper-title">MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers</h1>
16
+ <div class="paper-authors">
17
+ <p>Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow</p>
18
+ <p class="affiliation">Accenture, UC Berkeley</p>
19
+ </div>
20
+ <div class="paper-links">
21
+ <a href="https://github.com/Accenture/mcp-bench" class="paper-link">
22
+ <i class="fab fa-github"></i> GitHub
23
+ </a>
24
+ <a href="https://arxiv.org/abs/2508.20453" class="paper-link">
25
+ <i class="fas fa-file-pdf"></i> Paper
26
+ </a>
27
+ <a href="#leaderboard" class="paper-link">
28
+ <i class="fas fa-trophy"></i> Leaderboard
29
+ </a>
30
+ </div>
31
  </header>
32
 
33
+ <!-- MCP Diagram -->
34
+ <section class="diagram-section">
35
+ <img src="mcp-bench.png" alt="MCP-Bench Architecture Diagram" class="diagram-image">
36
+ <p class="diagram-caption">
37
+ MCP-Bench is a comprehensive evaluation framework designed to assess Large Language Models' (LLMs) capabilities in tool-use scenarios through the Model Context Protocol (MCP). This benchmark provides an end-to-end pipeline for evaluating how effectively different LLMs can discover, select, and utilize tools to solve real-world tasks.
38
+ </p>
39
  </section>
40
 
41
+ <!-- Ranking Chart -->
42
+ <section class="chart-section">
43
+ <h2 class="section-title">Performance Ranking</h2>
44
+ <img src="ranking.png" alt="MCP Benchmark Ranking Chart" class="ranking-chart">
 
 
 
 
45
  </section>
46
+
47
+ <!-- Leaderboard Header -->
48
+ <section class="leaderboard-section" id="leaderboard">
49
+ <h2 class="section-title">Detailed Results</h2>
50
+
51
+ <div class="controls">
52
+ <div class="search-container">
53
+ <i class="fas fa-search"></i>
54
+ <input type="text" id="searchInput" placeholder="Search models..." class="search-input">
55
+ </div>
56
+
57
+ <div class="filter-container">
58
+ <label for="sortSelect">Sort by:</label>
59
+ <select id="sortSelect" class="sort-select">
60
+ <option value="overall_score">Overall Score</option>
61
+ <option value="valid_tool_schema">Valid Tool Schema</option>
62
+ <option value="compliance">Compliance</option>
63
+ <option value="task_success">Task Success</option>
64
+ <option value="schema_understanding">Schema Understanding</option>
65
+ <option value="task_completion">Task Completion</option>
66
+ <option value="tool_usage">Tool Usage</option>
67
+ <option value="planning_effectiveness">Planning Effectiveness</option>
68
+ </select>
69
+
70
+ <button id="sortOrder" class="sort-btn" title="Toggle sort order">
71
+ <i class="fas fa-sort-amount-down"></i>
72
+ </button>
73
+ </div>
74
+ </div>
75
+
76
+ <div class="table-container">
77
+ <table class="leaderboard-table" id="leaderboardTable">
78
+ <thead>
79
+ <tr>
80
+ <th class="model-col sortable" data-column="name">
81
+ <strong>Model</strong>
82
+ <i class="fas fa-sort sort-icon"></i>
83
+ </th>
84
+ <th class="score-col sortable" data-column="overall_score">
85
+ <strong>Overall Score</strong>
86
+ <i class="fas fa-sort sort-icon"></i>
87
+ </th>
88
+ <th class="metric-col sortable" data-column="valid_tool_name_rate">
89
+ Valid Tool<br>Name Rate
90
+ <i class="fas fa-sort sort-icon"></i>
91
+ </th>
92
+ <th class="metric-col sortable" data-column="schema_compliance">
93
+ Schema<br>Compliance
94
+ <i class="fas fa-sort sort-icon"></i>
95
+ </th>
96
+ <th class="metric-col sortable" data-column="execution_success">
97
+ Execution<br>Success
98
+ <i class="fas fa-sort sort-icon"></i>
99
+ </th>
100
+ <th class="metric-col sortable" data-column="task_fulfillment">
101
+ Task<br>Fulfillment
102
+ <i class="fas fa-sort sort-icon"></i>
103
+ </th>
104
+ <th class="metric-col sortable" data-column="information_grounding">
105
+ Information<br>Grounding
106
+ <i class="fas fa-sort sort-icon"></i>
107
+ </th>
108
+ <th class="metric-col sortable" data-column="tool_appropriateness">
109
+ Tool<br>Appropriateness
110
+ <i class="fas fa-sort sort-icon"></i>
111
+ </th>
112
+ <th class="metric-col sortable" data-column="parameter_accuracy">
113
+ Parameter<br>Accuracy
114
+ <i class="fas fa-sort sort-icon"></i>
115
+ </th>
116
+ <th class="metric-col sortable" data-column="dependency_awareness">
117
+ Dependency<br>Awareness
118
+ <i class="fas fa-sort sort-icon"></i>
119
+ </th>
120
+ <th class="metric-col sortable" data-column="parallelism_efficiency">
121
+ Parallelism<br>and Efficiency
122
+ <i class="fas fa-sort sort-icon"></i>
123
+ </th>
124
+ </tr>
125
+ </thead>
126
+ <tbody id="tableBody">
127
+ <!-- Table rows will be generated by JavaScript -->
128
+ </tbody>
129
+ </table>
130
+ </div>
131
+
132
+ <div class="loading" id="loading">
133
+ <i class="fas fa-spinner fa-spin"></i>
134
+ Loading leaderboard data...
135
+ </div>
136
+
137
  </section>
138
 
139
  <!-- Citation Section -->
140
+ <section class="citation-section">
141
+ <h2 class="section-title">Citation</h2>
142
+ <div class="citation-box">
143
+ <pre class="citation-text">@article{wang2024mcpbench,
144
+ title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
145
+ author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
146
+ journal={arXiv preprint arXiv:2508.20453},
147
+ year={2025}
148
+ }</pre>
149
+ <button class="copy-citation-btn" onclick="copyCitation()">
150
+ <i class="fas fa-copy"></i> Copy Citation
151
+ </button>
152
  </div>
153
  </section>
154
 
155
+ <footer class="footer">
156
+ <p>Last updated: <span id="lastUpdated"></span></p>
157
+ <p>Data source: MCP-Bench Results (ArXiv: 2508.20453)</p>
158
+ </footer>
159
  </div>
160
 
161
+ <script src="script.js"></script>
 
 
 
 
162
  </body>
163
+ </html>
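
In the new `index.html`, the `<option>` values in the "Sort by" dropdown (for example `valid_tool_schema`) differ from the `data-column` attributes on the table headers (for example `valid_tool_name_rate`). Whether `script.js` maps between the two sets of names is not visible in this excerpt, so the following is only a sketch of how such a mismatch could be detected; `KeyCollector` and `unmatched_sort_keys` are hypothetical names.

```python
from html.parser import HTMLParser


class KeyCollector(HTMLParser):
    """Collect <option value=...> and <th data-column=...> keys from a page."""

    def __init__(self):
        super().__init__()
        self.option_values = []
        self.data_columns = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "option" and "value" in attributes:
            self.option_values.append(attributes["value"])
        if tag == "th" and "data-column" in attributes:
            self.data_columns.append(attributes["data-column"])


def unmatched_sort_keys(html):
    """Return dropdown sort keys that have no matching table column."""
    collector = KeyCollector()
    collector.feed(html)
    return [v for v in collector.option_values if v not in collector.data_columns]
```

Calling `unmatched_sort_keys` on the contents of `index.html` would list the dropdown keys that no header declares, which is a quick check to run whenever metric names are renamed.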
logos/.DS_Store ADDED
Binary file (6.15 kB).
 
logos/alibaba_logo.png ADDED
logos/anthropic_logo.webp ADDED
logos/aws_logo.png ADDED
logos/google_logo.jpg ADDED
logos/grok.png ADDED
logos/kimi.png ADDED
logos/meta_logo.png ADDED
logos/meta_logo.svg ADDED
logos/mistral_logo.png ADDED
logos/oai_logo.webp ADDED
logos/zhipu.png ADDED
mcp-bench.png ADDED

Git LFS Details

  • SHA256: 2554ee1cf9dee51779e69966d7cf9449d9cd58fa3391bfeb764c296269125744
  • Pointer size: 131 Bytes
  • Size of remote file: 477 kB
ranking.png ADDED

Git LFS Details

  • SHA256: 34772d568e20d96875d8b9d8dabf4623070e1f85fa11f037c208cfb24163ddb8
  • Pointer size: 131 Bytes
  • Size of remote file: 545 kB
requirements-hf.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ gradio>=4.0.0
requirements.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ gradio
script.js ADDED
@@ -0,0 +1,604 @@
1
+ const LEADERBOARD_DATA = {
2
+ "lastUpdated": "2025-09-05",
3
+ "models": [
4
+ {
5
+ "name": "llama-3-1-8b-instruct",
6
+ "valid_tool_name_rate": 96.1,
7
+ "schema_compliance": 89.4,
8
+ "execution_success": 90.9,
9
+ "task_fulfillment": 0.261,
10
+ "information_grounding": 0.295,
11
+ "tool_appropriateness": 0.352,
12
+ "parameter_accuracy": 0.310,
13
+ "dependency_awareness": 0.221,
14
+ "parallelism_efficiency": 0.141,
15
+ "overall_score": 0.428
16
+ },
17
+ {
18
+ "name": "llama-3-2-90b-vision-instruct",
19
+ "valid_tool_name_rate": 99.6,
20
+ "schema_compliance": 85.0,
21
+ "execution_success": 90.9,
22
+ "task_fulfillment": 0.293,
23
+ "information_grounding": 0.444,
24
+ "tool_appropriateness": 0.515,
25
+ "parameter_accuracy": 0.427,
26
+ "dependency_awareness": 0.267,
27
+ "parallelism_efficiency": 0.173,
28
+ "overall_score": 0.495
29
+ },
30
+ {
31
+ "name": "nova-micro-v1",
32
+ "valid_tool_name_rate": 96.0,
33
+ "schema_compliance": 93.1,
34
+ "execution_success": 87.8,
35
+ "task_fulfillment": 0.339,
36
+ "information_grounding": 0.419,
37
+ "tool_appropriateness": 0.504,
38
+ "parameter_accuracy": 0.428,
39
+ "dependency_awareness": 0.315,
40
+ "parallelism_efficiency": 0.212,
41
+ "overall_score": 0.508
42
+ },
43
+ {
44
+ "name": "llama-3-1-70b-instruct",
45
+ "valid_tool_name_rate": 99.2,
46
+ "schema_compliance": 90.5,
47
+ "execution_success": 92.5,
48
+ "task_fulfillment": 0.314,
49
+ "information_grounding": 0.432,
50
+ "tool_appropriateness": 0.523,
51
+ "parameter_accuracy": 0.451,
52
+ "dependency_awareness": 0.287,
53
+ "parallelism_efficiency": 0.191,
54
+ "overall_score": 0.510
55
+ },
56
+ {
57
+ "name": "mistral-small-2503",
58
+ "valid_tool_name_rate": 96.4,
59
+ "schema_compliance": 95.6,
60
+ "execution_success": 86.2,
61
+ "task_fulfillment": 0.373,
62
+ "information_grounding": 0.445,
63
+ "tool_appropriateness": 0.537,
64
+ "parameter_accuracy": 0.446,
65
+ "dependency_awareness": 0.349,
66
+ "parallelism_efficiency": 0.232,
67
+ "overall_score": 0.530
68
+ },
69
+ {
70
+ "name": "gpt-4o-mini",
71
+ "valid_tool_name_rate": 97.5,
72
+ "schema_compliance": 98.1,
73
+ "execution_success": 93.9,
74
+ "task_fulfillment": 0.374,
75
+ "information_grounding": 0.500,
76
+ "tool_appropriateness": 0.555,
77
+ "parameter_accuracy": 0.544,
78
+ "dependency_awareness": 0.352,
79
+ "parallelism_efficiency": 0.201,
80
+ "overall_score": 0.557
81
+ },
82
+ {
83
+ "name": "llama-3-3-70b-instruct",
84
+ "valid_tool_name_rate": 99.5,
85
+ "schema_compliance": 93.8,
86
+ "execution_success": 91.6,
87
+ "task_fulfillment": 0.349,
88
+ "information_grounding": 0.493,
89
+ "tool_appropriateness": 0.583,
90
+ "parameter_accuracy": 0.525,
91
+ "dependency_awareness": 0.355,
92
+ "parallelism_efficiency": 0.262,
93
+ "overall_score": 0.558
94
+ },
95
+ {
96
+ "name": "gemma-3-27b-it",
97
+ "valid_tool_name_rate": 98.8,
98
+ "schema_compliance": 97.6,
99
+ "execution_success": 94.4,
100
+ "task_fulfillment": 0.378,
101
+ "information_grounding": 0.530,
102
+ "tool_appropriateness": 0.608,
103
+ "parameter_accuracy": 0.572,
104
+ "dependency_awareness": 0.383,
105
+ "parallelism_efficiency": 0.249,
106
+ "overall_score": 0.582
107
+ },
108
+ {
109
+ "name": "gpt-4o",
110
+ "valid_tool_name_rate": 98.9,
111
+ "schema_compliance": 98.3,
112
+ "execution_success": 92.8,
113
+ "task_fulfillment": 0.394,
114
+ "information_grounding": 0.542,
115
+ "tool_appropriateness": 0.627,
116
+ "parameter_accuracy": 0.587,
117
+ "dependency_awareness": 0.405,
118
+ "parallelism_efficiency": 0.272,
119
+ "overall_score": 0.595
120
+ },
121
+ {
122
+ "name": "gemini-2.5-flash-lite",
123
+ "valid_tool_name_rate": 99.4,
124
+ "schema_compliance": 97.8,
125
+ "execution_success": 94.3,
126
+ "task_fulfillment": 0.412,
127
+ "information_grounding": 0.577,
128
+ "tool_appropriateness": 0.627,
129
+ "parameter_accuracy": 0.597,
130
+ "dependency_awareness": 0.404,
131
+ "parallelism_efficiency": 0.226,
132
+ "overall_score": 0.598
133
+ },
134
+ {
135
+ "name": "qwen3-30b-a3b-instruct-2507",
136
+ "valid_tool_name_rate": 99.0,
137
+ "schema_compliance": 98.4,
138
+ "execution_success": 92.3,
139
+ "task_fulfillment": 0.481,
140
+ "information_grounding": 0.530,
141
+ "tool_appropriateness": 0.658,
142
+ "parameter_accuracy": 0.638,
143
+ "dependency_awareness": 0.473,
144
+ "parallelism_efficiency": 0.303,
145
+ "overall_score": 0.627
146
+ },
147
+ {
148
+ "name": "kimi-k2",
149
+ "valid_tool_name_rate": 98.8,
150
+ "schema_compliance": 98.1,
151
+ "execution_success": 94.5,
152
+ "task_fulfillment": 0.502,
153
+ "information_grounding": 0.577,
154
+ "tool_appropriateness": 0.631,
155
+ "parameter_accuracy": 0.623,
156
+ "dependency_awareness": 0.448,
157
+ "parallelism_efficiency": 0.307,
158
+ "overall_score": 0.629
159
+ },
160
+ {
161
+ "name": "gpt-oss-20b",
162
+ "valid_tool_name_rate": 98.8,
163
+ "schema_compliance": 99.1,
164
+ "execution_success": 93.6,
165
+ "task_fulfillment": 0.547,
166
+ "information_grounding": 0.623,
167
+ "tool_appropriateness": 0.661,
168
+ "parameter_accuracy": 0.638,
169
+ "dependency_awareness": 0.509,
170
+ "parallelism_efficiency": 0.309,
171
+ "overall_score": 0.654
172
+ },
173
+ {
174
+ "name": "glm-4.5",
175
+ "valid_tool_name_rate": 99.7,
176
+ "schema_compliance": 99.7,
177
+ "execution_success": 97.4,
178
+ "task_fulfillment": 0.525,
179
+ "information_grounding": 0.682,
180
+ "tool_appropriateness": 0.680,
181
+ "parameter_accuracy": 0.661,
182
+ "dependency_awareness": 0.523,
183
+ "parallelism_efficiency": 0.297,
184
+ "overall_score": 0.668
185
+ },
186
+ {
187
+ "name": "qwen3-235b-a22b-2507",
188
+ "valid_tool_name_rate": 99.1,
189
+ "schema_compliance": 99.3,
190
+ "execution_success": 94.8,
191
+ "task_fulfillment": 0.549,
192
+ "information_grounding": 0.625,
193
+ "tool_appropriateness": 0.688,
194
+ "parameter_accuracy": 0.712,
195
+ "dependency_awareness": 0.542,
196
+ "parallelism_efficiency": 0.355,
197
+ "overall_score": 0.678
198
+ },
199
+ {
200
+ "name": "claude-sonnet-4",
201
+ "valid_tool_name_rate": 100.0,
202
+ "schema_compliance": 99.8,
203
+ "execution_success": 98.8,
204
+ "task_fulfillment": 0.554,
205
+ "information_grounding": 0.676,
206
+ "tool_appropriateness": 0.689,
207
+ "parameter_accuracy": 0.671,
208
+ "dependency_awareness": 0.541,
209
+ "parallelism_efficiency": 0.328,
210
+ "overall_score": 0.681
211
+ },
212
+ {
213
+ "name": "gemini-2.5-pro",
214
+ "valid_tool_name_rate": 99.4,
215
+ "schema_compliance": 99.6,
216
+ "execution_success": 96.9,
217
+ "task_fulfillment": 0.562,
218
+ "information_grounding": 0.725,
219
+ "tool_appropriateness": 0.717,
220
+ "parameter_accuracy": 0.670,
221
+ "dependency_awareness": 0.541,
222
+ "parallelism_efficiency": 0.329,
223
+ "overall_score": 0.690
224
+ },
225
+ {
226
+ "name": "gpt-oss-120b",
227
+ "valid_tool_name_rate": 97.7,
228
+ "schema_compliance": 98.8,
229
+ "execution_success": 94.0,
230
+ "task_fulfillment": 0.636,
231
+ "information_grounding": 0.705,
232
+ "tool_appropriateness": 0.691,
233
+ "parameter_accuracy": 0.661,
234
+ "dependency_awareness": 0.576,
235
+ "parallelism_efficiency": 0.329,
236
+ "overall_score": 0.692
237
+ },
238
+ {
239
+ "name": "o3",
240
+ "valid_tool_name_rate": 99.3,
241
+ "schema_compliance": 99.9,
242
+ "execution_success": 97.1,
243
+ "task_fulfillment": 0.641,
244
+ "information_grounding": 0.706,
245
+ "tool_appropriateness": 0.724,
246
+ "parameter_accuracy": 0.726,
247
+ "dependency_awareness": 0.592,
248
+ "parallelism_efficiency": 0.359,
249
+ "overall_score": 0.715
250
+ },
251
+ {
252
+ "name": "gpt-5",
253
+ "valid_tool_name_rate": 100.0,
254
+ "schema_compliance": 99.3,
255
+ "execution_success": 99.1,
256
+ "task_fulfillment": 0.677,
257
+ "information_grounding": 0.828,
258
+ "tool_appropriateness": 0.767,
259
+ "parameter_accuracy": 0.749,
260
+ "dependency_awareness": 0.649,
261
+ "parallelism_efficiency": 0.339,
262
+ "overall_score": 0.749
263
+ }
264
+ ]
265
+ };
266
+
267
+ class LeaderboardApp {
268
+ constructor() {
269
+ this.data = null;
270
+ this.filteredData = null;
271
+ this.currentSort = { column: 'overall_score', ascending: false };
272
+
273
+ this.init();
274
+ }
275
+
276
+ async init() {
277
+ try {
278
+ this.loadData();
279
+ this.setupEventListeners();
280
+ this.renderTable();
281
+ this.updateLastUpdated();
282
+ } catch (error) {
283
+ console.error('Failed to initialize app:', error);
284
+ this.showError('Failed to load leaderboard data');
285
+ }
286
+ }
287
+
288
+ loadData() {
289
+ const loading = document.getElementById('loading');
290
+ loading.classList.add('active');
291
+
292
+ try {
293
+ this.data = LEADERBOARD_DATA;
294
+ this.filteredData = [...this.data.models];
295
+ this.sortData();
296
+ } finally {
297
+ loading.classList.remove('active');
298
+ }
299
+ }
300
+
301
+ setupEventListeners() {
302
+ const searchInput = document.getElementById('searchInput');
303
+ const sortSelect = document.getElementById('sortSelect');
304
+ const sortOrder = document.getElementById('sortOrder');
305
+ const tableHeaders = document.querySelectorAll('.sortable');
306
+
307
+ searchInput.addEventListener('input', (e) => this.handleSearch(e.target.value));
308
+ sortSelect.addEventListener('change', (e) => this.handleSortChange(e.target.value));
309
+ sortOrder.addEventListener('click', () => this.toggleSortOrder());
310
+
311
+ tableHeaders.forEach(header => {
312
+ header.addEventListener('click', () => {
313
+ const column = header.dataset.column;
314
+ this.handleColumnSort(column);
315
+ });
316
+ });
317
+ }
318
+
319
+ handleSearch(query) {
320
+ const searchTerm = query.toLowerCase().trim();
321
+
322
+ if (searchTerm === '') {
323
+ this.filteredData = [...this.data.models];
324
+ } else {
325
+ this.filteredData = this.data.models.filter(model =>
326
+ model.name.toLowerCase().includes(searchTerm)
327
+ );
328
+ }
329
+
330
+ this.sortData();
331
+ this.renderTable();
332
+ }
333
+
334
+ handleSortChange(column) {
335
+ this.currentSort.column = column;
336
+ this.sortData();
337
+ this.renderTable();
338
+ this.updateSortIndicators();
339
+ }
340
+
341
+ handleColumnSort(column) {
342
+ if (this.currentSort.column === column) {
343
+ this.currentSort.ascending = !this.currentSort.ascending;
344
+ } else {
345
+ this.currentSort.column = column;
346
+ this.currentSort.ascending = false;
347
+ }
348
+
349
+ document.getElementById('sortSelect').value = column;
350
+ this.sortData();
351
+ this.renderTable();
352
+ this.updateSortIndicators();
353
+ this.updateSortOrderButton();
354
+ }
355
+
356
+ toggleSortOrder() {
357
+ this.currentSort.ascending = !this.currentSort.ascending;
358
+ this.sortData();
359
+ this.renderTable();
360
+ this.updateSortOrderButton();
361
+ }
362
+
363
+ sortData() {
364
+ const { column, ascending } = this.currentSort;
365
+
366
+ this.filteredData.sort((a, b) => {
367
+ let aValue = a[column];
368
+ let bValue = b[column];
369
+
370
+ if (typeof aValue === 'string') {
371
+ aValue = aValue.toLowerCase();
372
+ bValue = bValue.toLowerCase();
373
+ }
374
+
375
+ let comparison = 0;
376
+ if (aValue > bValue) comparison = 1;
377
+ if (aValue < bValue) comparison = -1;
378
+
379
+ return ascending ? comparison : -comparison;
380
+ });
381
+ }
382
+
383
+ renderTable() {
384
+ const tableBody = document.getElementById('tableBody');
385
+
386
+ if (this.filteredData.length === 0) {
387
+ tableBody.innerHTML = `
388
+ <tr>
389
+ <td colspan="11" class="no-results">
390
+ <i class="fas fa-search"></i>
391
+ No models found matching your search criteria
392
+ </td>
393
+ </tr>
394
+ `;
395
+ return;
396
+ }
397
+
398
+ tableBody.innerHTML = this.filteredData
399
+ .map((model) => this.createTableRow(model))
400
+ .join('');
401
+ }
402
+
403
+ createTableRow(model) {
404
+ return `
405
+ <tr>
406
+ <td class="model-col">
407
+ <span class="model-name">${model.name}</span>
408
+ </td>
409
+ <td class="score-col">
410
+ <span class="score ${this.getScoreClass(model.overall_score)}">
411
+ ${model.overall_score.toFixed(3)}
412
+ </span>
413
+ </td>
414
+ <td class="metric-col">
415
+ ${this.createMetricCell(model.valid_tool_name_rate, true)}
416
+ </td>
417
+ <td class="metric-col">
418
+ ${this.createMetricCell(model.schema_compliance, true)}
419
+ </td>
420
+ <td class="metric-col">
421
+ ${this.createMetricCell(model.execution_success, true)}
422
+ </td>
423
+ <td class="metric-col">
424
+ ${this.createMetricCell(model.task_fulfillment)}
425
+ </td>
426
+ <td class="metric-col">
427
+ ${this.createMetricCell(model.information_grounding)}
428
+ </td>
429
+ <td class="metric-col">
430
+ ${this.createMetricCell(model.tool_appropriateness)}
431
+ </td>
432
+ <td class="metric-col">
433
+ ${this.createMetricCell(model.parameter_accuracy)}
434
+ </td>
435
+ <td class="metric-col">
436
+ ${this.createMetricCell(model.dependency_awareness)}
437
+ </td>
438
+ <td class="metric-col">
439
+ ${this.createMetricCell(model.parallelism_efficiency)}
440
+ </td>
441
+ </tr>
442
+ `;
443
+ }
444
+
445
+ createMetricCell(value, isPercentage = false) {
446
+ const displayValue = isPercentage ?
447
+ `${value.toFixed(1)}%` :
448
+ value.toFixed(3);
449
+
450
+ const normalizedValue = isPercentage ? value / 100 : value;
451
+ const scoreClass = this.getScoreClass(normalizedValue);
452
+ const barWidth = isPercentage ? value : (value * 100);
453
+
454
+ return `
455
+ <div class="metric" data-tooltip="${displayValue}">
456
+ <div class="metric-bar ${scoreClass}" style="width: ${barWidth}%"></div>
457
+ <span>${displayValue}</span>
458
+ </div>
459
+ `;
460
+ }
461
+
462
+ getRankClass(rank) {
463
+ if (rank === 1) return 'top-1';
464
+ if (rank <= 3) return 'top-3';
465
+ if (rank <= 5) return 'top-5';
466
+ return '';
467
+ }
468
+
469
+ getScoreClass(score) {
470
+ if (score >= 0.7) return 'excellent';
471
+ if (score >= 0.6) return 'good';
472
+ if (score >= 0.5) return 'average';
473
+ return 'poor';
474
+ }
475
+
476
+ updateSortIndicators() {
477
+ const headers = document.querySelectorAll('.sortable');
478
+ headers.forEach(header => {
479
+ header.classList.remove('active');
480
+ const icon = header.querySelector('.sort-icon');
481
+ icon.className = 'fas fa-sort sort-icon';
482
+ });
483
+
484
+ const activeHeader = document.querySelector(`[data-column="${this.currentSort.column}"]`);
485
+ if (activeHeader) {
486
+ activeHeader.classList.add('active');
487
+ const icon = activeHeader.querySelector('.sort-icon');
488
+ icon.className = this.currentSort.ascending ?
489
+ 'fas fa-sort-up sort-icon' :
490
+ 'fas fa-sort-down sort-icon';
491
+ }
492
+ }
493
+
494
+ updateSortOrderButton() {
495
+ const sortOrderButton = document.getElementById('sortOrder');
496
+ const icon = sortOrderButton.querySelector('i');
497
+ icon.className = this.currentSort.ascending ?
498
+ 'fas fa-sort-amount-up' :
499
+ 'fas fa-sort-amount-down';
500
+
501
+ sortOrderButton.title = this.currentSort.ascending ?
502
+ 'Sort descending' :
503
+ 'Sort ascending';
504
+ }
505
+
506
+ updateLastUpdated() {
507
+ const lastUpdatedElement = document.getElementById('lastUpdated');
508
+ if (this.data && this.data.lastUpdated) {
509
+ const date = new Date(this.data.lastUpdated);
510
+ lastUpdatedElement.textContent = date.toLocaleDateString('en-US', {
511
+ year: 'numeric',
512
+ month: 'long',
513
+ day: 'numeric'
514
+ });
515
+ } else {
516
+ lastUpdatedElement.textContent = new Date().toLocaleDateString('en-US', {
517
+ year: 'numeric',
518
+ month: 'long',
519
+ day: 'numeric'
520
+ });
521
+ }
522
+ }
523
+
524
+ showError(message) {
525
+ const tableBody = document.getElementById('tableBody');
526
+ tableBody.innerHTML = `
527
+ <tr>
528
+ <td colspan="11" class="no-results">
529
+ <i class="fas fa-exclamation-triangle"></i>
530
+ ${message}
531
+ </td>
532
+ </tr>
533
+ `;
534
+ }
535
+ }
536
+
537
+ // Copy citation function
538
+ function copyCitation() {
539
+ const citationText = `@article{wang2024mcpbench,
540
+ title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
541
+ author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
542
+ journal={arXiv preprint arXiv:2508.20453},
543
+ year={2024}
544
+ }`;
545
+
546
+ if (navigator.clipboard && window.isSecureContext) {
547
+ navigator.clipboard.writeText(citationText).then(() => {
548
+ showCopySuccess();
549
+ }).catch(err => {
550
+ console.error('Failed to copy citation:', err);
551
+ fallbackCopy(citationText);
552
+ });
553
+ } else {
554
+ fallbackCopy(citationText);
555
+ }
556
+ }
557
+
558
+ function fallbackCopy(text) {
559
+ const textArea = document.createElement('textarea');
560
+ textArea.value = text;
561
+ textArea.style.position = 'fixed';
562
+ textArea.style.left = '-999999px';
563
+ textArea.style.top = '-999999px';
564
+ document.body.appendChild(textArea);
565
+ textArea.focus();
566
+ textArea.select();
567
+
568
+ try {
569
+ document.execCommand('copy');
570
+ showCopySuccess();
571
+ } catch (err) {
572
+ console.error('Fallback copy failed:', err);
573
+ }
574
+
575
+ document.body.removeChild(textArea);
576
+ }
577
+
578
+ function showCopySuccess() {
579
+ const button = document.querySelector('.copy-citation-btn');
580
+ const originalText = button.innerHTML;
581
+ button.innerHTML = '<i class="fas fa-check"></i> Copied!';
582
+ button.style.backgroundColor = '#10b981';
583
+
584
+ setTimeout(() => {
585
+ button.innerHTML = originalText;
586
+ button.style.backgroundColor = '';
587
+ }, 2000);
588
+ }
589
+
590
+ document.addEventListener('DOMContentLoaded', () => {
591
+ new LeaderboardApp();
592
+ });
593
+
594
+ if ('serviceWorker' in navigator) {
595
+ window.addEventListener('load', () => {
596
+ navigator.serviceWorker.register('/sw.js')
597
+ .then((registration) => {
598
+ console.log('SW registered: ', registration);
599
+ })
600
+ .catch((registrationError) => {
601
+ console.log('SW registration failed: ', registrationError);
602
+ });
603
+ });
604
+ }
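The sorting and color-bucketing rules in `script.js` can be exercised outside the browser. The sketch below is a standalone approximation, not the committed code: `compareBy`, `scoreClass`, and `metricClass` are names introduced here for illustration. The thresholds and the percentage rescaling mirror `getScoreClass` and `createMetricCell`, and the two sample rows are copied from `LEADERBOARD_DATA`.

```javascript
// Standalone sketch of the leaderboard's sort and bucketing rules.
// compareBy, scoreClass and metricClass are illustrative names only.

// Build a comparator for one column; strings compare case-insensitively.
function compareBy(column, ascending) {
  return (a, b) => {
    let x = a[column];
    let y = b[column];
    if (typeof x === 'string' && typeof y === 'string') {
      x = x.toLowerCase();
      y = y.toLowerCase();
    }
    const cmp = x > y ? 1 : x < y ? -1 : 0;
    return ascending ? cmp : -cmp;
  };
}

// Map a score on the 0..1 scale to the CSS class used for coloring.
function scoreClass(score) {
  if (score >= 0.7) return 'excellent';
  if (score >= 0.6) return 'good';
  if (score >= 0.5) return 'average';
  return 'poor';
}

// Percentage metrics (0..100) are rescaled to 0..1 before bucketing.
function metricClass(value, isPercentage = false) {
  return scoreClass(isPercentage ? value / 100 : value);
}

const rows = [
  { name: 'gpt-4o', overall_score: 0.595, execution_success: 92.8 },
  { name: 'gpt-5', overall_score: 0.749, execution_success: 99.1 },
];

// Copy before sorting so the source array keeps its original order.
const ranked = [...rows].sort(compareBy('overall_score', false));
console.log(ranked.map((r) => r.name));
console.log(scoreClass(0.595), metricClass(92.8, true));
```

One consequence of sharing the thresholds across both scales: every percentage value in the data above is at least 85.0, so all three percentage columns are colored `excellent` for every model, and the color only discriminates on the 0 to 1 metrics.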
style.css CHANGED
@@ -1,28 +1,625 @@
1
  body {
2
- padding: 2rem;
3
- font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
4
  }
5
 
6
- h1 {
7
- font-size: 16px;
8
- margin-top: 0;
 
 
 
9
  }
10
 
11
- p {
12
- color: rgb(107, 114, 128);
13
- font-size: 15px;
14
- margin-bottom: 10px;
15
- margin-top: 5px;
16
  }
17
 
18
- .card {
19
- max-width: 620px;
20
- margin: 0 auto;
21
- padding: 16px;
22
- border: 1px solid lightgray;
23
- border-radius: 16px;
 
 
 
24
  }
25
 
26
- .card p:last-child {
27
- margin-bottom: 0;
 
 
28
  }
1
+ * {
2
+ margin: 0;
3
+ padding: 0;
4
+ box-sizing: border-box;
5
+ }
6
+
7
+ :root {
8
+ --primary-color: #2563eb;
9
+ --secondary-color: #1e40af;
10
+ --accent-color: #f59e0b;
11
+ --success-color: #10b981;
12
+ --warning-color: #f59e0b;
13
+ --error-color: #ef4444;
14
+
15
+ --bg-primary: #ffffff;
16
+ --bg-secondary: #f8fafc;
17
+ --bg-tertiary: #f1f5f9;
18
+
19
+ --text-primary: #1e293b;
20
+ --text-secondary: #64748b;
21
+ --text-muted: #94a3b8;
22
+
23
+ --border-color: #e2e8f0;
24
+ --shadow-sm: 0 1px 2px 0 rgb(0 0 0 / 0.05);
25
+ --shadow-md: 0 4px 6px -1px rgb(0 0 0 / 0.1), 0 2px 4px -2px rgb(0 0 0 / 0.1);
26
+ --shadow-lg: 0 10px 15px -3px rgb(0 0 0 / 0.1), 0 4px 6px -4px rgb(0 0 0 / 0.1);
27
+
28
+ --border-radius: 8px;
29
+ --border-radius-lg: 12px;
30
+
31
+ --font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
32
+ }
33
+
34
+ @media (prefers-color-scheme: dark) {
35
+ :root {
36
+ --bg-primary: #0f172a;
37
+ --bg-secondary: #1e293b;
38
+ --bg-tertiary: #334155;
39
+
40
+ --text-primary: #f8fafc;
41
+ --text-secondary: #cbd5e1;
42
+ --text-muted: #64748b;
43
+
44
+ --border-color: #334155;
45
+ }
46
+ }
47
+
48
  body {
49
+ font-family: var(--font-family);
50
+ background-color: var(--bg-secondary);
51
+ color: var(--text-primary);
52
+ line-height: 1.6;
53
+ }
54
+
55
+ .container {
56
+ max-width: 1400px;
57
+ margin: 0 auto;
58
+ padding: 20px;
59
+ }
60
+
61
+ .paper-header {
62
+ text-align: center;
63
+ margin-bottom: 40px;
64
+ padding: 30px 20px;
65
+ }
66
+
67
+ .paper-title {
68
+ font-size: 2.2rem;
69
+ font-weight: 700;
70
+ margin-bottom: 20px;
71
+ color: var(--text-primary);
72
+ line-height: 1.2;
73
+ }
74
+
75
+ .paper-authors {
76
+ color: var(--text-secondary);
77
+ }
78
+
79
+ .paper-authors p {
80
+ margin: 5px 0;
81
+ font-size: 1.1rem;
82
+ }
83
+
84
+ .affiliation {
85
+ font-style: italic;
86
+ margin-top: 10px;
87
+ }
88
+
89
+ .paper-links {
90
+ display: flex;
91
+ justify-content: center;
92
+ gap: 20px;
93
+ margin-top: 20px;
94
+ }
95
+
96
+ .paper-link {
97
+ display: inline-flex;
98
+ align-items: center;
99
+ gap: 8px;
100
+ padding: 10px 20px;
101
+ background-color: var(--primary-color);
102
+ color: white;
103
+ text-decoration: none;
104
+ border-radius: var(--border-radius);
105
+ font-weight: 500;
106
+ transition: all 0.3s ease;
107
+ box-shadow: var(--shadow-sm);
108
+ }
109
+
110
+ .paper-link:hover {
111
+ background-color: var(--secondary-color);
112
+ transform: translateY(-2px);
113
+ box-shadow: var(--shadow-md);
114
+ color: white;
115
+ text-decoration: none;
116
+ }
117
+
118
+ .diagram-section {
119
+ text-align: center;
120
+ margin-bottom: 40px;
121
+ padding: 20px;
122
+ background-color: var(--bg-primary);
123
+ border-radius: var(--border-radius-lg);
124
+ box-shadow: var(--shadow-md);
125
+ }
126
+
127
+ .diagram-image {
128
+ max-width: 100%;
129
+ height: auto;
130
+ border-radius: var(--border-radius);
131
+ margin-bottom: 20px;
132
+ }
133
+
134
+ .diagram-caption {
135
+ text-align: justify;
136
+ line-height: 1.6;
137
+ color: var(--text-secondary);
138
+ font-size: 1.1rem;
139
+ margin: 0 auto;
140
+ max-width: 1200px;
141
+ padding: 0 20px;
142
+ }
143
+
144
+ .chart-section {
145
+ margin-bottom: 40px;
146
+ padding: 20px;
147
+ background-color: var(--bg-primary);
148
+ border-radius: var(--border-radius-lg);
149
+ box-shadow: var(--shadow-md);
150
+ }
151
+
152
+ .section-title {
153
+ font-size: 1.8rem;
154
+ font-weight: 600;
155
+ margin-bottom: 20px;
156
+ color: var(--text-primary);
157
+ text-align: center;
158
+ }
159
+
160
+ .ranking-chart {
161
+ max-width: 100%;
162
+ height: auto;
163
+ display: block;
164
+ margin: 0 auto;
165
+ }
166
+
167
+ .leaderboard-section {
168
+ background-color: var(--bg-primary);
169
+ border-radius: var(--border-radius-lg);
170
+ padding: 20px;
171
+ box-shadow: var(--shadow-md);
172
+ }
173
+
174
+ .controls {
175
+ display: flex;
176
+ flex-wrap: wrap;
177
+ gap: 20px;
178
+ align-items: center;
179
+ justify-content: space-between;
180
+ margin-bottom: 30px;
181
+ padding: 25px;
182
+ background-color: var(--bg-primary);
183
+ border-radius: var(--border-radius-lg);
184
+ box-shadow: var(--shadow-md);
185
  }
186
 
187
+ .search-container {
188
+ display: flex;
189
+ align-items: center;
190
+ position: relative;
191
+ flex: 1;
192
+ min-width: 300px;
193
  }
194
 
195
+ .search-container i {
196
+ position: absolute;
197
+ left: 15px;
198
+ color: var(--text-muted);
199
+ z-index: 1;
200
  }
201
 
202
+ .search-input {
203
+ width: 100%;
204
+ padding: 12px 15px 12px 45px;
205
+ border: 2px solid var(--border-color);
206
+ border-radius: var(--border-radius);
207
+ font-size: 1rem;
208
+ background-color: var(--bg-secondary);
209
+ color: var(--text-primary);
210
+ transition: all 0.3s ease;
211
  }
212
 
213
+ .search-input:focus {
214
+ outline: none;
215
+ border-color: var(--primary-color);
216
+ box-shadow: 0 0 0 3px rgb(37 99 235 / 0.1);
217
  }
218
+
219
+ .filter-container {
220
+ display: flex;
221
+ align-items: center;
222
+ gap: 15px;
223
+ }
224
+
225
+ .filter-container label {
226
+ font-weight: 500;
227
+ color: var(--text-secondary);
228
+ }
229
+
230
+ .sort-select {
231
+ padding: 10px 15px;
232
+ border: 2px solid var(--border-color);
233
+ border-radius: var(--border-radius);
234
+ background-color: var(--bg-secondary);
235
+ color: var(--text-primary);
236
+ font-size: 1rem;
237
+ cursor: pointer;
238
+ transition: border-color 0.3s ease;
239
+ }
240
+
241
+ .sort-select:focus {
242
+ outline: none;
243
+ border-color: var(--primary-color);
244
+ }
245
+
246
+ .sort-btn {
247
+ padding: 10px 12px;
248
+ border: 2px solid var(--border-color);
249
+ border-radius: var(--border-radius);
250
+ background-color: var(--bg-secondary);
251
+ color: var(--text-secondary);
252
+ cursor: pointer;
253
+ transition: all 0.3s ease;
254
+ }
255
+
256
+ .sort-btn:hover {
257
+ background-color: var(--primary-color);
258
+ color: white;
259
+ border-color: var(--primary-color);
260
+ }
261
+
262
+ .table-container {
263
+ background-color: var(--bg-primary);
264
+ border-radius: var(--border-radius-lg);
265
+ overflow: hidden;
266
+ box-shadow: var(--shadow-lg);
267
+ margin-bottom: 30px;
268
+ }
269
+
270
+ .leaderboard-table {
271
+ width: 100%;
272
+ border-collapse: collapse;
273
+ font-size: 0.9rem;
274
+ }
275
+
276
+ .leaderboard-table thead {
277
+ background: linear-gradient(135deg, var(--bg-tertiary), var(--bg-secondary));
278
+ }
279
+
280
+ .leaderboard-table th,
281
+ .leaderboard-table td {
282
+ padding: 15px 12px;
283
+ text-align: left;
284
+ border-bottom: 1px solid var(--border-color);
285
+ }
286
+
287
+ .leaderboard-table th {
288
+ font-weight: 600;
289
+ color: var(--text-primary);
290
+ position: sticky;
291
+ top: 0;
292
+ background: var(--bg-tertiary);
293
+ z-index: 10;
294
+ }
295
+
296
+ .sortable {
297
+ cursor: pointer;
298
+ user-select: none;
299
+ transition: background-color 0.2s ease;
300
+ position: relative;
301
+ }
302
+
303
+ .sortable:hover {
304
+ background-color: var(--border-color);
305
+ }
306
+
307
+ .sort-icon {
308
+ margin-left: 8px;
309
+ opacity: 0.5;
310
+ transition: opacity 0.2s ease;
311
+ }
312
+
313
+ .sortable:hover .sort-icon {
314
+ opacity: 1;
315
+ }
316
+
317
+ .sortable.active .sort-icon {
318
+ opacity: 1;
319
+ color: var(--primary-color);
320
+ }
321
+
322
+ .rank-col {
323
+ width: 80px;
324
+ text-align: center;
325
+ }
326
+
327
+ .model-col {
328
+ width: 250px;
329
+ min-width: 200px;
330
+ }
331
+
332
+ .score-col {
333
+ width: 120px;
334
+ }
335
+
336
+ .metric-col {
337
+ width: 110px;
338
+ text-align: center;
339
+ }
340
+
341
+ .leaderboard-table tbody tr {
342
+ transition: all 0.2s ease;
343
+ }
344
+
345
+ .leaderboard-table tbody tr:hover {
346
+ background-color: var(--bg-secondary);
347
+ transform: translateY(-1px);
348
+ box-shadow: 0 4px 8px rgb(0 0 0 / 0.1);
349
+ }
350
+
351
+ .rank {
352
+ font-weight: 700;
353
+ font-size: 1.1rem;
354
+ color: var(--primary-color);
355
+ }
356
+
357
+ .rank.top-1 {
358
+ color: #fbbf24;
359
+ }
360
+
361
+ .rank.top-3 {
362
+ color: #f59e0b;
363
+ }
364
+
365
+ .rank.top-5 {
366
+ color: #10b981;
367
+ }
368
+
369
+ .model-name {
370
+ font-weight: 600;
371
+ color: var(--text-primary);
372
+ }
373
+
374
+ .score {
375
+ font-weight: 600;
376
+ padding: 6px 12px;
377
+ border-radius: var(--border-radius);
378
+ color: white;
379
+ text-align: center;
380
+ }
381
+
382
+ .score.excellent {
383
+ background: linear-gradient(135deg, #10b981, #059669);
384
+ }
385
+
386
+ .score.good {
387
+ background: linear-gradient(135deg, #3b82f6, #2563eb);
388
+ }
389
+
390
+ .score.average {
391
+ background: linear-gradient(135deg, #f59e0b, #d97706);
392
+ }
393
+
394
+ .score.poor {
395
+ background: linear-gradient(135deg, #ef4444, #dc2626);
396
+ }
397
+
398
+ .metric {
399
+ position: relative;
400
+ }
401
+
402
+ .metric-bar {
403
+ position: absolute;
404
+ left: 0;
405
+ top: 0;
406
+ height: 100%;
407
+ border-radius: var(--border-radius);
408
+ opacity: 0.1;
409
+ transition: opacity 0.3s ease;
410
+ }
411
+
412
+ .metric-bar.excellent {
413
+ background-color: var(--success-color);
414
+ }
415
+
416
+ .metric-bar.good {
417
+ background-color: var(--primary-color);
418
+ }
419
+
420
+ .metric-bar.average {
421
+ background-color: var(--warning-color);
422
+ }
423
+
424
+ .metric-bar.poor {
425
+ background-color: var(--error-color);
426
+ }
427
+
428
+ .leaderboard-table tbody tr:hover .metric-bar {
429
+ opacity: 0.2;
430
+ }
431
+
432
+ .loading {
433
+ text-align: center;
434
+ padding: 60px 20px;
435
+ color: var(--text-secondary);
436
+ font-size: 1.1rem;
437
+ display: none;
438
+ }
439
+
440
+ .loading.active {
441
+ display: block;
442
+ }
443
+
444
+ .loading i {
445
+ font-size: 2rem;
446
+ margin-bottom: 15px;
447
+ display: block;
448
+ color: var(--primary-color);
449
+ }
+
+ .citation-section {
+     margin-bottom: 40px;
+     padding: 20px;
+     background-color: var(--bg-primary);
+     border-radius: var(--border-radius-lg);
+     box-shadow: var(--shadow-md);
+ }
+
+ .citation-box {
+     position: relative;
+     background-color: var(--bg-secondary);
+     border: 1px solid var(--border-color);
+     border-radius: var(--border-radius);
+     padding: 20px;
+     margin-top: 20px;
+ }
+
+ .citation-text {
+     font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+     font-size: 0.85rem;
+     line-height: 1.4;
+     color: var(--text-primary);
+     margin: 0;
+     padding: 0;
+     white-space: pre-wrap;
+     overflow-x: auto;
+ }
+
+ .copy-citation-btn {
+     position: absolute;
+     top: 15px;
+     right: 15px;
+     background-color: var(--primary-color);
+     color: white;
+     border: none;
+     border-radius: var(--border-radius);
+     padding: 8px 12px;
+     font-size: 0.85rem;
+     cursor: pointer;
+     transition: all 0.3s ease;
+     display: flex;
+     align-items: center;
+     gap: 5px;
+ }
+
+ .copy-citation-btn:hover {
+     background-color: var(--secondary-color);
+     transform: translateY(-1px);
+ }
+
+ .copy-citation-btn:active {
+     transform: translateY(0);
+ }
+
+ .footer {
+     text-align: center;
+     padding: 30px 20px;
+     color: var(--text-muted);
+     background-color: var(--bg-primary);
+     border-radius: var(--border-radius-lg);
+     margin-top: 40px;
+ }
+
+ .footer p {
+     margin-bottom: 5px;
+ }
+
+ @media (max-width: 1200px) {
+     .table-container {
+         overflow-x: auto;
+     }
+
+     .leaderboard-table {
+         min-width: 1000px;
+     }
+ }
+
+ @media (max-width: 768px) {
+     .container {
+         padding: 15px;
+     }
+
+     .title {
+         font-size: 2rem;
+     }
+
+     .controls {
+         flex-direction: column;
+         align-items: stretch;
+     }
+
+     .search-container {
+         min-width: auto;
+     }
+
+     .filter-container {
+         justify-content: space-between;
+     }
+
+     .leaderboard-table th,
+     .leaderboard-table td {
+         padding: 10px 8px;
+         font-size: 0.8rem;
+     }
+
+     .model-col {
+         width: 180px;
+         min-width: 160px;
+     }
+
+     .metric-col {
+         width: 90px;
+     }
+ }
+
+ @media (max-width: 480px) {
+     .title {
+         font-size: 1.5rem;
+         flex-direction: column;
+         gap: 10px;
+     }
+
+     .leaderboard-table {
+         min-width: 800px;
+     }
+
+     .leaderboard-table th,
+     .leaderboard-table td {
+         padding: 8px 6px;
+     }
+ }
+
+ .no-results {
+     text-align: center;
+     padding: 60px 20px;
+     color: var(--text-muted);
+     font-size: 1.1rem;
+ }
+
+ .no-results i {
+     font-size: 3rem;
+     margin-bottom: 20px;
+     display: block;
+     opacity: 0.5;
+ }
+
+ @keyframes fadeIn {
+     from { opacity: 0; transform: translateY(10px); }
+     to { opacity: 1; transform: translateY(0); }
+ }
+
+ .leaderboard-table tbody tr {
+     animation: fadeIn 0.3s ease forwards;
+ }
+
+ .tooltip {
+     position: relative;
+     cursor: help;
+ }
+
+ .tooltip:hover::after {
+     content: attr(data-tooltip);
+     position: absolute;
+     bottom: 100%;
+     left: 50%;
+     transform: translateX(-50%);
+     background-color: var(--text-primary);
+     color: var(--bg-primary);
+     padding: 8px 12px;
+     border-radius: var(--border-radius);
+     font-size: 0.8rem;
+     white-space: nowrap;
+     z-index: 1000;
+     box-shadow: var(--shadow-md);
+ }