Spaces: mcpbench/mcp-bench

ztwang committed on Sep 8

Commit 4966301 (verified) · 1 parent: 7b6b43e

Upload 22 files

Files changed (23)

.gitattributes +2 -0
README.md +157 -10
app.py +45 -0
data.json +285 -0
index.html +143 -158
logos/.DS_Store +0 -0
logos/alibaba_logo.png +0 -0
logos/anthropic_logo.webp +0 -0
logos/aws_logo.png +0 -0
logos/google_logo.jpg +0 -0
logos/grok.png +0 -0
logos/kimi.png +0 -0
logos/meta_logo.png +0 -0
logos/meta_logo.svg +1 -0
logos/mistral_logo.png +0 -0
logos/oai_logo.webp +0 -0
logos/zhipu.png +0 -0
mcp-bench.png +3 -0
ranking.png +3 -0
requirements-hf.txt +1 -0
requirements.txt +1 -0
script.js +604 -0
style.css +615 -18

.gitattributes CHANGED

@@ -33,3 +33,5 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+mcp-bench.png filter=lfs diff=lfs merge=lfs -text
+ranking.png filter=lfs diff=lfs merge=lfs -text

README.md CHANGED

@@ -1,12 +1,159 @@
----
-title: Mcp Bench
-emoji: 🌍
-colorFrom: red
-colorTo: gray
-sdk: static
-pinned: false
-license: mit
-short_description: Benchmarking Tool-Using LLM Agents with Real-World Tasks via
 ---
-Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

+# MCP Benchmark Leaderboard
+A modern, interactive web application displaying performance metrics for various Large Language Models (LLMs) on the MCP (Model Context Protocol) benchmark.
+## 🏆 Features
+- **Interactive Leaderboard**: Sort by any metric column
+- **Real-time Search**: Filter models by name
+- **Responsive Design**: Optimized for desktop and mobile
+- **Visual Indicators**: Color-coded performance levels and progress bars
+- **Modern UI**: Clean, professional Material Design interface
+- **Dark Mode Support**: Automatic dark/light theme detection
+## 📊 Metrics Displayed
+The leaderboard shows comprehensive performance metrics:
+- **Overall Score**: Combined performance metric
+- **Valid Tool Schema**: Percentage of valid tool schemas
+- **Compliance**: Rule compliance percentage
+- **Task Success**: Task completion success rate
+- **Schema Understanding**: Understanding of tool schemas
+- **Task Completion**: Task completion effectiveness
+- **Tool Usage**: Tool utilization efficiency
+- **Planning Effectiveness**: Planning and execution quality
+## 🚀 Quick Start
+### Local Development
+1. Clone this repository
+2. Open `index.html` in your web browser
+3. Or serve using a local HTTP server:
+```bash
+# Using Python
+python -m http.server 8000
+# Using Node.js
+npx serve .
+# Using PHP
+php -S localhost:8000
+```
+### Hugging Face Spaces Deployment
+This project is optimized for deployment on Hugging Face Spaces:
+1. Create a new Space on [Hugging Face](https://huggingface.co/spaces)
+2. Choose **Gradio** as the SDK
+3. Upload all files to your Space
+4. Rename `requirements-hf.txt` to `requirements.txt`
+5. Your Space will automatically build and deploy
+The `app.py` file provides Gradio integration for Hugging Face Spaces compatibility.
+## 📁 Project Structure
+```
+mcp-bench-leaderboard/
+├── index.html          # Main HTML page
+├── style.css           # Responsive CSS styling
+├── script.js           # Interactive JavaScript functionality
+├── data.json           # Leaderboard data
+├── app.py              # Gradio app for HF Spaces
+├── requirements-hf.txt # Dependencies for HF deployment
+└── README.md           # Documentation
+```
+## 🎨 Customization
+### Update Data
+Modify `data.json` to add new models or update scores:
+```json
+{
+  "lastUpdated": "2025-09-05",
+  "models": [
+    {
+      "name": "your-model-name",
+      "overall_score": 0.750,
+      "valid_tool_schema": 99.5,
+      "compliance": 98.2,
+      // ... other metrics
+    }
+  ]
+}
+```
+### Styling
+Edit `style.css` to customize:
+- Colors and themes
+- Layout and spacing
+- Responsive breakpoints
+- Animation effects
+### Functionality
+Extend `script.js` to add:
+- New sorting algorithms
+- Additional filtering options
+- Export functionality
+- Chart visualizations
+## 🌐 Browser Support
+- Chrome 60+
+- Firefox 55+
+- Safari 12+
+- Edge 79+
+## 📱 Mobile Compatibility
+The application is fully responsive and optimized for:
+- Tablets (768px - 1024px)
+- Mobile phones (320px - 767px)
+- Large screens (1200px+)
+## 🔧 Technical Details
+- **Pure Frontend**: No backend dependencies
+- **Vanilla JavaScript**: No frameworks required
+- **Modern CSS**: Flexbox, Grid, CSS Variables
+- **Progressive Enhancement**: Static sections render without JavaScript; the leaderboard table is generated by JavaScript
+- **SEO Friendly**: Semantic HTML structure
+## 📈 Performance
+- Lightweight (~50KB total)
+- Fast loading times
+- Optimized images and assets
+- Efficient DOM updates
+- Smooth animations
+## 🤝 Contributing
+1. Fork the repository
+2. Create a feature branch
+3. Make your changes
+4. Test across browsers
+5. Submit a pull request
+## 📄 License
+This project is open source and available under the MIT License.
+## 🙏 Acknowledgments
+- Data sourced from MCP Benchmark Results
+- Icons from Font Awesome
+- Fonts from Google Fonts
+- Hosted on Hugging Face Spaces
 ---
+*Last updated: September 2025*
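
Note on the README's "Update Data" step: the committed `data.json` keeps its entries in ascending `overall_score` order and carries a `lastUpdated` date, so a manual edit has to maintain both. The sketch below shows one way to do that in Python. The function name `add_model` and the replace-by-name behaviour are assumptions for illustration, not part of this commit.

```python
from datetime import date


def add_model(data: dict, entry: dict) -> dict:
    """Return a copy of `data` with `entry` added, replacing any entry of the same name.

    Hypothetical helper: keeps models in ascending overall_score order,
    as in the committed data.json, and refreshes lastUpdated.
    """
    models = [m for m in data["models"] if m["name"] != entry["name"]]
    models.append(entry)
    models.sort(key=lambda m: m["overall_score"])
    return {**data, "lastUpdated": date.today().isoformat(), "models": models}


if __name__ == "__main__":
    # Two scores taken from data.json; "your-model-name" is the README's placeholder.
    data = {
        "lastUpdated": "2025-09-05",
        "models": [
            {"name": "o3", "overall_score": 0.715},
            {"name": "gpt-5", "overall_score": 0.749},
        ],
    }
    updated = add_model(data, {"name": "your-model-name", "overall_score": 0.750})
    print([m["name"] for m in updated["models"]])
```

To apply it to the real file, load `data.json` with `json.load`, call the helper, and write the result back with `json.dump(..., indent=2)`. A real entry would also carry the other metric keys used in the file.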

app.py ADDED

	@@ -0,0 +1,45 @@

+import gradio as gr
+def create_gradio_app():
+    """
+    Simple Gradio app to serve the static HTML leaderboard
+    This is required for Hugging Face Spaces deployment
+    """
+    # Read the HTML content
+    with open('index.html', 'r', encoding='utf-8') as f:
+        html_content = f.read()
+    # Read the CSS content
+    with open('style.css', 'r', encoding='utf-8') as f:
+        css_content = f.read()
+    # Read the JavaScript content
+    with open('script.js', 'r', encoding='utf-8') as f:
+        js_content = f.read()
+    # Combine everything into a single HTML page
+    combined_html = html_content.replace(
+        '<link rel="stylesheet" href="style.css">',
+        f'<style>{css_content}</style>'
+    ).replace(
+        '<script src="script.js"></script>',
+        f'<script>{js_content}</script>'
+    )
+    # Create the Gradio interface
+    with gr.Blocks(
+        title="MCP Benchmark Leaderboard",
+        theme=gr.themes.Soft(),
+    ) as demo:
+        gr.HTML(
+            combined_html,
+            elem_id="leaderboard-container"
+        )
+    return demo
+if __name__ == "__main__":
+    demo = create_gradio_app()
+    demo.launch()
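
Note on the inlining step in `create_gradio_app`: the two `str.replace` calls match the `<link>` and `<script>` tags character for character, so a difference in spacing or attribute order in `index.html` leaves the tag untouched without raising an error. The sketch below performs the same substitution as a standalone function that fails loudly instead. The name `inline_assets` and the choice of `ValueError` are assumptions for illustration, not part of this commit.

```python
# Hypothetical helper (not part of the commit): same inlining as app.py,
# but raising instead of silently skipping a tag that does not match.
CSS_TAG = '<link rel="stylesheet" href="style.css">'
JS_TAG = '<script src="script.js"></script>'


def inline_assets(html: str, css: str, js: str) -> str:
    """Return `html` with the stylesheet and script tags replaced by inline content."""
    for tag in (CSS_TAG, JS_TAG):
        if tag not in html:
            raise ValueError(f"expected tag not found in HTML: {tag}")
    return html.replace(CSS_TAG, f"<style>{css}</style>").replace(
        JS_TAG, f"<script>{js}</script>"
    )


if __name__ == "__main__":
    page = f"<head>{CSS_TAG}</head><body>{JS_TAG}</body>"
    print(inline_assets(page, "body{margin:0}", "console.log('ok')"))
```

Keeping the substitution separate from file reading and from the Gradio setup also makes it testable without launching the app.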

data.json ADDED

	@@ -0,0 +1,285 @@

+{
+  "lastUpdated": "2025-09-05",
+  "models": [
+    {
+      "name": "llama-3-1-8b-instruct",
+      "overall_score": 0.428,
+      "valid_tool_schema": 96.1,
+      "compliance": 89.4,
+      "task_success": 90.9,
+      "schema_understanding": 0.261,
+      "task_completion": 0.295,
+      "tool_usage": 0.352,
+      "planning_effectiveness": 0.310,
+      "task_information": 0.221,
+      "tool_parameter": 0.141,
+      "dependency": 0.428
+    },
+    {
+      "name": "llama-3-2-90b-vision-instruct",
+      "overall_score": 0.495,
+      "valid_tool_schema": 99.6,
+      "compliance": 85.0,
+      "task_success": 90.9,
+      "schema_understanding": 0.293,
+      "task_completion": 0.444,
+      "tool_usage": 0.515,
+      "planning_effectiveness": 0.427,
+      "task_information": 0.267,
+      "tool_parameter": 0.173,
+      "dependency": 0.495
+    },
+    {
+      "name": "nova-micro-v1",
+      "overall_score": 0.508,
+      "valid_tool_schema": 96.0,
+      "compliance": 93.1,
+      "task_success": 87.8,
+      "schema_understanding": 0.339,
+      "task_completion": 0.419,
+      "tool_usage": 0.504,
+      "planning_effectiveness": 0.428,
+      "task_information": 0.315,
+      "tool_parameter": 0.212,
+      "dependency": 0.508
+    },
+    {
+      "name": "llama-3-1-70b-instruct",
+      "overall_score": 0.510,
+      "valid_tool_schema": 99.2,
+      "compliance": 90.5,
+      "task_success": 92.5,
+      "schema_understanding": 0.314,
+      "task_completion": 0.432,
+      "tool_usage": 0.523,
+      "planning_effectiveness": 0.451,
+      "task_information": 0.287,
+      "tool_parameter": 0.191,
+      "dependency": 0.510
+    },
+    {
+      "name": "mistral-small-2503",
+      "overall_score": 0.530,
+      "valid_tool_schema": 96.4,
+      "compliance": 95.6,
+      "task_success": 86.2,
+      "schema_understanding": 0.373,
+      "task_completion": 0.445,
+      "tool_usage": 0.537,
+      "planning_effectiveness": 0.446,
+      "task_information": 0.349,
+      "tool_parameter": 0.232,
+      "dependency": 0.530
+    },
+    {
+      "name": "gpt-4o-mini",
+      "overall_score": 0.557,
+      "valid_tool_schema": 97.5,
+      "compliance": 98.1,
+      "task_success": 93.9,
+      "schema_understanding": 0.374,
+      "task_completion": 0.500,
+      "tool_usage": 0.555,
+      "planning_effectiveness": 0.544,
+      "task_information": 0.352,
+      "tool_parameter": 0.201,
+      "dependency": 0.557
+    },
+    {
+      "name": "llama-3-3-70b-instruct",
+      "overall_score": 0.558,
+      "valid_tool_schema": 99.5,
+      "compliance": 93.8,
+      "task_success": 91.6,
+      "schema_understanding": 0.349,
+      "task_completion": 0.493,
+      "tool_usage": 0.583,
+      "planning_effectiveness": 0.525,
+      "task_information": 0.355,
+      "tool_parameter": 0.262,
+      "dependency": 0.558
+    },
+    {
+      "name": "gemma-3-27b-it",
+      "overall_score": 0.582,
+      "valid_tool_schema": 98.8,
+      "compliance": 97.6,
+      "task_success": 94.4,
+      "schema_understanding": 0.378,
+      "task_completion": 0.530,
+      "tool_usage": 0.608,
+      "planning_effectiveness": 0.572,
+      "task_information": 0.383,
+      "tool_parameter": 0.249,
+      "dependency": 0.582
+    },
+    {
+      "name": "gpt-4o",
+      "overall_score": 0.595,
+      "valid_tool_schema": 98.9,
+      "compliance": 98.3,
+      "task_success": 92.8,
+      "schema_understanding": 0.394,
+      "task_completion": 0.542,
+      "tool_usage": 0.627,
+      "planning_effectiveness": 0.587,
+      "task_information": 0.405,
+      "tool_parameter": 0.272,
+      "dependency": 0.595
+    },
+    {
+      "name": "gemini-2.5-flash-lite",
+      "overall_score": 0.598,
+      "valid_tool_schema": 99.4,
+      "compliance": 97.8,
+      "task_success": 94.3,
+      "schema_understanding": 0.412,
+      "task_completion": 0.577,
+      "tool_usage": 0.627,
+      "planning_effectiveness": 0.597,
+      "task_information": 0.404,
+      "tool_parameter": 0.226,
+      "dependency": 0.598
+    },
+    {
+      "name": "qwen3-30b-a3b-instruct-2507",
+      "overall_score": 0.627,
+      "valid_tool_schema": 99.0,
+      "compliance": 98.4,
+      "task_success": 92.3,
+      "schema_understanding": 0.481,
+      "task_completion": 0.530,
+      "tool_usage": 0.658,
+      "planning_effectiveness": 0.638,
+      "task_information": 0.473,
+      "tool_parameter": 0.303,
+      "dependency": 0.627
+    },
+    {
+      "name": "kimi-k2",
+      "overall_score": 0.629,
+      "valid_tool_schema": 98.8,
+      "compliance": 98.1,
+      "task_success": 94.5,
+      "schema_understanding": 0.502,
+      "task_completion": 0.577,
+      "tool_usage": 0.631,
+      "planning_effectiveness": 0.623,
+      "task_information": 0.448,
+      "tool_parameter": 0.307,
+      "dependency": 0.629
+    },
+    {
+      "name": "gpt-oss-20b",
+      "overall_score": 0.654,
+      "valid_tool_schema": 98.8,
+      "compliance": 99.1,
+      "task_success": 93.6,
+      "schema_understanding": 0.547,
+      "task_completion": 0.623,
+      "tool_usage": 0.661,
+      "planning_effectiveness": 0.638,
+      "task_information": 0.509,
+      "tool_parameter": 0.309,
+      "dependency": 0.654
+    },
+    {
+      "name": "glm-4.5",
+      "overall_score": 0.668,
+      "valid_tool_schema": 99.7,
+      "compliance": 99.7,
+      "task_success": 97.4,
+      "schema_understanding": 0.525,
+      "task_completion": 0.682,
+      "tool_usage": 0.680,
+      "planning_effectiveness": 0.661,
+      "task_information": 0.523,
+      "tool_parameter": 0.297,
+      "dependency": 0.668
+    },
+    {
+      "name": "qwen3-235b-a22b-2507",
+      "overall_score": 0.678,
+      "valid_tool_schema": 99.1,
+      "compliance": 99.3,
+      "task_success": 94.8,
+      "schema_understanding": 0.549,
+      "task_completion": 0.625,
+      "tool_usage": 0.688,
+      "planning_effectiveness": 0.712,
+      "task_information": 0.542,
+      "tool_parameter": 0.355,
+      "dependency": 0.678
+    },
+    {
+      "name": "claude-sonnet-4",
+      "overall_score": 0.681,
+      "valid_tool_schema": 100.0,
+      "compliance": 99.8,
+      "task_success": 98.8,
+      "schema_understanding": 0.554,
+      "task_completion": 0.676,
+      "tool_usage": 0.689,
+      "planning_effectiveness": 0.671,
+      "task_information": 0.541,
+      "tool_parameter": 0.328,
+      "dependency": 0.681
+    },
+    {
+      "name": "gemini-2.5-pro",
+      "overall_score": 0.690,
+      "valid_tool_schema": 99.4,
+      "compliance": 99.6,
+      "task_success": 96.9,
+      "schema_understanding": 0.562,
+      "task_completion": 0.725,
+      "tool_usage": 0.717,
+      "planning_effectiveness": 0.670,
+      "task_information": 0.541,
+      "tool_parameter": 0.329,
+      "dependency": 0.690
+    },
+    {
+      "name": "gpt-oss-120b",
+      "overall_score": 0.692,
+      "valid_tool_schema": 97.7,
+      "compliance": 98.8,
+      "task_success": 94.0,
+      "schema_understanding": 0.636,
+      "task_completion": 0.705,
+      "tool_usage": 0.691,
+      "planning_effectiveness": 0.661,
+      "task_information": 0.576,
+      "tool_parameter": 0.329,
+      "dependency": 0.692
+    },
+    {
+      "name": "o3",
+      "overall_score": 0.715,
+      "valid_tool_schema": 99.3,
+      "compliance": 99.9,
+      "task_success": 97.1,
+      "schema_understanding": 0.641,
+      "task_completion": 0.706,
+      "tool_usage": 0.724,
+      "planning_effectiveness": 0.726,
+      "task_information": 0.592,
+      "tool_parameter": 0.359,
+      "dependency": 0.715
+    },
+    {
+      "name": "gpt-5",
+      "overall_score": 0.749,
+      "valid_tool_schema": 100.0,
+      "compliance": 99.3,
+      "task_success": 99.1,
+      "schema_understanding": 0.677,
+      "task_completion": 0.828,
+      "tool_usage": 0.767,
+      "planning_effectiveness": 0.749,
+      "task_information": 0.649,
+      "tool_parameter": 0.339,
+      "dependency": 0.749
+    }
+  ]
+}
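
Note on the data above: in every entry the `dependency` value is identical to that entry's `overall_score`, which may be a copy error rather than a measured result. The sketch below flags that pattern, checks that each entry has the keys used in this file, and lists models from best to worst. The function names are hypothetical, and `REQUIRED_KEYS` simply lists the keys seen in the committed `data.json`.

```python
# Keys present on every entry in the committed data.json.
REQUIRED_KEYS = {
    "name", "overall_score", "valid_tool_schema", "compliance",
    "task_success", "schema_understanding", "task_completion",
    "tool_usage", "planning_effectiveness", "task_information",
    "tool_parameter", "dependency",
}


def check_models(models):
    """Return a list of human-readable problems found in the model entries."""
    problems = []
    for m in models:
        name = m.get("name", "<unnamed>")
        missing = REQUIRED_KEYS - set(m)
        if missing:
            problems.append(f"{name}: missing {sorted(missing)}")
            continue
        if m["dependency"] == m["overall_score"]:
            problems.append(f"{name}: dependency equals overall_score")
    return problems


def ranked(models):
    """Return model names ordered from highest to lowest overall_score."""
    ordered = sorted(models, key=lambda m: m["overall_score"], reverse=True)
    return [m["name"] for m in ordered]


if __name__ == "__main__":
    # One entry copied from data.json above, used as inline sample input.
    sample = [{
        "name": "gpt-5", "overall_score": 0.749, "valid_tool_schema": 100.0,
        "compliance": 99.3, "task_success": 99.1, "schema_understanding": 0.677,
        "task_completion": 0.828, "tool_usage": 0.767,
        "planning_effectiveness": 0.749, "task_information": 0.649,
        "tool_parameter": 0.339, "dependency": 0.749,
    }]
    print(check_models(sample))
    print(ranked(sample))
```

Running the same checks on the full file only requires loading it with `json.load` and passing `data["models"]` to both functions.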

index.html CHANGED

@@ -3,176 +3,161 @@
 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
-    <title>MCP-BENCH: Benchmarking Tool-Using LLM Agents</title>
-    <script src="https://cdn.tailwindcss.com"></script>
-    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@400;500;600;700&display=swap" rel="stylesheet">
-    <style>
-        body {
-            font-family: 'Inter', sans-serif;
-        }
-        .gradient-text {
-            background: linear-gradient(to right, #4f46e5, #ec4899);
-            -webkit-background-clip: text;
-            -webkit-text-fill-color: transparent;
-        }
-        .table-hover tr:hover {
-            background-color: #f9fafb;
-        }
-    </style>
 </head>
-<body class="bg-gray-50 text-gray-800">
-    <div class="container mx-auto px-4 py-8 md:py-16 max-w-5xl">
-        <!-- Header Section -->
-        <header class="text-center mb-12">
-            <h1 class="text-4xl md:text-5xl font-bold text-gray-900 mb-2">
-                MCP-BENCH
-            </h1>
-            <h2 class="text-lg md:text-xl text-gray-600">
-                Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers
-            </h2>
         </header>
-        <!-- Leaderboard Section -->
-        <section id="leaderboard" class="mb-16">
-            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8 gradient-text">Leaderboard</h3>
-            <div class="overflow-x-auto bg-white rounded-lg shadow-lg">
-                <table class="min-w-full text-sm text-left text-gray-600">
-                    <thead class="bg-gray-100 text-xs text-gray-700 uppercase tracking-wider">
-                        <tr>
-                            <th scope="col" class="px-6 py-3 font-semibold">Rank</th>
-                            <th scope="col" class="px-6 py-3 font-semibold">Model</th>
-                            <th scope="col" class="px-6 py-3 font-semibold text-center">Overall Score</th>
-                            <th scope="col" class="px-6 py-3 font-semibold text-center">Task Fulfillment</th>
-                            <th scope="col" class="px-6 py-3 font-semibold text-center">Graph Exact Match</th>
-                        </tr>
-                    </thead>
-                    <!--
-                        LEADERBOARD DATA
-                        To update the leaderboard, edit the rows (<tr>...</tr>) below.
-                        Each row represents a model. The columns are:
-                        1. Rank (#)
-                        2. Model Name
-                        3. Overall Score
-                        4. Task Fulfillment (LLM Judge)
-                        5. Graph Exact Match
-                        Make sure the data is sorted by the Overall Score in descending order.
-                    -->
-                    <tbody class="divide-y divide-gray-200 table-hover">
-                        <!-- Rank 1 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">1</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">GPT-4o-mini</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.691</td>
-                            <td class="px-6 py-4 text-center">0.77</td>
-                            <td class="px-6 py-4 text-center">52.4%</td>
-                        </tr>
-                        <!-- Rank 2 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">2</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">Qwen-3-32b</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.631</td>
-                            <td class="px-6 py-4 text-center">0.57</td>
-                            <td class="px-6 py-4 text-center">47.8%</td>
-                        </tr>
-                        <!-- Rank 3 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">3</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">DeepSeek-R1-Qwen-32b</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.587</td>
-                            <td class="px-6 py-4 text-center">0.52</td>
-                            <td class="px-6 py-4 text-center">43.5%</td>
-                        </tr>
-                        <!-- Rank 4 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">4</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">Mistral-small-2403</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.552</td>
-                            <td class="px-6 py-4 text-center">0.49</td>
-                            <td class="px-6 py-4 text-center">30.4%</td>
-                        </tr>
-                        <!-- Rank 5 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">5</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-70b</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.542</td>
-                            <td class="px-6 py-4 text-center">0.50</td>
-                            <td class="px-6 py-4 text-center">21.7%</td>
-                        </tr>
-                         <!-- Rank 6 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">6</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3.1-8b</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.483</td>
-                            <td class="px-6 py-4 text-center">0.43</td>
-                            <td class="px-6 py-4 text-center">26.1%</td>
-                        </tr>
-                         <!-- Rank 7 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">7</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">Mistral-7b-v0.3</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.423</td>
-                            <td class="px-6 py-4 text-center">0.50</td>
-                            <td class="px-6 py-4 text-center">0.0%</td>
-                        </tr>
-                         <!-- Rank 8 -->
-                        <tr class="border-b border-gray-200">
-                            <td class="px-6 py-4 font-bold text-lg text-gray-900">8</td>
-                            <td class="px-6 py-4 font-semibold text-gray-900">LLaMA-3-8b</td>
-                            <td class="px-6 py-4 font-semibold text-center text-indigo-600">0.395</td>
-                            <td class="px-6 py-4 text-center">0.51</td>
-                            <td class="px-6 py-4 text-center">4.5%</td>
-                        </tr>
-                    </tbody>
-                </table>
-            </div>
-            <p class="text-xs text-gray-500 text-center mt-4">Leaderboard data from Table 1 of the MCP-BENCH paper. Last updated: August 7, 2025.</p>
         </section>
-        <!-- Abstract Section -->
-        <section id="abstract" class="mb-16">
-            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Abstract</h3>
-            <div class="bg-white p-8 rounded-lg shadow-md">
-                <p class="text-gray-700 leading-relaxed">
-                    We introduce MCP-Bench, a new benchmark designed to evaluate large language models (LLMs) on realistic, multi-step tasks that require tool use, cross-tool coordination, and precise parameter control. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 31 live MCP servers spanning diverse real-world domains such as weather forecasting, stock analysis, scientific computing, and academic search. Tasks are structured as layered dependency graphs involving tools from one or more servers, testing an agent's ability to interpret tool schemas, plan coherent execution traces, retrieve relevant tools, and fill parameters with high structural and semantic fidelity. Unlike existing benchmarks, MCP-Bench targets real-world tool-use scenarios with complex input-output dependencies, diverse tool schemas, and multi-step reasoning requirements. We develop a multi-faceted evaluation framework that measures task success, tool-level execution accuracy, and alignment with ground-truth execution graphs. This includes metrics for tool name validity, schema compliance, graph exact match, structure-aware move distance, and semantic quality assessed by LLM-as-a-Judge. Experiments across 13 advanced LLMs—including GPT-4o, Claude 3, and LLaMA 3.1—reveal persistent challenges in long-horizon planning, tool reuse, and multi-server coordination. We release MCP-Bench, along with its evaluation toolkit, baseline results, and data synthesis pipeline, to enable robust and reproducible evaluation of agentic LLMs and to support future research on structured and scalable tool-based reasoning.
-                </p>
-            </div>
         </section>
-        <!-- Links Section -->
-        <section id="links" class="text-center mb-16">
-             <div class="flex justify-center items-center space-x-4">
-                 <a href="#" onclick="alert('Paper download link not available yet.'); return false;" class="inline-block bg-indigo-600 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-indigo-700 transition-colors duration-300">
-                     Download Paper (PDF)
-                 </a>
-                 <a href="#" onclick="alert('Code repository link not available yet.'); return false;" class="inline-block bg-gray-700 text-white font-semibold px-6 py-3 rounded-lg shadow-md hover:bg-gray-800 transition-colors duration-300">
-                     View Code on GitHub
-                 </a>
-             </div>
         </section>
         <!-- Citation Section -->
-        <section id="citation">
-            <h3 class="text-2xl md:text-3xl font-bold text-center mb-8">Citation</h3>
-            <div class="bg-gray-200 p-6 rounded-lg shadow-inner">
-                <pre class="text-sm text-gray-800 whitespace-pre-wrap break-words"><code>@misc{mcpbench2025,
-    title={{MCP-BENCH: Benchmarking Tool-Using LLM Agents with Real-World Tasks via MCP Servers}},
-    author={Your Name and Co-authors},
-    year={2025},
-    eprint={},
-    archivePrefix={arXiv},
-    primaryClass={cs.CL}
-}</code></pre>
             </div>
         </section>
     </div>
-    <!-- Footer -->
-    <footer class="text-center py-6 bg-gray-100 border-t border-gray-200">
-        <p class="text-sm text-gray-500">&copy; 2025 MCP-BENCH Project. All Rights Reserved.</p>
-    </footer>
 </body>
-</html>

 <head>
     <meta charset="UTF-8">
     <meta name="viewport" content="width=device-width, initial-scale=1.0">
+    <title>MCP Benchmark Leaderboard</title>
+    <link rel="stylesheet" href="style.css">
+    <link href="https://fonts.googleapis.com/css2?family=Inter:wght@300;400;500;600;700&display=swap" rel="stylesheet">
+    <link href="https://cdnjs.cloudflare.com/ajax/libs/font-awesome/6.0.0/css/all.min.css" rel="stylesheet">
 </head>
+<body>
+    <div class="container">
+        <!-- Paper Information -->
+        <header class="paper-header">
+            <h1 class="paper-title">MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers</h1>
+            <div class="paper-authors">
+                <p>Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow</p>
+                <p class="affiliation">Accenture, UC Berkeley</p>
+            </div>
+            <div class="paper-links">
+                <a href="https://github.com/Accenture/mcp-bench" class="paper-link">
+                    <i class="fab fa-github"></i> GitHub
+                </a>
+                <a href="https://arxiv.org/abs/2508.20453" class="paper-link">
+                    <i class="fas fa-file-pdf"></i> Paper
+                </a>
+                <a href="#leaderboard" class="paper-link">
+                    <i class="fas fa-trophy"></i> Leaderboard
+                </a>
+            </div>
         </header>
+        <!-- MCP Diagram -->
+        <section class="diagram-section">
+            <img src="mcp-bench.png" alt="MCP-Bench Architecture Diagram" class="diagram-image">
+            <p class="diagram-caption">
+                MCP-Bench is a comprehensive evaluation framework designed to assess Large Language Models' (LLMs) capabilities in tool-use scenarios through the Model Context Protocol (MCP). This benchmark provides an end-to-end pipeline for evaluating how effectively different LLMs can discover, select, and utilize tools to solve real-world tasks.
+            </p>
         </section>
+        <!-- Ranking Chart -->
+        <section class="chart-section">
+            <h2 class="section-title">Performance Ranking</h2>
+            <img src="ranking.png" alt="MCP Benchmark Ranking Chart" class="ranking-chart">
         </section>
+        <!-- Leaderboard Header -->
+        <section class="leaderboard-section" id="leaderboard">
+            <h2 class="section-title">Detailed Results</h2>
+        <div class="controls">
+            <div class="search-container">
+                <i class="fas fa-search"></i>
+                <input type="text" id="searchInput" placeholder="Search models..." class="search-input">
+            </div>
+            <div class="filter-container">
+                <label for="sortSelect">Sort by:</label>
+                <select id="sortSelect" class="sort-select">
+                    <option value="overall_score">Overall Score</option>
+                    <option value="valid_tool_schema">Valid Tool Schema</option>
+                    <option value="compliance">Compliance</option>
+                    <option value="task_success">Task Success</option>
+                    <option value="schema_understanding">Schema Understanding</option>
+                    <option value="task_completion">Task Completion</option>
+                    <option value="tool_usage">Tool Usage</option>
+                    <option value="planning_effectiveness">Planning Effectiveness</option>
+                </select>
+                <button id="sortOrder" class="sort-btn" title="Toggle sort order">
+                    <i class="fas fa-sort-amount-down"></i>
+                </button>
+            </div>
+        </div>
+        <div class="table-container">
+            <table class="leaderboard-table" id="leaderboardTable">
+                <thead>
+                    <tr>
+                        <th class="model-col sortable" data-column="name">
+                            <strong>Model</strong>
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="score-col sortable" data-column="overall_score">
+                            <strong>Overall Score</strong>
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="valid_tool_name_rate">
+                            Valid Tool<br>Name Rate
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="schema_compliance">
+                            Schema<br>Compliance
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="execution_success">
+                            Execution<br>Success
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="task_fulfillment">
+                            Task<br>Fulfillment
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="information_grounding">
+                            Information<br>Grounding
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="tool_appropriateness">
+                            Tool<br>Appropriateness
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="parameter_accuracy">
+                            Parameter<br>Accuracy
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="dependency_awareness">
+                            Dependency<br>Awareness
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                        <th class="metric-col sortable" data-column="parallelism_efficiency">
+                            Parallelism<br>and Efficiency
+                            <i class="fas fa-sort sort-icon"></i>
+                        </th>
+                    </tr>
+                </thead>
+                <tbody id="tableBody">
+                    <!-- Table rows will be generated by JavaScript -->
+                </tbody>
+            </table>
+        </div>
+        <div class="loading" id="loading">
+            <i class="fas fa-spinner fa-spin"></i>
+            Loading leaderboard data...
+        </div>
         </section>
         <!-- Citation Section -->
+        <section class="citation-section">
+            <h2 class="section-title">Citation</h2>
+            <div class="citation-box">
+                <pre class="citation-text">@article{wang2024mcpbench,
+  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
+  author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
+  journal={arXiv preprint arXiv:2508.20453},
+  year={2025}
+}</pre>
+                <button class="copy-citation-btn" onclick="copyCitation()">
+                    <i class="fas fa-copy"></i> Copy Citation
+                </button>
             </div>
         </section>
+        <footer class="footer">
+            <p>Last updated: <span id="lastUpdated"></span></p>
+            <p>Data source: MCP-Bench Results (ArXiv: 2508.20453)</p>
+        </footer>
     </div>
+    <script src="script.js"></script>
 </body>
+</html>

logos/.DS_Store ADDED Viewed

Binary file (6.15 kB)

logos/alibaba_logo.png ADDED Viewed

logos/anthropic_logo.webp ADDED Viewed

logos/aws_logo.png ADDED Viewed

logos/google_logo.jpg ADDED Viewed

logos/grok.png ADDED Viewed

logos/kimi.png ADDED Viewed

logos/meta_logo.png ADDED Viewed

logos/meta_logo.svg ADDED Viewed

logos/mistral_logo.png ADDED Viewed

logos/oai_logo.webp ADDED Viewed

logos/zhipu.png ADDED Viewed

mcp-bench.png ADDED Viewed

Git LFS Details

SHA256: 2554ee1cf9dee51779e69966d7cf9449d9cd58fa3391bfeb764c296269125744
Pointer size: 131 Bytes
Size of remote file: 477 kB

ranking.png ADDED Viewed

Git LFS Details

SHA256: 34772d568e20d96875d8b9d8dabf4623070e1f85fa11f037c208cfb24163ddb8
Pointer size: 131 Bytes
Size of remote file: 545 kB

requirements-hf.txt ADDED Viewed

@@ -0,0 +1 @@
+gradio>=4.0.0

requirements.txt ADDED Viewed

@@ -0,0 +1 @@
+gradio

script.js ADDED Viewed

@@ -0,0 +1,604 @@
+const LEADERBOARD_DATA = {
+    "lastUpdated": "2025-09-05",
+    "models": [
+        {
+            "name": "llama-3-1-8b-instruct",
+            "valid_tool_name_rate": 96.1,
+            "schema_compliance": 89.4,
+            "execution_success": 90.9,
+            "task_fulfillment": 0.261,
+            "information_grounding": 0.295,
+            "tool_appropriateness": 0.352,
+            "parameter_accuracy": 0.310,
+            "dependency_awareness": 0.221,
+            "parallelism_efficiency": 0.141,
+            "overall_score": 0.428
+        },
+        {
+            "name": "llama-3-2-90b-vision-instruct",
+            "valid_tool_name_rate": 99.6,
+            "schema_compliance": 85.0,
+            "execution_success": 90.9,
+            "task_fulfillment": 0.293,
+            "information_grounding": 0.444,
+            "tool_appropriateness": 0.515,
+            "parameter_accuracy": 0.427,
+            "dependency_awareness": 0.267,
+            "parallelism_efficiency": 0.173,
+            "overall_score": 0.495
+        },
+        {
+            "name": "nova-micro-v1",
+            "valid_tool_name_rate": 96.0,
+            "schema_compliance": 93.1,
+            "execution_success": 87.8,
+            "task_fulfillment": 0.339,
+            "information_grounding": 0.419,
+            "tool_appropriateness": 0.504,
+            "parameter_accuracy": 0.428,
+            "dependency_awareness": 0.315,
+            "parallelism_efficiency": 0.212,
+            "overall_score": 0.508
+        },
+        {
+            "name": "llama-3-1-70b-instruct",
+            "valid_tool_name_rate": 99.2,
+            "schema_compliance": 90.5,
+            "execution_success": 92.5,
+            "task_fulfillment": 0.314,
+            "information_grounding": 0.432,
+            "tool_appropriateness": 0.523,
+            "parameter_accuracy": 0.451,
+            "dependency_awareness": 0.287,
+            "parallelism_efficiency": 0.191,
+            "overall_score": 0.510
+        },
+        {
+            "name": "mistral-small-2503",
+            "valid_tool_name_rate": 96.4,
+            "schema_compliance": 95.6,
+            "execution_success": 86.2,
+            "task_fulfillment": 0.373,
+            "information_grounding": 0.445,
+            "tool_appropriateness": 0.537,
+            "parameter_accuracy": 0.446,
+            "dependency_awareness": 0.349,
+            "parallelism_efficiency": 0.232,
+            "overall_score": 0.530
+        },
+        {
+            "name": "gpt-4o-mini",
+            "valid_tool_name_rate": 97.5,
+            "schema_compliance": 98.1,
+            "execution_success": 93.9,
+            "task_fulfillment": 0.374,
+            "information_grounding": 0.500,
+            "tool_appropriateness": 0.555,
+            "parameter_accuracy": 0.544,
+            "dependency_awareness": 0.352,
+            "parallelism_efficiency": 0.201,
+            "overall_score": 0.557
+        },
+        {
+            "name": "llama-3-3-70b-instruct",
+            "valid_tool_name_rate": 99.5,
+            "schema_compliance": 93.8,
+            "execution_success": 91.6,
+            "task_fulfillment": 0.349,
+            "information_grounding": 0.493,
+            "tool_appropriateness": 0.583,
+            "parameter_accuracy": 0.525,
+            "dependency_awareness": 0.355,
+            "parallelism_efficiency": 0.262,
+            "overall_score": 0.558
+        },
+        {
+            "name": "gemma-3-27b-it",
+            "valid_tool_name_rate": 98.8,
+            "schema_compliance": 97.6,
+            "execution_success": 94.4,
+            "task_fulfillment": 0.378,
+            "information_grounding": 0.530,
+            "tool_appropriateness": 0.608,
+            "parameter_accuracy": 0.572,
+            "dependency_awareness": 0.383,
+            "parallelism_efficiency": 0.249,
+            "overall_score": 0.582
+        },
+        {
+            "name": "gpt-4o",
+            "valid_tool_name_rate": 98.9,
+            "schema_compliance": 98.3,
+            "execution_success": 92.8,
+            "task_fulfillment": 0.394,
+            "information_grounding": 0.542,
+            "tool_appropriateness": 0.627,
+            "parameter_accuracy": 0.587,
+            "dependency_awareness": 0.405,
+            "parallelism_efficiency": 0.272,
+            "overall_score": 0.595
+        },
+        {
+            "name": "gemini-2.5-flash-lite",
+            "valid_tool_name_rate": 99.4,
+            "schema_compliance": 97.8,
+            "execution_success": 94.3,
+            "task_fulfillment": 0.412,
+            "information_grounding": 0.577,
+            "tool_appropriateness": 0.627,
+            "parameter_accuracy": 0.597,
+            "dependency_awareness": 0.404,
+            "parallelism_efficiency": 0.226,
+            "overall_score": 0.598
+        },
+        {
+            "name": "qwen3-30b-a3b-instruct-2507",
+            "valid_tool_name_rate": 99.0,
+            "schema_compliance": 98.4,
+            "execution_success": 92.3,
+            "task_fulfillment": 0.481,
+            "information_grounding": 0.530,
+            "tool_appropriateness": 0.658,
+            "parameter_accuracy": 0.638,
+            "dependency_awareness": 0.473,
+            "parallelism_efficiency": 0.303,
+            "overall_score": 0.627
+        },
+        {
+            "name": "kimi-k2",
+            "valid_tool_name_rate": 98.8,
+            "schema_compliance": 98.1,
+            "execution_success": 94.5,
+            "task_fulfillment": 0.502,
+            "information_grounding": 0.577,
+            "tool_appropriateness": 0.631,
+            "parameter_accuracy": 0.623,
+            "dependency_awareness": 0.448,
+            "parallelism_efficiency": 0.307,
+            "overall_score": 0.629
+        },
+        {
+            "name": "gpt-oss-20b",
+            "valid_tool_name_rate": 98.8,
+            "schema_compliance": 99.1,
+            "execution_success": 93.6,
+            "task_fulfillment": 0.547,
+            "information_grounding": 0.623,
+            "tool_appropriateness": 0.661,
+            "parameter_accuracy": 0.638,
+            "dependency_awareness": 0.509,
+            "parallelism_efficiency": 0.309,
+            "overall_score": 0.654
+        },
+        {
+            "name": "glm-4.5",
+            "valid_tool_name_rate": 99.7,
+            "schema_compliance": 99.7,
+            "execution_success": 97.4,
+            "task_fulfillment": 0.525,
+            "information_grounding": 0.682,
+            "tool_appropriateness": 0.680,
+            "parameter_accuracy": 0.661,
+            "dependency_awareness": 0.523,
+            "parallelism_efficiency": 0.297,
+            "overall_score": 0.668
+        },
+        {
+            "name": "qwen3-235b-a22b-2507",
+            "valid_tool_name_rate": 99.1,
+            "schema_compliance": 99.3,
+            "execution_success": 94.8,
+            "task_fulfillment": 0.549,
+            "information_grounding": 0.625,
+            "tool_appropriateness": 0.688,
+            "parameter_accuracy": 0.712,
+            "dependency_awareness": 0.542,
+            "parallelism_efficiency": 0.355,
+            "overall_score": 0.678
+        },
+        {
+            "name": "claude-sonnet-4",
+            "valid_tool_name_rate": 100.0,
+            "schema_compliance": 99.8,
+            "execution_success": 98.8,
+            "task_fulfillment": 0.554,
+            "information_grounding": 0.676,
+            "tool_appropriateness": 0.689,
+            "parameter_accuracy": 0.671,
+            "dependency_awareness": 0.541,
+            "parallelism_efficiency": 0.328,
+            "overall_score": 0.681
+        },
+        {
+            "name": "gemini-2.5-pro",
+            "valid_tool_name_rate": 99.4,
+            "schema_compliance": 99.6,
+            "execution_success": 96.9,
+            "task_fulfillment": 0.562,
+            "information_grounding": 0.725,
+            "tool_appropriateness": 0.717,
+            "parameter_accuracy": 0.670,
+            "dependency_awareness": 0.541,
+            "parallelism_efficiency": 0.329,
+            "overall_score": 0.690
+        },
+        {
+            "name": "gpt-oss-120b",
+            "valid_tool_name_rate": 97.7,
+            "schema_compliance": 98.8,
+            "execution_success": 94.0,
+            "task_fulfillment": 0.636,
+            "information_grounding": 0.705,
+            "tool_appropriateness": 0.691,
+            "parameter_accuracy": 0.661,
+            "dependency_awareness": 0.576,
+            "parallelism_efficiency": 0.329,
+            "overall_score": 0.692
+        },
+        {
+            "name": "o3",
+            "valid_tool_name_rate": 99.3,
+            "schema_compliance": 99.9,
+            "execution_success": 97.1,
+            "task_fulfillment": 0.641,
+            "information_grounding": 0.706,
+            "tool_appropriateness": 0.724,
+            "parameter_accuracy": 0.726,
+            "dependency_awareness": 0.592,
+            "parallelism_efficiency": 0.359,
+            "overall_score": 0.715
+        },
+        {
+            "name": "gpt-5",
+            "valid_tool_name_rate": 100.0,
+            "schema_compliance": 99.3,
+            "execution_success": 99.1,
+            "task_fulfillment": 0.677,
+            "information_grounding": 0.828,
+            "tool_appropriateness": 0.767,
+            "parameter_accuracy": 0.749,
+            "dependency_awareness": 0.649,
+            "parallelism_efficiency": 0.339,
+            "overall_score": 0.749
+        }
+    ]
+};
+class LeaderboardApp {
+    constructor() {
+        this.data = null;
+        this.filteredData = null;
+        this.currentSort = { column: 'overall_score', ascending: false };
+        this.init();
+    }
+    async init() {
+        try {
+            this.loadData();
+            this.setupEventListeners();
+            this.renderTable();
+            this.updateLastUpdated();
+        } catch (error) {
+            console.error('Failed to initialize app:', error);
+            this.showError('Failed to load leaderboard data');
+        }
+    }
+    loadData() {
+        const loading = document.getElementById('loading');
+        loading.classList.add('active');
+        try {
+            this.data = LEADERBOARD_DATA;
+            this.filteredData = [...this.data.models];
+            this.sortData();
+        } finally {
+            loading.classList.remove('active');
+        }
+    }
+    setupEventListeners() {
+        const searchInput = document.getElementById('searchInput');
+        const sortSelect = document.getElementById('sortSelect');
+        const sortOrder = document.getElementById('sortOrder');
+        const tableHeaders = document.querySelectorAll('.sortable');
+        searchInput.addEventListener('input', (e) => this.handleSearch(e.target.value));
+        sortSelect.addEventListener('change', (e) => this.handleSortChange(e.target.value));
+        sortOrder.addEventListener('click', () => this.toggleSortOrder());
+        tableHeaders.forEach(header => {
+            header.addEventListener('click', () => {
+                const column = header.dataset.column;
+                this.handleColumnSort(column);
+            });
+        });
+    }
+    handleSearch(query) {
+        const searchTerm = query.toLowerCase().trim();
+        if (searchTerm === '') {
+            this.filteredData = [...this.data.models];
+        } else {
+            this.filteredData = this.data.models.filter(model =>
+                model.name.toLowerCase().includes(searchTerm)
+            );
+        }
+        this.sortData();
+        this.renderTable();
+    }
+    handleSortChange(column) {
+        this.currentSort.column = column;
+        this.sortData();
+        this.renderTable();
+        this.updateSortIndicators();
+    }
+    handleColumnSort(column) {
+        if (this.currentSort.column === column) {
+            this.currentSort.ascending = !this.currentSort.ascending;
+        } else {
+            this.currentSort.column = column;
+            this.currentSort.ascending = false;
+        }
+        document.getElementById('sortSelect').value = column;
+        this.sortData();
+        this.renderTable();
+        this.updateSortIndicators();
+        this.updateSortOrderButton();
+    }
+    toggleSortOrder() {
+        this.currentSort.ascending = !this.currentSort.ascending;
+        this.sortData();
+        this.renderTable();
+        this.updateSortOrderButton();
+    }
+    sortData() {
+        const { column, ascending } = this.currentSort;
+        this.filteredData.sort((a, b) => {
+            let aValue = a[column];
+            let bValue = b[column];
+            if (typeof aValue === 'string') {
+                aValue = aValue.toLowerCase();
+                bValue = bValue.toLowerCase();
+            }
+            let comparison = 0;
+            if (aValue > bValue) comparison = 1;
+            if (aValue < bValue) comparison = -1;
+            return ascending ? comparison : -comparison;
+        });
+    }
+    renderTable() {
+        const tableBody = document.getElementById('tableBody');
+        if (this.filteredData.length === 0) {
+            tableBody.innerHTML = `
+                <tr>
+                    <td colspan="11" class="no-results">
+                        <i class="fas fa-search"></i>
+                        No models found matching your search criteria
+                    </td>
+                </tr>
+            `;
+            return;
+        }
+        tableBody.innerHTML = this.filteredData
+            .map((model) => this.createTableRow(model))
+            .join('');
+    }
+    createTableRow(model) {
+        return `
+            <tr>
+                <td class="model-col">
+                    <span class="model-name">${model.name}</span>
+                </td>
+                <td class="score-col">
+                    <span class="score ${this.getScoreClass(model.overall_score)}">
+                        ${model.overall_score.toFixed(3)}
+                    </span>
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.valid_tool_name_rate, true)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.schema_compliance, true)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.execution_success, true)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.task_fulfillment)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.information_grounding)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.tool_appropriateness)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.parameter_accuracy)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.dependency_awareness)}
+                </td>
+                <td class="metric-col">
+                    ${this.createMetricCell(model.parallelism_efficiency)}
+                </td>
+            </tr>
+        `;
+    }
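+    // Percentage metrics are stored on a 0-100 scale and the rest on 0-1;
+    // both are normalized to 0-1 for the color class and to 0-100 for the bar width.
+    createMetricCell(value, isPercentage = false) {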
+        const displayValue = isPercentage ?
+            `${value.toFixed(1)}%` :
+            value.toFixed(3);
+        const normalizedValue = isPercentage ? value / 100 : value;
+        const scoreClass = this.getScoreClass(normalizedValue);
+        const barWidth = isPercentage ? value : (value * 100);
+        return `
+            <div class="metric" data-tooltip="${displayValue}">
+                <div class="metric-bar ${scoreClass}" style="width: ${barWidth}%"></div>
+                <span>${displayValue}</span>
+            </div>
+        `;
+    }
+    getRankClass(rank) {
+        if (rank === 1) return 'top-1';
+        if (rank <= 3) return 'top-3';
+        if (rank <= 5) return 'top-5';
+        return '';
+    }
+    getScoreClass(score) {
+        if (score >= 0.7) return 'excellent';
+        if (score >= 0.6) return 'good';
+        if (score >= 0.5) return 'average';
+        return 'poor';
+    }
+    updateSortIndicators() {
+        const headers = document.querySelectorAll('.sortable');
+        headers.forEach(header => {
+            header.classList.remove('active');
+            const icon = header.querySelector('.sort-icon');
+            icon.className = 'fas fa-sort sort-icon';
+        });
+        const activeHeader = document.querySelector(`[data-column="${this.currentSort.column}"]`);
+        if (activeHeader) {
+            activeHeader.classList.add('active');
+            const icon = activeHeader.querySelector('.sort-icon');
+            icon.className = this.currentSort.ascending ?
+                'fas fa-sort-up sort-icon' :
+                'fas fa-sort-down sort-icon';
+        }
+    }
+    updateSortOrderButton() {
+        const sortOrderButton = document.getElementById('sortOrder');
+        const icon = sortOrderButton.querySelector('i');
+        icon.className = this.currentSort.ascending ?
+            'fas fa-sort-amount-up' :
+            'fas fa-sort-amount-down';
+        sortOrderButton.title = this.currentSort.ascending ?
+            'Sort descending' :
+            'Sort ascending';
+    }
+    updateLastUpdated() {
+        const lastUpdatedElement = document.getElementById('lastUpdated');
+        if (this.data && this.data.lastUpdated) {
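+            // A date-only ISO string parses as UTC midnight, so format in UTC
+            // to keep the displayed day from shifting in time zones behind UTC.
+            const date = new Date(this.data.lastUpdated);
+            lastUpdatedElement.textContent = date.toLocaleDateString('en-US', {
+                year: 'numeric',
+                month: 'long',
+                day: 'numeric',
+                timeZone: 'UTC'
+            });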
+        } else {
+            lastUpdatedElement.textContent = new Date().toLocaleDateString('en-US', {
+                year: 'numeric',
+                month: 'long',
+                day: 'numeric'
+            });
+        }
+    }
+    showError(message) {
+        const tableBody = document.getElementById('tableBody');
+        tableBody.innerHTML = `
+            <tr>
+                <td colspan="11" class="no-results">
+                    <i class="fas fa-exclamation-triangle"></i>
+                    ${message}
+                </td>
+            </tr>
+        `;
+    }
+}
+// Copy citation function
+function copyCitation() {
+    const citationText = `@article{wang2024mcpbench,
+  title={MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers},
+  author={Wang, Zhenting and Chang, Qi and Patel, Hemani and Biju, Shashank and Wu, Cheng-En and Liu, Quan and Ding, Aolin and Rezazadeh, Alireza and Shah, Ankit and Bao, Yujia and Siow, Eugene},
+  journal={arXiv preprint arXiv:2508.20453},
+  year={2025}
+}`;
+    if (navigator.clipboard && window.isSecureContext) {
+        navigator.clipboard.writeText(citationText).then(() => {
+            showCopySuccess();
+        }).catch(err => {
+            console.error('Failed to copy citation:', err);
+            fallbackCopy(citationText);
+        });
+    } else {
+        fallbackCopy(citationText);
+    }
+}
+function fallbackCopy(text) {
+    const textArea = document.createElement('textarea');
+    textArea.value = text;
+    textArea.style.position = 'fixed';
+    textArea.style.left = '-999999px';
+    textArea.style.top = '-999999px';
+    document.body.appendChild(textArea);
+    textArea.focus();
+    textArea.select();
+    try {
+        document.execCommand('copy');
+        showCopySuccess();
+    } catch (err) {
+        console.error('Fallback copy failed:', err);
+    }
+    document.body.removeChild(textArea);
+}
+function showCopySuccess() {
+    const button = document.querySelector('.copy-citation-btn');
+    const originalText = button.innerHTML;
+    button.innerHTML = '<i class="fas fa-check"></i> Copied!';
+    button.style.backgroundColor = '#10b981';
+    setTimeout(() => {
+        button.innerHTML = originalText;
+        button.style.backgroundColor = '';
+    }, 2000);
+}
+document.addEventListener('DOMContentLoaded', () => {
+    new LeaderboardApp();
+});
+if ('serviceWorker' in navigator) {
+    window.addEventListener('load', () => {
+        navigator.serviceWorker.register('/sw.js')
+            .then((registration) => {
+                console.log('SW registered: ', registration);
+            })
+            .catch((registrationError) => {
+                console.log('SW registration failed: ', registrationError);
+            });
+    });
+}
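
The sorting and color-banding logic in `LeaderboardApp` depends only on plain data, so it can be checked outside the browser. The sketch below is not part of this commit: `compareModels` and `scoreClass` are hypothetical standalone names that mirror the bodies of `sortData()` and `getScoreClass()` in `script.js`, and the three sample rows reuse `overall_score` values from `LEADERBOARD_DATA`.

```javascript
// Hypothetical standalone helpers mirroring sortData() and getScoreClass();
// they are an illustration only and do not exist in the committed script.js.
function compareModels(a, b, column, ascending) {
    let aValue = a[column];
    let bValue = b[column];
    // Strings are compared case-insensitively, numbers as-is.
    if (typeof aValue === 'string') {
        aValue = aValue.toLowerCase();
        bValue = bValue.toLowerCase();
    }
    let comparison = 0;
    if (aValue > bValue) comparison = 1;
    if (aValue < bValue) comparison = -1;
    return ascending ? comparison : -comparison;
}

// Same thresholds as getScoreClass(): scores are expected on a 0-1 scale.
function scoreClass(score) {
    if (score >= 0.7) return 'excellent';
    if (score >= 0.6) return 'good';
    if (score >= 0.5) return 'average';
    return 'poor';
}

// Sample rows taken from LEADERBOARD_DATA (only the fields used here).
const sample = [
    { name: 'gpt-4o', overall_score: 0.595 },
    { name: 'gpt-5', overall_score: 0.749 },
    { name: 'o3', overall_score: 0.715 }
];

// Sort a copy in descending order of overall score, as the app does by default.
const ranked = [...sample].sort((a, b) => compareModels(a, b, 'overall_score', false));
console.log(ranked.map((m) => `${m.name}: ${scoreClass(m.overall_score)}`));
```

With these thresholds a score of 0.595 falls in the `average` band, which is worth keeping in mind when reading the color coding: most models in the table sit between 0.5 and 0.7 overall.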

style.css CHANGED Viewed

@@ -1,28 +1,625 @@
 body {
-	padding: 2rem;
-	font-family: -apple-system, BlinkMacSystemFont, "Arial", sans-serif;
 }
-h1 {
-	font-size: 16px;
-	margin-top: 0;
 }
-p {
-	color: rgb(107, 114, 128);
-	font-size: 15px;
-	margin-bottom: 10px;
-	margin-top: 5px;
 }
-.card {
-	max-width: 620px;
-	margin: 0 auto;
-	padding: 16px;
-	border: 1px solid lightgray;
-	border-radius: 16px;
 }
-.card p:last-child {
-	margin-bottom: 0;
 }

+* {
+    margin: 0;
+    padding: 0;
+    box-sizing: border-box;
+}
+:root {
+    --primary-color: #2563eb;
+    --secondary-color: #1e40af;
+    --accent-color: #f59e0b;
+    --success-color: #10b981;
+    --warning-color: #f59e0b;
+    --error-color: #ef4444;
+    --bg-primary: #ffffff;
+    --bg-secondary: #f8fafc;
+    --bg-tertiary: #f1f5f9;
+    --text-primary: #1e293b;
+    --text-secondary: #64748b;
+    --text-muted: #94a3b8;
+    --border-color: #e2e8f0;
+    --shadow-sm: 0 1px 2px 0 rgb(0 0 0 / 0.05);
+    --shadow-md: 0 4px 6px -1px rgb(0 0 0 / 0.1), 0 2px 4px -2px rgb(0 0 0 / 0.1);
+    --shadow-lg: 0 10px 15px -3px rgb(0 0 0 / 0.1), 0 4px 6px -4px rgb(0 0 0 / 0.1);
+    --border-radius: 8px;
+    --border-radius-lg: 12px;
+    --font-family: 'Inter', -apple-system, BlinkMacSystemFont, 'Segoe UI', Roboto, sans-serif;
+}
+@media (prefers-color-scheme: dark) {
+    :root {
+        --bg-primary: #0f172a;
+        --bg-secondary: #1e293b;
+        --bg-tertiary: #334155;
+        --text-primary: #f8fafc;
+        --text-secondary: #cbd5e1;
+        --text-muted: #64748b;
+        --border-color: #334155;
+    }
+}
 body {
+    font-family: var(--font-family);
+    background-color: var(--bg-secondary);
+    color: var(--text-primary);
+    line-height: 1.6;
+}
+.container {
+    max-width: 1400px;
+    margin: 0 auto;
+    padding: 20px;
+}
+.paper-header {
+    text-align: center;
+    margin-bottom: 40px;
+    padding: 30px 20px;
+}
+.paper-title {
+    font-size: 2.2rem;
+    font-weight: 700;
+    margin-bottom: 20px;
+    color: var(--text-primary);
+    line-height: 1.2;
+}
+.paper-authors {
+    color: var(--text-secondary);
+}
+.paper-authors p {
+    margin: 5px 0;
+    font-size: 1.1rem;
+}
+.affiliation {
+    font-style: italic;
+    margin-top: 10px;
+}
+.paper-links {
+    display: flex;
+    justify-content: center;
+    gap: 20px;
+    margin-top: 20px;
+}
+.paper-link {
+    display: inline-flex;
+    align-items: center;
+    gap: 8px;
+    padding: 10px 20px;
+    background-color: var(--primary-color);
+    color: white;
+    text-decoration: none;
+    border-radius: var(--border-radius);
+    font-weight: 500;
+    transition: all 0.3s ease;
+    box-shadow: var(--shadow-sm);
+}
+.paper-link:hover {
+    background-color: var(--secondary-color);
+    transform: translateY(-2px);
+    box-shadow: var(--shadow-md);
+    color: white;
+    text-decoration: none;
+}
+.diagram-section {
+    text-align: center;
+    margin-bottom: 40px;
+    padding: 20px;
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    box-shadow: var(--shadow-md);
+}
+.diagram-image {
+    max-width: 100%;
+    height: auto;
+    border-radius: var(--border-radius);
+    margin-bottom: 20px;
+}
+.diagram-caption {
+    text-align: justify;
+    line-height: 1.6;
+    color: var(--text-secondary);
+    font-size: 1.1rem;
+    margin: 0 auto;
+    max-width: 1200px;
+    padding: 0 20px;
+}
+.chart-section {
+    margin-bottom: 40px;
+    padding: 20px;
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    box-shadow: var(--shadow-md);
+}
+.section-title {
+    font-size: 1.8rem;
+    font-weight: 600;
+    margin-bottom: 20px;
+    color: var(--text-primary);
+    text-align: center;
+}
+.ranking-chart {
+    max-width: 100%;
+    height: auto;
+    display: block;
+    margin: 0 auto;
+}
+.leaderboard-section {
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    padding: 20px;
+    box-shadow: var(--shadow-md);
+}
+.controls {
+    display: flex;
+    flex-wrap: wrap;
+    gap: 20px;
+    align-items: center;
+    justify-content: space-between;
+    margin-bottom: 30px;
+    padding: 25px;
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    box-shadow: var(--shadow-md);
 }
+.search-container {
+    display: flex;
+    align-items: center;
+    position: relative;
+    flex: 1;
+    min-width: 300px;
 }
+.search-container i {
+    position: absolute;
+    left: 15px;
+    color: var(--text-muted);
+    z-index: 1;
 }
+.search-input {
+    width: 100%;
+    padding: 12px 15px 12px 45px;
+    border: 2px solid var(--border-color);
+    border-radius: var(--border-radius);
+    font-size: 1rem;
+    background-color: var(--bg-secondary);
+    color: var(--text-primary);
+    transition: all 0.3s ease;
 }
+.search-input:focus {
+    outline: none;
+    border-color: var(--primary-color);
+    box-shadow: 0 0 0 3px rgb(37 99 235 / 0.1);
 }
+.filter-container {
+    display: flex;
+    align-items: center;
+    gap: 15px;
+}
+.filter-container label {
+    font-weight: 500;
+    color: var(--text-secondary);
+}
+.sort-select {
+    padding: 10px 15px;
+    border: 2px solid var(--border-color);
+    border-radius: var(--border-radius);
+    background-color: var(--bg-secondary);
+    color: var(--text-primary);
+    font-size: 1rem;
+    cursor: pointer;
+    transition: border-color 0.3s ease;
+}
+.sort-select:focus {
+    outline: none;
+    border-color: var(--primary-color);
+}
+.sort-btn {
+    padding: 10px 12px;
+    border: 2px solid var(--border-color);
+    border-radius: var(--border-radius);
+    background-color: var(--bg-secondary);
+    color: var(--text-secondary);
+    cursor: pointer;
+    transition: all 0.3s ease;
+}
+.sort-btn:hover {
+    background-color: var(--primary-color);
+    color: white;
+    border-color: var(--primary-color);
+}
+.table-container {
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    overflow: hidden;
+    box-shadow: var(--shadow-lg);
+    margin-bottom: 30px;
+}
+.leaderboard-table {
+    width: 100%;
+    border-collapse: collapse;
+    font-size: 0.9rem;
+}
+.leaderboard-table thead {
+    background: linear-gradient(135deg, var(--bg-tertiary), var(--bg-secondary));
+}
+.leaderboard-table th,
+.leaderboard-table td {
+    padding: 15px 12px;
+    text-align: left;
+    border-bottom: 1px solid var(--border-color);
+}
+.leaderboard-table th {
+    font-weight: 600;
+    color: var(--text-primary);
+    position: sticky;
+    top: 0;
+    background: var(--bg-tertiary);
+    z-index: 10;
+}
+.sortable {
+    cursor: pointer;
+    user-select: none;
+    transition: background-color 0.2s ease;
+    position: relative;
+}
+.sortable:hover {
+    background-color: var(--border-color);
+}
+.sort-icon {
+    margin-left: 8px;
+    opacity: 0.5;
+    transition: opacity 0.2s ease;
+}
+.sortable:hover .sort-icon {
+    opacity: 1;
+}
+.sortable.active .sort-icon {
+    opacity: 1;
+    color: var(--primary-color);
+}
+.rank-col {
+    width: 80px;
+    text-align: center;
+}
+.model-col {
+    width: 250px;
+    min-width: 200px;
+}
+.score-col {
+    width: 120px;
+}
+.metric-col {
+    width: 110px;
+    text-align: center;
+}
+.leaderboard-table tbody tr {
+    transition: all 0.2s ease;
+}
+.leaderboard-table tbody tr:hover {
+    background-color: var(--bg-secondary);
+    transform: translateY(-1px);
+    box-shadow: 0 4px 8px rgb(0 0 0 / 0.1);
+}
+.rank {
+    font-weight: 700;
+    font-size: 1.1rem;
+    color: var(--primary-color);
+}
+.rank.top-1 {
+    color: #fbbf24;
+}
+.rank.top-3 {
+    color: #f59e0b;
+}
+.rank.top-5 {
+    color: #10b981;
+}
+.model-name {
+    font-weight: 600;
+    color: var(--text-primary);
+}
+.score {
+    font-weight: 600;
+    padding: 6px 12px;
+    border-radius: var(--border-radius);
+    color: white;
+    text-align: center;
+}
+.score.excellent {
+    background: linear-gradient(135deg, #10b981, #059669);
+}
+.score.good {
+    background: linear-gradient(135deg, #3b82f6, #2563eb);
+}
+.score.average {
+    background: linear-gradient(135deg, #f59e0b, #d97706);
+}
+.score.poor {
+    background: linear-gradient(135deg, #ef4444, #dc2626);
+}
+.metric {
+    position: relative;
+}
+.metric-bar {
+    position: absolute;
+    left: 0;
+    top: 0;
+    height: 100%;
+    border-radius: var(--border-radius);
+    opacity: 0.1;
+    transition: opacity 0.3s ease;
+}
+.metric-bar.excellent {
+    background-color: var(--success-color);
+}
+.metric-bar.good {
+    background-color: var(--primary-color);
+}
+.metric-bar.average {
+    background-color: var(--warning-color);
+}
+.metric-bar.poor {
+    background-color: var(--error-color);
+}
+.leaderboard-table tbody tr:hover .metric-bar {
+    opacity: 0.2;
+}
+.loading {
+    text-align: center;
+    padding: 60px 20px;
+    color: var(--text-secondary);
+    font-size: 1.1rem;
+    display: none;
+}
+.loading.active {
+    display: block;
+}
+.loading i {
+    font-size: 2rem;
+    margin-bottom: 15px;
+    display: block;
+    color: var(--primary-color);
+}
+.citation-section {
+    margin-bottom: 40px;
+    padding: 20px;
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    box-shadow: var(--shadow-md);
+}
+.citation-box {
+    position: relative;
+    background-color: var(--bg-secondary);
+    border: 1px solid var(--border-color);
+    border-radius: var(--border-radius);
+    padding: 20px;
+    margin-top: 20px;
+}
+.citation-text {
+    font-family: 'Monaco', 'Menlo', 'Ubuntu Mono', monospace;
+    font-size: 0.85rem;
+    line-height: 1.4;
+    color: var(--text-primary);
+    margin: 0;
+    padding: 0;
+    white-space: pre-wrap;
+    overflow-x: auto;
+}
+.copy-citation-btn {
+    position: absolute;
+    top: 15px;
+    right: 15px;
+    background-color: var(--primary-color);
+    color: white;
+    border: none;
+    border-radius: var(--border-radius);
+    padding: 8px 12px;
+    font-size: 0.85rem;
+    cursor: pointer;
+    transition: all 0.3s ease;
+    display: flex;
+    align-items: center;
+    gap: 5px;
+}
+.copy-citation-btn:hover {
+    background-color: var(--secondary-color);
+    transform: translateY(-1px);
+}
+.copy-citation-btn:active {
+    transform: translateY(0);
+}
+.footer {
+    text-align: center;
+    padding: 30px 20px;
+    color: var(--text-muted);
+    background-color: var(--bg-primary);
+    border-radius: var(--border-radius-lg);
+    margin-top: 40px;
+}
+.footer p {
+    margin-bottom: 5px;
+}
+@media (max-width: 1200px) {
+    .table-container {
+        overflow-x: auto;
+    }
+    .leaderboard-table {
+        min-width: 1000px;
+    }
+}
+@media (max-width: 768px) {
+    .container {
+        padding: 15px;
+    }
+    .title {
+        font-size: 2rem;
+    }
+    .controls {
+        flex-direction: column;
+        align-items: stretch;
+    }
+    .search-container {
+        min-width: auto;
+    }
+    .filter-container {
+        justify-content: space-between;
+    }
+    .leaderboard-table th,
+    .leaderboard-table td {
+        padding: 10px 8px;
+        font-size: 0.8rem;
+    }
+    .model-col {
+        width: 180px;
+        min-width: 160px;
+    }
+    .metric-col {
+        width: 90px;
+    }
+}
+@media (max-width: 480px) {
+    .title {
+        font-size: 1.5rem;
+        flex-direction: column;
+        gap: 10px;
+    }
+    .leaderboard-table {
+        min-width: 800px;
+    }
+    .leaderboard-table th,
+    .leaderboard-table td {
+        padding: 8px 6px;
+    }
+}
+.no-results {
+    text-align: center;
+    padding: 60px 20px;
+    color: var(--text-muted);
+    font-size: 1.1rem;
+}
+.no-results i {
+    font-size: 3rem;
+    margin-bottom: 20px;
+    display: block;
+    opacity: 0.5;
+}
+@keyframes fadeIn {
+    from { opacity: 0; transform: translateY(10px); }
+    to { opacity: 1; transform: translateY(0); }
+}
+.leaderboard-table tbody tr {
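+    /* No `forwards` fill: a persisting final keyframe would keep transform at translateY(0) and override the tr:hover lift defined earlier. */
+    animation: fadeIn 0.3s ease;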
+}
+.tooltip {
+    position: relative;
+    cursor: help;
+}
+.tooltip:hover::after {
+    content: attr(data-tooltip);
+    position: absolute;
+    bottom: 100%;
+    left: 50%;
+    transform: translateX(-50%);
+    background-color: var(--text-primary);
+    color: var(--bg-primary);
+    padding: 8px 12px;
+    border-radius: var(--border-radius);
+    font-size: 0.8rem;
+    white-space: nowrap;
+    z-index: 1000;
+    box-shadow: var(--shadow-md);
+}