---
license: mit
tags:
- instruction-following
- llm-evaluation
- benchmark
- reproducibility
- openrouter
language:
- en
pretty_name: LLM Instruction-Following Evaluation Code
---

# LLM Instruction-Following Evaluation Framework - Code Repository

[![Paper](https://img.shields.io/badge/arXiv-2510.18892-b31b1b.svg)](http://arxiv.org/abs/2510.18892)
[![Dataset](https://img.shields.io/badge/🤗-Dataset-yellow.svg)](https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval)
[![Python](https://img.shields.io/badge/Python-3.7+-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

This repository contains the complete evaluation framework used in our paper **"When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs"** (arXiv:2510.18892).

## 📋 What's Included

This code repository provides everything needed to:
- ✅ Reproduce our evaluation of 256 models across 20 diagnostic tests
- ✅ Run the evaluation on new models
- ✅ Add your own custom instruction-following tests
- ✅ Generate publication-quality visualizations
- ✅ Export results to multiple formats (Excel, JSON, LaTeX)

## 🚀 Quick Start

### Installation

```bash
# Clone the repository or download the files, then install dependencies
pip install -r requirements.txt
# (equivalently: pip install pandas openpyxl requests matplotlib seaborn numpy)

# Set your OpenRouter API key
export OPENROUTER_API_KEY="your_api_key_here"
```

### Run Evaluation

```bash
# Run comprehensive evaluation (256 models × 20 tests)
python test_comprehensive_20_verified.py

# Generate analysis and visualizations
python analyze_comprehensive_final.py
```

## 📁 Key Files

### Core Evaluation
- **`test_comprehensive_20_verified.py`** - Main test runner
  - Evaluates models across all 20 diagnostic tests
  - Exact-match evaluation with normalized whitespace
  - Exports results to Excel with multiple sheets
  - ~6-8 hours for full 256-model evaluation

- **`questions.json`** - Complete test bank (20 diagnostic prompts)
  - Each test includes: prompt, expected output, category, difficulty
  - Covers 5 categories: String Manipulation, Constraint Compliance, Text Processing, Structured Data, Complex Operations
  - Frozen version used for paper evaluation

- **`models_verified_working_v2_20251014_091649.py`** - Model configuration
  - 256 verified working models from OpenRouter
  - Pre-verified for basic functionality
  - Includes provider information

### Analysis & Visualization
- **`analyze_comprehensive_final.py`** - Comprehensive analysis pipeline
  - Generates 4 publication-quality PDF figures
  - Creates LaTeX tables for paper integration
  - Computes statistical summaries
  - Category and provider performance breakdowns

### Supporting Files
- **`requirements.txt`** - Python dependencies
- **`README.md`** - This file (setup and usage instructions)

## 🧪 Test Categories

Our 20 diagnostic tests cover five categories:

### 1. String Manipulation (Tests 1, 3, 5, 17, 20) - HARDEST
- Multi-step text transformations
- Average pass rate: 12.0%
- Example: Test 5 (Complex String Transformation) - only 2.7% pass rate

### 2. Constraint Compliance (Tests 2, 9, 15) - EASIEST
- Following exact output specifications
- Average pass rate: 66.9%
- Example: Test 2 (Exact Output Compliance) - 96.1% pass rate

### 3. Text Processing (Test 13)
- Targeted text manipulation tasks
- Average pass rate: 50.5%

### 4. Structured Data (Tests 4, 6, 10, 12, 14)
- JSON, Markdown, CSV generation
- Average pass rate: 41.1%

### 5. Complex Operations (Tests 7, 8, 11, 16, 18, 19)
- Multi-step reasoning and computation
- Average pass rate: 35.0%

## 📊 Evaluation Methodology

### Exact Match Evaluation
- **Binary Pass/Fail**: No partial credit
- **Whitespace Normalized**: Leading/trailing spaces ignored
- **Case Sensitive**: Preserves intentional capitalization
- **Format Strict**: JSON, tables, special characters must be exact

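In practice, this check reduces to a few lines. A minimal sketch of the pass/fail comparison, assuming the `exact_match` and `case_sensitive` flags from `questions.json` (the released runner may differ in small details):

```python
def is_pass(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Binary pass/fail: exact match after trimming leading/trailing whitespace."""
    got, want = response.strip(), expected.strip()
    if not case_sensitive:
        got, want = got.lower(), want.lower()
    return got == want
```
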
### Why Exact Match?
1. **Objectivity** - Eliminates subjective judgment
2. **Reproducibility** - Deterministic, verifiable results
3. **Clarity** - Binary success/failure (no ambiguity)
4. **Efficiency** - No manual review needed
5. **Diagnostic Power** - Reveals specific failure modes

## 📈 Results Summary

From our October 14, 2025 evaluation of 256 models:

- **Overall Pass Rate**: 43.7%
- **Best Model**: qwen/qwen-plus-2025-07-28:thinking (100%)
- **Most Difficult Test**: Test 5 - Complex String Transformation (2.7%)
- **Top Provider**: x-ai (79.3% average across 15 models)

## 🔧 Customization

### Adding New Tests

Edit `questions.json` to add new diagnostic tests:

```json
{
  "id": 21,
  "test_name": "Your New Test",
  "category": "Custom Category",
  "difficulty": "medium",
  "prompt": "Your instruction prompt here",
  "expected_output": "Exact expected response",
  "exact_match": true,
  "case_sensitive": false
}
```

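Before running the suite, it can help to sanity-check new entries against the fields shown above. A quick validation sketch (the required-field set is inferred from the example schema, not from the runner itself):

```python
import json

# Fields every test entry is expected to carry, per the example above
REQUIRED = {"id", "test_name", "category", "difficulty",
            "prompt", "expected_output", "exact_match", "case_sensitive"}

with open('questions.json', 'r') as f:
    questions = json.load(f)

for q in questions:
    missing = REQUIRED - q.keys()
    assert not missing, f"Test {q.get('id')} is missing fields: {missing}"
print(f"All {len(questions)} test entries look complete.")
```
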
### Testing Custom Models

Modify `models_verified_working_v2_20251014_091649.py` or create your own model list:

```python
MODELS = [
    {
        "name": "provider/model-name",
        "provider": "provider",
        "verified": True
    },
    # Add more models...
]
```

### Adjusting Analysis

Customize `analyze_comprehensive_final.py` to:
- Change visualization styles
- Add new analysis metrics
- Modify export formats
- Create custom reports

## 📦 Output Files

The evaluation produces:

1. **Excel Workbook** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.xlsx`)
   - Overview sheet with summary statistics
   - Model rankings (sorted by performance)
   - Test difficulty analysis
   - Category performance breakdown
   - Complete raw results (all 5,120 evaluations)
   - Test descriptions

2. **JSON Export** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.json`)
   - Machine-readable format
   - Includes metadata and timestamps
   - All test results with responses

3. **PDF Visualizations**
   - `fig1_heatmap.pdf` - Performance matrix
   - `fig2_provider.pdf` - Provider comparison
   - `fig3_difficulty.pdf` - Test difficulty
   - `fig4_category.pdf` - Category performance

4. **LaTeX Tables** (`paper_tables.tex`)
   - Ready for paper integration
   - Formatted with booktabs package

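The JSON export is the easiest artifact to consume programmatically. A minimal loading sketch (the top-level structure and key names are assumptions based on the description above, so inspect one record before building on it):

```python
import json

# Substitute the real timestamp from your run for YYYYMMDD_HHMMSS
with open('comprehensive_20_tests_results_YYYYMMDD_HHMMSS.json', 'r') as f:
    results = json.load(f)

# Inspect the structure before relying on specific keys
if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
else:
    print("First record:", results[0])
```
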
## 🔍 Reproducibility

To exactly reproduce our paper results:

```bash
# Use the frozen model list from October 14, 2025
python test_comprehensive_20_verified.py

# Use the frozen test bank
# (questions.json is already frozen at 20 tests)

# Generate analysis with same parameters
python analyze_comprehensive_final.py
```

**Note**: Model outputs may vary over time as providers update their models. For exact reproducibility, use the snapshot from our evaluation date.

## 💡 Usage Examples

### Quick Test (5 models)

```python
# Edit test_comprehensive_20_verified.py
# Change MODELS to a subset:
MODELS = [
    "openai/gpt-4o",
    "anthropic/claude-3.7-sonnet",
    "google/gemini-2.0-flash-exp:free",
    "meta-llama/llama-3.3-70b-instruct",
    "qwen/qwen-plus-2025-07-28:thinking"
]
```

### Single Model Test

```python
import json
import os

import requests

# Load the frozen test bank
with open('questions.json', 'r') as f:
    questions = json.load(f)

# Test a single model against every prompt
api_key = os.environ["OPENROUTER_API_KEY"]
model = "openai/gpt-4o"
for q in questions:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": q["prompt"]}]
        }
    )
    # Exact-match evaluation with normalized whitespace
    answer = response.json()["choices"][0]["message"]["content"]
    passed = answer.strip() == q["expected_output"].strip()
    print(f"Test {q['id']}: {'PASS' if passed else 'FAIL'}")
```
259
+
260
+ ### Custom Analysis
261
+
262
+ ```python
263
+ import pandas as pd
264
+
265
+ # Load results
266
+ df = pd.read_excel('results.xlsx', sheet_name='All Results')
267
+
268
+ # Custom analysis
269
+ top_models = df.groupby('model')['passed'].mean().sort_values(ascending=False).head(10)
270
+ print(top_models)
271
+
272
+ # Category performance
273
+ category_perf = df.groupby('category')['passed'].mean()
274
+ print(category_perf)
275
+ ```

## 🐛 Troubleshooting

### Common Issues

**1. API Rate Limiting**
```python
# OpenRouter may rate limit; add a delay between requests
# in test_comprehensive_20_verified.py:
import time
time.sleep(1)
```

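If a fixed delay is not enough, retrying with exponential backoff on HTTP 429 is a common pattern. A minimal sketch (this helper is illustrative and not part of the released scripts):

```python
import time
import requests

def post_with_retry(url, max_retries=5, **kwargs):
    """POST with exponential backoff on HTTP 429 rate-limit responses."""
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return response
```
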
**2. JSON Serialization Errors**
```bash
# Use export_json_from_excel.py to convert numpy types
python export_json_from_excel.py
```

**3. Missing Packages**
```bash
pip install pandas openpyxl requests matplotlib seaborn numpy
```

**4. API Key Not Set**
```bash
export OPENROUTER_API_KEY="your_key_here"
# Or set in Python: os.environ['OPENROUTER_API_KEY'] = "your_key"
```

## 📚 Citation

If you use this code in your research, please cite:

```bibtex
@article{young2025instruction,
  title={When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs},
  author={Young, Richard J. and Gillins, Brandon and Matthews, Alice M.},
  journal={arXiv preprint arXiv:2510.18892},
  year={2025}
}
```

## 🔗 Related Resources

- **Paper**: http://arxiv.org/abs/2510.18892
- **Dataset**: https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval
- **Paper Repository**: https://huggingface.co/richardyoung/llm-instruction-following-paper

## 📞 Contact

**Research Team:**
- Richard J. Young - ryoung@unlv.edu
- Brandon Gillins - bgillins@unlv.edu
- Alice M. Matthews - amatthews@unlv.edu

**Affiliation:** University of Nevada, Las Vegas

## 🙏 Acknowledgments

- **OpenRouter** for unified API access to 256+ models
- **Model Providers** (OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and others)
- Open source community for evaluation tools and frameworks

## 📜 License

This code is released under the **MIT License**.

```
MIT License

Copyright (c) 2025 Richard J. Young, Brandon Gillins, Alice M. Matthews

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

---

**Repository Version:** 1.0  
**Last Updated:** October 23, 2025  
**Evaluation Date:** October 14, 2025