---
license: mit
tags:
- instruction-following
- llm-evaluation
- benchmark
- reproducibility
- openrouter
language:
- en
pretty_name: LLM Instruction-Following Evaluation Code
---

# LLM Instruction-Following Evaluation Framework - Code Repository

[![Paper](https://img.shields.io/badge/arXiv-2510.18892-b31b1b.svg)](http://arxiv.org/abs/2510.18892)
[![Dataset](https://img.shields.io/badge/🤗-Dataset-yellow.svg)](https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval)
[![Python](https://img.shields.io/badge/Python-3.7+-blue.svg)](https://www.python.org/)
[![License](https://img.shields.io/badge/License-MIT-green.svg)](https://opensource.org/licenses/MIT)

This repository contains the complete evaluation framework used in our paper **"When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs"** (arXiv:2510.18892).

## 📋 What's Included

This code repository provides everything needed to:
- ✅ Reproduce our evaluation of 256 models across 20 diagnostic tests
- ✅ Run the evaluation on new models
- ✅ Add your own custom instruction-following tests
- ✅ Generate publication-quality visualizations
- ✅ Export results to multiple formats (Excel, JSON, LaTeX)

## 🚀 Quick Start

### Installation

```bash
# Clone the repository or download the files, then install dependencies
pip install -r requirements.txt
# (equivalently: pip install pandas openpyxl requests matplotlib seaborn numpy)

# Set your OpenRouter API key
export OPENROUTER_API_KEY="your_api_key_here"
```

### Run Evaluation

```bash
# Run comprehensive evaluation (256 models × 20 tests)
python test_comprehensive_20_verified.py

# Generate analysis and visualizations
python analyze_comprehensive_final.py
```

## 📁 Key Files

### Core Evaluation
- **`test_comprehensive_20_verified.py`** - Main test runner
  - Evaluates models across all 20 diagnostic tests
  - Exact-match evaluation with normalized whitespace
  - Exports results to Excel with multiple sheets
  - ~6-8 hours for full 256-model evaluation

- **`questions.json`** - Complete test bank (20 diagnostic prompts)
  - Each test includes: prompt, expected output, category, difficulty
  - Covers 5 categories: String Manipulation, Constraint Compliance, Text Processing, Structured Data, Complex Operations
  - Frozen version used for paper evaluation

- **`models_verified_working_v2_20251014_091649.py`** - Model configuration
  - 256 verified working models from OpenRouter
  - Pre-verified for basic functionality
  - Includes provider information

### Analysis & Visualization
- **`analyze_comprehensive_final.py`** - Comprehensive analysis pipeline
  - Generates 4 publication-quality PDF figures
  - Creates LaTeX tables for paper integration
  - Computes statistical summaries
  - Category and provider performance breakdowns

### Supporting Files
- **`requirements.txt`** - Python dependencies
- **`README.md`** - This file (setup and usage instructions)

## 🧪 Test Categories

Our 20 diagnostic tests cover five categories:

### 1. String Manipulation (Tests 1, 3, 5, 17, 20) - HARDEST
- Multi-step text transformations
- Average pass rate: 12.0%
- Example: Test 5 (Complex String Transformation) - only 2.7% pass rate

### 2. Constraint Compliance (Tests 2, 9, 15) - EASIEST
- Following exact output specifications
- Average pass rate: 66.9%
- Example: Test 2 (Exact Output Compliance) - 96.1% pass rate

### 3. Text Processing (Test 13)
- Targeted text manipulation tasks
- Average pass rate: 50.5%

### 4. Structured Data (Tests 4, 6, 10, 12, 14)
- JSON, Markdown, CSV generation
- Average pass rate: 41.1%

### 5. Complex Operations (Tests 7, 8, 11, 16, 18, 19)
- Multi-step reasoning and computation
- Average pass rate: 35.0%

## 📊 Evaluation Methodology

### Exact Match Evaluation
- **Binary Pass/Fail**: No partial credit
- **Whitespace Normalized**: Leading/trailing spaces ignored
- **Case Sensitive**: Preserves intentional capitalization
- **Format Strict**: JSON, tables, special characters must be exact

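In practice, this check reduces to a few lines. A minimal sketch of the pass/fail comparison, assuming the `exact_match` and `case_sensitive` flags from `questions.json` (the released runner may differ in small details):

```python
def is_pass(response: str, expected: str, case_sensitive: bool = True) -> bool:
    """Binary pass/fail: exact match after trimming leading/trailing whitespace."""
    got, want = response.strip(), expected.strip()
    if not case_sensitive:
        got, want = got.lower(), want.lower()
    return got == want
```
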
### Why Exact Match?
1. **Objectivity** - Eliminates subjective judgment
2. **Reproducibility** - Deterministic, verifiable results
3. **Clarity** - Binary success/failure (no ambiguity)
4. **Efficiency** - No manual review needed
5. **Diagnostic Power** - Reveals specific failure modes

## 📈 Results Summary

From our October 14, 2025 evaluation of 256 models:

- **Overall Pass Rate**: 43.7%
- **Best Model**: qwen/qwen-plus-2025-07-28:thinking (100%)
- **Most Difficult Test**: Test 5 - Complex String Transformation (2.7%)
- **Top Provider**: x-ai (79.3% average across 15 models)

## 🔧 Customization

### Adding New Tests

Edit `questions.json` to add new diagnostic tests:

```json
{
  "id": 21,
  "test_name": "Your New Test",
  "category": "Custom Category",
  "difficulty": "medium",
  "prompt": "Your instruction prompt here",
  "expected_output": "Exact expected response",
  "exact_match": true,
  "case_sensitive": false
}
```

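Before running the suite, it can help to sanity-check new entries against the fields shown above. A quick validation sketch (the required-field set is inferred from the example schema, not from the runner itself):

```python
import json

# Fields every test entry is expected to carry, per the example above
REQUIRED = {"id", "test_name", "category", "difficulty",
            "prompt", "expected_output", "exact_match", "case_sensitive"}

with open('questions.json', 'r') as f:
    questions = json.load(f)

for q in questions:
    missing = REQUIRED - q.keys()
    assert not missing, f"Test {q.get('id')} is missing fields: {missing}"
print(f"All {len(questions)} test entries look complete.")
```
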
### Testing Custom Models

Modify `models_verified_working_v2_20251014_091649.py` or create your own model list:

```python
MODELS = [
    {
        "name": "provider/model-name",
        "provider": "provider",
        "verified": True
    },
    # Add more models...
]
```

### Adjusting Analysis

Customize `analyze_comprehensive_final.py` to:
- Change visualization styles
- Add new analysis metrics
- Modify export formats
- Create custom reports

## 📦 Output Files

The evaluation produces:

1. **Excel Workbook** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.xlsx`)
   - Overview sheet with summary statistics
   - Model rankings (sorted by performance)
   - Test difficulty analysis
   - Category performance breakdown
   - Complete raw results (all 5,120 evaluations)
   - Test descriptions

2. **JSON Export** (`comprehensive_20_tests_results_YYYYMMDD_HHMMSS.json`)
   - Machine-readable format
   - Includes metadata and timestamps
   - All test results with responses

3. **PDF Visualizations**
   - `fig1_heatmap.pdf` - Performance matrix
   - `fig2_provider.pdf` - Provider comparison
   - `fig3_difficulty.pdf` - Test difficulty
   - `fig4_category.pdf` - Category performance

4. **LaTeX Tables** (`paper_tables.tex`)
   - Ready for paper integration
   - Formatted with booktabs package

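The JSON export is the easiest artifact to consume programmatically. A minimal loading sketch (the top-level structure and key names are assumptions based on the description above, so inspect one record before building on it):

```python
import json

# Substitute the real timestamp from your run for YYYYMMDD_HHMMSS
with open('comprehensive_20_tests_results_YYYYMMDD_HHMMSS.json', 'r') as f:
    results = json.load(f)

# Inspect the structure before relying on specific keys
if isinstance(results, dict):
    print("Top-level keys:", list(results.keys()))
else:
    print("First record:", results[0])
```
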
## 🔍 Reproducibility

To exactly reproduce our paper results:

```bash
# Use the frozen model list from October 14, 2025
python test_comprehensive_20_verified.py

# Use the frozen test bank
# (questions.json is already frozen at 20 tests)

# Generate analysis with same parameters
python analyze_comprehensive_final.py
```

**Note**: Model outputs may vary over time as providers update their models. For exact reproducibility, use the snapshot from our evaluation date.

## 💡 Usage Examples

### Quick Test (5 models)

```python
# Edit test_comprehensive_20_verified.py
# Change MODELS to a subset:
MODELS = [
    "openai/gpt-4o",
    "anthropic/claude-3.7-sonnet",
    "google/gemini-2.0-flash-exp:free",
    "meta-llama/llama-3.3-70b-instruct",
    "qwen/qwen-plus-2025-07-28:thinking"
]
```

### Single Model Test

```python
import json
import os

import requests

# Load the frozen test bank
with open('questions.json', 'r') as f:
    questions = json.load(f)

# Test a single model against every prompt
api_key = os.environ["OPENROUTER_API_KEY"]
model = "openai/gpt-4o"
for q in questions:
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={"Authorization": f"Bearer {api_key}"},
        json={
            "model": model,
            "messages": [{"role": "user", "content": q["prompt"]}]
        }
    )
    # Exact-match evaluation with normalized whitespace
    answer = response.json()["choices"][0]["message"]["content"]
    passed = answer.strip() == q["expected_output"].strip()
    print(f"Test {q['id']}: {'PASS' if passed else 'FAIL'}")
```
259
+
260
+ ### Custom Analysis
261
+
262
+ ```python
263
+ import pandas as pd
264
+
265
+ # Load results
266
+ df = pd.read_excel('results.xlsx', sheet_name='All Results')
267
+
268
+ # Custom analysis
269
+ top_models = df.groupby('model')['passed'].mean().sort_values(ascending=False).head(10)
270
+ print(top_models)
271
+
272
+ # Category performance
273
+ category_perf = df.groupby('category')['passed'].mean()
274
+ print(category_perf)
275
+ ```

## 🐛 Troubleshooting

### Common Issues

**1. API Rate Limiting**
```python
# OpenRouter may rate limit; add a delay between requests
# in test_comprehensive_20_verified.py:
import time
time.sleep(1)
```

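If a fixed delay is not enough, retrying with exponential backoff on HTTP 429 is a common pattern. A minimal sketch (this helper is illustrative and not part of the released scripts):

```python
import time
import requests

def post_with_retry(url, max_retries=5, **kwargs):
    """POST with exponential backoff on HTTP 429 rate-limit responses."""
    for attempt in range(max_retries):
        response = requests.post(url, **kwargs)
        if response.status_code != 429:
            return response
        time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
    return response
```
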
**2. JSON Serialization Errors**
```bash
# Use export_json_from_excel.py to convert numpy types
python export_json_from_excel.py
```

**3. Missing Packages**
```bash
pip install pandas openpyxl requests matplotlib seaborn numpy
```

**4. API Key Not Set**
```bash
export OPENROUTER_API_KEY="your_key_here"
# Or set in Python: os.environ['OPENROUTER_API_KEY'] = "your_key"
```

## 📚 Citation

If you use this code in your research, please cite:

```bibtex
@article{young2025instruction,
  title={When Models Can't Follow: Testing Instruction Adherence Across 256 LLMs},
  author={Young, Richard J. and Gillins, Brandon and Matthews, Alice M.},
  journal={arXiv preprint arXiv:2510.18892},
  year={2025}
}
```

## 🔗 Related Resources

- **Paper**: http://arxiv.org/abs/2510.18892
- **Dataset**: https://huggingface.co/datasets/richardyoung/llm-instruction-following-eval
- **Paper Repository**: https://huggingface.co/richardyoung/llm-instruction-following-paper

## 📞 Contact

**Research Team:**
- Richard J. Young - ryoung@unlv.edu
- Brandon Gillins - bgillins@unlv.edu
- Alice M. Matthews - amatthews@unlv.edu

**Affiliation:** University of Nevada, Las Vegas

## 🙏 Acknowledgments

- **OpenRouter** for unified API access to 256+ models
- **Model Providers** (OpenAI, Anthropic, Google, Meta, Qwen, DeepSeek, x-ai, and others)
- Open source community for evaluation tools and frameworks

## 📜 License

This code is released under the **MIT License**.

```
MIT License

Copyright (c) 2025 Richard J. Young, Brandon Gillins, Alice M. Matthews

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
```

---

**Repository Version:** 1.0  
**Last Updated:** October 23, 2025  
**Evaluation Date:** October 14, 2025