title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
  - leaderboard
  - software-engineering
  - swe-bench
  - evaluation
  - benchmark

🏆 SWE-Bench Verified Discriminative Subsets Leaderboard

This interactive leaderboard shows SWE-agent performance on SWE-Bench Verified and on four discriminative subsets designed to give finer evaluation sensitivity for state-of-the-art systems.

🎯 Why Discriminative Subsets?

As SWE-agents improve and top systems reach 70%+ success rates on the full SWE-Bench Verified benchmark, the full benchmark loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

📊 The Four Discriminative Subsets

🔥 Frontier Subset (95 instances): Problems solved by ≤5 agents, for maximum evaluative sensitivity
- Combines unsolved, ultra-rare, and very-rare problems
- Top agent: 11.6% here vs. 73.2% on the full benchmark (6x better discrimination)
⚡ Challenging Subset (155 instances): Problems solved by ≤20 agents, for strong evaluative power
- Balances discrimination with statistical significance
- Includes the Frontier subset plus rare and uncommon problems
💪 Hard Subset (45 instances): All Hard-difficulty problems, regardless of solve rate
- Traditional difficulty-based evaluation
- Focuses on problems originally classified as most difficult
📁 MultiFile Subset (40 instances): Multi-file problems solved by ≤10 agents
- Targets real-world complexity requiring coordinated edits
- Even leading agents achieve only a 10% success rate
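
The membership rules above can be expressed as a small classifier. The sketch below is illustrative only: the `Instance` record, its field names, and the difficulty label `"hard"` are assumptions made for the example, not the schema used to build the published subsets. Only the thresholds (≤5, ≤20, and ≤10 agents) come from the descriptions above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """Hypothetical per-problem record; field names are assumptions."""
    instance_id: str
    solve_count: int    # number of evaluated agents that solved the problem
    difficulty: str     # assumed label, e.g. "easy", "medium", "hard"
    files_changed: int  # assumed: files touched by the reference fix


def subsets_for(inst: Instance) -> set[str]:
    """Return the names of the discriminative subsets an instance falls into."""
    names: set[str] = set()
    if inst.solve_count <= 5:
        names.add("frontier")
    if inst.solve_count <= 20:
        names.add("challenging")  # contains every frontier instance
    if inst.difficulty == "hard":
        names.add("hard")         # independent of solve rate
    if inst.files_changed > 1 and inst.solve_count <= 10:
        names.add("multifile")
    return names


# Toy data, invented for illustration.
examples = [
    Instance("demo-1", solve_count=0, difficulty="hard", files_changed=3),
    Instance("demo-2", solve_count=12, difficulty="medium", files_changed=1),
    Instance("demo-3", solve_count=60, difficulty="hard", files_changed=2),
]
for inst in examples:
    print(inst.instance_id, sorted(subsets_for(inst)))
```

Because the subsets overlap, a single problem can count toward several of them; the Hard subset is the only one that ignores solve counts entirely.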

🔬 Methodology

Subsets were created through a systematic analysis of the solve distribution across 83 evaluated SWE-agents:

- Problems solved by fewer agents provide better discrimination
- The analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite
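
The solve distribution described above can be sketched as a per-problem count of agents whose fix passed. This is a minimal illustration, assuming each agent's results are available as a set of resolved instance IDs; the agent names and IDs below are invented, and the actual analysis pipeline may differ.

```python
from collections import Counter


def solve_counts(resolved_by_agent: dict[str, set[str]],
                 all_instances: list[str]) -> dict[str, int]:
    """Count, for every benchmark instance, how many agents solved it."""
    known = set(all_instances)
    counts: Counter = Counter()
    for resolved in resolved_by_agent.values():
        # Ignore IDs that are not part of the benchmark.
        counts.update(resolved & known)
    # Instances no agent solved get an explicit count of 0.
    return {iid: counts[iid] for iid in all_instances}


# Invented results for three hypothetical agents.
resolved_by_agent = {
    "agent-a": {"p1", "p2"},
    "agent-b": {"p1"},
    "agent-c": {"p1", "p3"},
}
counts = solve_counts(resolved_by_agent, ["p1", "p2", "p3", "p4"])
print(counts)

# Problems solved by at most one agent are the most discriminative here.
rarely_solved = [iid for iid, n in counts.items() if n <= 1]
print(rarely_solved)
```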

📈 Key Insights

- Enhanced Resolution: The Frontier subset provides 6x better discrimination between top systems
- Multi-file Complexity: Represents genuine software engineering challenges
- Statistical Significance: The Challenging subset offers robust evaluation with strong discrimination
- Real Progress: Performance on these subsets indicates genuine capability advances
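
The 6x figure is consistent with the ratio of the top agent's two scores quoted earlier (73.2% on the full benchmark vs. 11.6% on the Frontier subset). The short calculation below reproduces that ratio and also uses a normal-approximation standard error to suggest why smaller subsets trade some statistical precision for sensitivity. Reading "6x" as this ratio is an interpretation, and the 500-instance size of the full benchmark is the published size of SWE-bench Verified rather than a number stated in this README.

```python
import math

full_rate, full_n = 0.732, 500         # top agent on the full benchmark
frontier_rate, frontier_n = 0.116, 95  # same agent on the Frontier subset

# How many times lower the top score is on the Frontier subset.
ratio = full_rate / frontier_rate
print(f"score ratio: {ratio:.1f}x")


def standard_error(p: float, n: int) -> float:
    """Normal-approximation standard error of a success rate p over n problems."""
    return math.sqrt(p * (1.0 - p) / n)


for name, p, n in [("full", full_rate, full_n),
                   ("frontier", frontier_rate, frontier_n)]:
    print(f"{name}: {p:.1%} +/- {standard_error(p, n):.1%} (n={n})")
```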

🔗 Resources

- Blog Post: From 73% to 11%: Revealing True SWE-Agent Capabilities
- Dataset: SWE-bench_Verified-discriminative
- Original SWE-Bench: SWE-bench.com

🚀 Usage

from datasets import load_dataset

# Load specific discriminative subsets
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
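
Once a subset is loaded, an agent's score on it is the share of subset instances it resolved. The helper below is a hypothetical sketch, not part of the leaderboard code; it assumes the subset exposes SWE-bench's `instance_id` column, and the IDs in the example are made up.

```python
from typing import Iterable


def resolve_rate(subset_ids: Iterable[str], resolved_ids: Iterable[str]) -> float:
    """Percentage of subset instances that appear in the agent's resolved set."""
    subset = set(subset_ids)
    if not subset:
        raise ValueError("subset is empty")
    return 100.0 * len(subset & set(resolved_ids)) / len(subset)


# Made-up IDs; with the real data one would pass frontier["instance_id"]
# (assuming that column exists in the loaded split).
subset_ids = ["repo__proj-1", "repo__proj-2", "repo__proj-3", "repo__proj-4"]
resolved_ids = ["repo__proj-2", "repo__proj-9"]
print(f"{resolve_rate(subset_ids, resolved_ids):.1f}%")
```

Resolved IDs outside the subset are ignored, so the same list of an agent's resolved instances can be scored against each subset in turn.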

Created by Jatin Ganhotra | Last Updated: June 19, 2025