title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: π
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
- leaderboard
- software-engineering
- swe-bench
- evaluation
- benchmark
SWE-Bench Verified Discriminative Subsets Leaderboard
This interactive leaderboard shows how SWE-agents perform on SWE-Bench Verified and on four discriminative subsets, which are designed to separate state-of-the-art systems more sharply than the full benchmark does.
Why Discriminative Subsets?
As SWE-agents improve and top systems reach success rates above 70% on the full SWE-Bench Verified benchmark, the full-benchmark score loses its power to tell them apart. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.
The Four Discriminative Subsets
Frontier Subset (95 instances): Problems solved by ≤5 agents, for maximum evaluative sensitivity
- Combines unsolved, ultra-rare, and very-rare problems
- Top agent: 11.6% here vs 73.2% on the full benchmark (roughly a 6x gap, which leaves more room to separate top systems)
Challenging Subset (155 instances): Problems solved by ≤20 agents, for strong evaluative power
- Balances discrimination with statistical significance
- Includes the Frontier problems plus rare and uncommon problems
Hard Subset (45 instances): All Hard-difficulty problems, regardless of solve rate
- Traditional difficulty-based evaluation
- Focuses on problems originally classified as most difficult
MultiFile Subset (40 instances): Multi-file problems solved by ≤10 agents
- Targets real-world complexity requiring coordinated edits
- Even leading agents achieve only a 10% success rate
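The headline figures above can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not the author's calculation: it reads "6x" as the ratio of the two quoted top-agent scores (an assumption, since the source does not define the term), and the helper `implied_solved` is a hypothetical name that converts a rounded percentage back into an approximate instance count.

```python
def implied_solved(rate_percent: float, subset_size: int) -> int:
    """Approximate number of resolved instances implied by a rounded percentage."""
    return round(rate_percent / 100 * subset_size)


# Figures quoted in this README for the top agent.
full_rate = 73.2      # percent resolved on full SWE-Bench Verified
frontier_rate = 11.6  # percent resolved on the Frontier subset

# One possible reading of "6x": the ratio between the two scores.
ratio = full_rate / frontier_rate

# Instance counts implied by the quoted rates and subset sizes.
frontier_solved = implied_solved(frontier_rate, 95)
multifile_solved = implied_solved(10.0, 40)

print(f"ratio: {ratio:.1f}x")
print(f"Frontier: about {frontier_solved} of 95 instances")
print(f"MultiFile: about {multifile_solved} of 40 instances")
```

Because the percentages are rounded, the implied counts are estimates rather than reported results.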
Methodology
Subsets were created through systematic analysis of solve distribution across 83 evaluated SWE-agents:
- Problems solved by fewer agents provide better discrimination
- Analysis covers submissions from October 2023 to May 2025
- "Solved" means the agent's fix passed the verification test suite
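The selection rule described above can be sketched as a count-and-threshold step. This is an illustrative sketch under stated assumptions, not the code used to build the published subsets: the input format (a mapping from agent name to the set of instance IDs it resolved), the function names, and the toy data are all hypothetical.

```python
from typing import Dict, List, Set


def solve_counts(results: Dict[str, Set[str]], all_instances: List[str]) -> Dict[str, int]:
    """Count, for every instance, how many agents resolved it."""
    counts = {iid: 0 for iid in all_instances}
    for resolved in results.values():
        for iid in resolved:
            if iid in counts:  # ignore IDs outside the benchmark
                counts[iid] += 1
    return counts


def subset_by_max_solvers(counts: Dict[str, int], max_solvers: int) -> List[str]:
    """Keep instances resolved by at most `max_solvers` agents."""
    return sorted(iid for iid, n in counts.items() if n <= max_solvers)


# Toy data: 3 agents and 4 instances (the real analysis covers 83 agents).
toy_instances = ["a", "b", "c", "d"]
toy_results = {
    "agent1": {"a", "b"},
    "agent2": {"a"},
    "agent3": {"a", "c"},
}

toy_counts = solve_counts(toy_results, toy_instances)
unsolved = subset_by_max_solvers(toy_counts, 0)  # solved by no agent
rare = subset_by_max_solvers(toy_counts, 1)      # solved by at most one agent
print(unsolved, rare)
```

With the thresholds from this README, `max_solvers=5` would correspond to the Frontier rule and `max_solvers=20` to the Challenging rule; the MultiFile subset additionally filters on multi-file problems, which this sketch does not model.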
Key Insights
- Enhanced Resolution: The Frontier subset separates top systems about 6x more sharply than the full benchmark (11.6% vs 73.2% for the top agent)
- Multi-file Complexity: Multi-file problems reflect genuine software engineering work that requires coordinated edits
- Statistical Significance: The Challenging subset (155 instances) is large enough for robust evaluation while still discriminating strongly
- Real Progress: Gains on these subsets are more likely to indicate genuine capability advances
Resources
- Blog Post: From 73% to 11%: Revealing True SWE-Agent Capabilities
- Dataset: SWE-bench_Verified-discriminative
- Original SWE-Bench: SWE-bench.com
Usage
from datasets import load_dataset
# Load specific discriminative subsets by split name
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
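Once a subset is loaded, an agent's score on it can be computed by matching instance IDs. The following is a minimal sketch: it assumes each record exposes an `instance_id` field (as SWE-bench datasets typically do, though this is not confirmed here for this dataset), and `resolve_rate`, `toy_frontier`, and `toy_resolved` are hypothetical names with stand-in data so the example runs without downloading anything.

```python
from typing import Iterable, Mapping, Set


def resolve_rate(records: Iterable[Mapping[str, object]], resolved_ids: Set[str]) -> float:
    """Percentage of records whose instance_id appears in resolved_ids."""
    ids = [str(record["instance_id"]) for record in records]
    if not ids:
        return 0.0
    solved = sum(1 for iid in ids if iid in resolved_ids)
    return 100.0 * solved / len(ids)


# Stand-in records; with the real data, pass the loaded split instead,
# e.g. resolve_rate(frontier, my_resolved_ids).
toy_frontier = [
    {"instance_id": "repo__proj-1"},
    {"instance_id": "repo__proj-2"},
    {"instance_id": "repo__proj-3"},
    {"instance_id": "repo__proj-4"},
]
# IDs an agent resolved; IDs outside the subset are simply ignored.
toy_resolved = {"repo__proj-1", "repo__proj-9"}

toy_rate = resolve_rate(toy_frontier, toy_resolved)
print(f"{toy_rate:.1f}% resolved")
```

Keeping the scoring function independent of how the records were loaded makes it easy to reuse across all four subsets.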
Created by Jatin Ganhotra | Last Updated: June 19, 2025