title: SWE-Bench Verified Discriminative Subsets Leaderboard
emoji: 🏆
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 5.34.2
app_file: app.py
pinned: false
license: mit
tags:
  - leaderboard
  - software-engineering
  - swe-bench
  - evaluation
  - benchmark

🏆 SWE-Bench Verified Discriminative Subsets Leaderboard

This interactive leaderboard shows how SWE-agents perform on SWE-Bench Verified and on four discriminative subsets designed to evaluate state-of-the-art systems with greater sensitivity.

🎯 Why Discriminative Subsets?

As SWE-agents improve and reach success rates above 70% on the full SWE-Bench Verified benchmark, the full benchmark loses discriminative power. These targeted subsets focus on the most challenging problems to better distinguish between top-tier systems.

📊 The Four Discriminative Subsets

  1. 🔥 Frontier Subset (95 instances): Problems solved by ≤5 agents - maximum evaluative sensitivity

    • Combines unsolved, ultra-rare, and very-rare problems
    • Top agent: 11.6% vs 73.2% on the full benchmark (about 6x better discrimination)
  2. ⚡ Challenging Subset (155 instances): Problems solved by ≤20 agents - strong evaluative power

    • Balances discrimination with statistical significance
    • Includes frontier + rare and uncommon problems
  3. 💪 Hard Subset (45 instances): All Hard difficulty problems regardless of solve rate

    • Traditional difficulty-based evaluation
    • Focuses on problems originally classified as most difficult
  4. 📁 MultiFile Subset (40 instances): Multi-file problems solved by ≤10 agents

    • Targets real-world complexity requiring coordinated edits
    • Even leading agents achieve only a 10% success rate
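
The membership rules above can be sketched as a small function. This is only an illustration of the stated thresholds, not the code used to build the subsets: the `Instance` class and its field names (`solved_by`, `difficulty`, `multi_file`) are hypothetical, and the `"hard"` label is an assumed encoding of the Hard difficulty class.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Instance:
    """Hypothetical record for one benchmark problem (field names are assumptions)."""
    instance_id: str
    solved_by: int    # number of evaluated agents that resolved the problem
    difficulty: str   # assumed label, e.g. "easy", "medium", "hard"
    multi_file: bool  # True if the fix touches more than one file


def subsets_for(inst: Instance) -> List[str]:
    """Return the names of the subsets an instance would fall into."""
    names = []
    if inst.solved_by <= 5:
        names.append("frontier")
    if inst.solved_by <= 20:
        names.append("challenging")  # frontier problems also qualify here
    if inst.difficulty == "hard":
        names.append("hard")  # independent of solve count
    if inst.multi_file and inst.solved_by <= 10:
        names.append("multifile")
    return names


print(subsets_for(Instance("demo-1", solved_by=3, difficulty="hard", multi_file=True)))
print(subsets_for(Instance("demo-2", solved_by=15, difficulty="medium", multi_file=True)))
```

Because the thresholds nest, every Frontier problem is also a Challenging problem, while Hard and MultiFile membership cut across the solve-count bands.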

🔬 Methodology

The subsets were derived from the distribution of solves across 83 evaluated SWE-agents:

  • Problems solved by fewer agents provide better discrimination
  • Analysis covers submissions from October 2023 to May 2025
  • "Solved" means the agent's fix passed the verification test suite
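
A minimal sketch of the solve-distribution step, assuming each agent's results are available as a collection of resolved instance IDs. The input shape and the function names are assumptions made for illustration; the actual analysis pipeline is not shown here.

```python
from typing import Dict, Iterable, List


def solve_counts(all_instance_ids: Iterable[str],
                 resolved_by_agent: Dict[str, Iterable[str]]) -> Dict[str, int]:
    """Count, for every instance, how many agents resolved it."""
    # Start every instance at zero so unsolved problems are kept.
    counts = {iid: 0 for iid in all_instance_ids}
    for resolved in resolved_by_agent.values():
        for iid in set(resolved):  # de-duplicate within one agent
            if iid in counts:      # ignore IDs outside the benchmark
                counts[iid] += 1
    return counts


def solved_by_at_most(counts: Dict[str, int], k: int) -> List[str]:
    """Instances resolved by at most k agents, sorted by ID."""
    return sorted(iid for iid, n in counts.items() if n <= k)


# Toy data with made-up IDs and agent names.
toy_counts = solve_counts(
    ["a", "b", "c"],
    {"agent-1": ["a", "b"], "agent-2": ["a"], "agent-3": ["a", "a"]},
)
print(toy_counts)
print(solved_by_at_most(toy_counts, 1))
```

With the real data, calling `solved_by_at_most` with thresholds of 5 and 20 would correspond to the Frontier and Challenging definitions.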

📈 Key Insights

  • Enhanced Resolution: The Frontier subset provides roughly 6x better discrimination between top systems (top agent: 11.6% vs 73.2% on the full benchmark)
  • Multi-file Complexity: Represents genuine software engineering challenges
  • Statistical Significance: With 155 instances, the Challenging subset supports robust evaluation while keeping strong discrimination
  • Real Progress: Performance on these subsets indicates genuine capability advances
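
One way to read the 6x figure is as the ratio between the top agent's resolve rate on the full benchmark and on the Frontier subset. The trade-off between discrimination and statistical significance can be illustrated with the binomial standard error of a resolve rate. Both are interpretations offered here as a sketch, not the leaderboard's stated formulas, and the function names are hypothetical.

```python
import math


def headroom_ratio(full_rate: float, subset_rate: float) -> float:
    """Ratio of the full-benchmark resolve rate to the subset resolve rate."""
    return full_rate / subset_rate


def resolve_rate_std_error(rate: float, n_instances: int) -> float:
    """Binomial standard error of a resolve rate measured on n_instances problems."""
    return math.sqrt(rate * (1.0 - rate) / n_instances)


# Top-agent figures quoted above: 73.2% on the full benchmark, 11.6% on Frontier.
print(f"headroom ratio: {headroom_ratio(0.732, 0.116):.2f}")

# The same illustrative rate measured on 95 vs 155 instances:
# the larger subset yields the smaller standard error.
for n in (95, 155):
    print(n, f"{resolve_rate_std_error(0.116, n):.4f}")
```

Under this reading, a smaller subset sharpens the contrast between agents but makes each measured rate noisier, which is the balance the Challenging subset aims for.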

🔗 Resources

🚀 Usage

from datasets import load_dataset

# Load specific discriminative subset
frontier = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="frontier")
challenging = load_dataset("jatinganhotra/SWE-bench_Verified-discriminative", split="challenging")
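
Once a subset is loaded, an agent can be scored on it by intersecting the subset's instance IDs with the IDs that agent resolved. The helper below is a hypothetical sketch and works on plain collections of IDs, so it runs without downloading anything.

```python
from typing import Iterable


def subset_resolve_rate(subset_ids: Iterable[str], resolved_ids: Iterable[str]) -> float:
    """Fraction of the subset's instances that appear in the agent's resolved set."""
    subset = set(subset_ids)
    if not subset:
        raise ValueError("subset_ids must contain at least one instance ID")
    return len(subset & set(resolved_ids)) / len(subset)


# Toy example with made-up instance IDs.
rate = subset_resolve_rate(
    subset_ids=["proj__repo-1", "proj__repo-2", "proj__repo-3", "proj__repo-4"],
    resolved_ids=["proj__repo-2", "proj__repo-9"],
)
print(f"{rate:.1%}")
```

With the datasets loaded above, the call would look like `subset_resolve_rate(frontier["instance_id"], my_resolved_ids)`, assuming the splits keep SWE-bench's `instance_id` column and `my_resolved_ids` holds your agent's resolved instance IDs.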

Created by Jatin Ganhotra | Last Updated: June 19, 2025