Commit: add
README.md
CHANGED

@@ -22,7 +22,7 @@ Currently, the leaderboard tracks public GitHub PR review activity across open-s
 
 Most AI coding agent benchmarks rely on human-curated test suites and simulated environments. They're useful, but they don't tell you what happens when an agent participates in real code reviews with real maintainers and real quality standards.
 
-This leaderboard flips that approach. Instead of synthetic tasks, we measure what matters: how many PRs did the agent review? What percentage of those reviews led to
+This leaderboard flips that approach. Instead of synthetic tasks, we measure what matters: how many PRs did the agent review? What percentage of those reviews led to merged PRs? What percentage were rejected? These are the signals that reflect genuine code review quality - the kind you'd expect from a human reviewer.
 
 If an agent can consistently provide valuable reviews that help maintainers accept quality PRs across different projects, that tells you something no benchmark can.
 

@@ -32,9 +32,9 @@ The leaderboard pulls data directly from GitHub's PR review history and shows yo
 
 **Leaderboard Table**
 - **Total Reviews**: How many PR reviews the agent has made in the last 6 months
-- **
+- **Merged PRs**: How many PRs reviewed by the agent were merged
 - **Rejected PRs**: How many PRs reviewed by the agent were rejected/closed without merging
-- **Acceptance Rate**: Percentage of reviewed PRs that were
+- **Acceptance Rate**: Percentage of reviewed PRs that were merged (see calculation details below)
 
 **Monthly Trends Visualization**
 Beyond the table, we show interactive charts tracking how each agent's performance evolves month-by-month:

@@ -57,7 +57,7 @@ We search GitHub using the PR and review search APIs to track all reviews associ
 
 **Review Outcome Tracking**
 For each PR reviewed by an agent, we determine its status:
-1. **
+1. **Merged**: PR was merged into the repository
 2. **Rejected**: PR was closed without being merged
 3. **Pending**: PR is still open and under review
 

@@ -89,13 +89,13 @@ Click Submit. We'll validate the GitHub account, fetch the PR review history, an
 
 ## Understanding the Metrics
 
-**Total Reviews vs
+**Total Reviews vs Merged/Rejected PRs**
-Not every PR will be
+Not every PR will be merged. PRs may be rejected due to bugs, insufficient quality, conflicts with project goals, or other reasons. The acceptance and rejection rates help you understand how effective an agent's reviews are at identifying quality contributions.
 
 **Acceptance Rate**
-This is the percentage of reviewed PRs that were ultimately
+This is the percentage of reviewed PRs that were ultimately merged, calculated as:
 
-Acceptance Rate =
+Acceptance Rate = Merged PRs ÷ (Merged PRs + Rejected PRs) × 100%
 
 Note: Pending PRs (still open) are excluded from this calculation to ensure we only measure completed review outcomes.
 
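The Acceptance Rate formula in the README above can be sketched in a few lines of Python. This is an illustrative helper, not code from the repo; note that pending PRs simply never enter either argument, so they cannot affect the rate:

```python
def acceptance_rate(merged_prs: int, rejected_prs: int) -> float:
    """Acceptance Rate = Merged PRs / (Merged PRs + Rejected PRs) * 100.

    Pending (still-open) PRs are excluded by construction: they are
    neither merged nor rejected, so they appear in neither argument.
    """
    completed = merged_prs + rejected_prs
    if completed == 0:
        return 0.0  # no completed review outcomes yet
    return round(merged_prs / completed * 100, 2)

# 8 merged, 2 rejected (pending PRs ignored) -> 80.0
print(acceptance_rate(8, 2))
```

The zero-completed guard mirrors the diff below, which also falls back to a neutral value rather than dividing by zero.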
app.py
CHANGED

@@ -53,8 +53,7 @@ LEADERBOARD_COLUMNS = [
     ("Agent Name", "string"),
     ("Website", "string"),
     ("Total Reviews", "number"),
-    ("
-    ("Rejected PRs", "number"),
+    ("Merged PRs", "number"),
     ("Acceptance Rate (%)", "number"),
 ]
 

@@ -451,10 +450,10 @@ def extract_review_metadata(pr):
 
     PR status:
     - pr_status: 'open', 'merged', or 'closed'
-    - pr_merged: True if PR was merged
+    - pr_merged: True if PR was merged, False otherwise
     - pr_closed_at: Date when PR was closed/merged (if applicable)
 
-
+    Merged PR = PR that was merged after agent review
     Rejected PR = PR that was closed without merging after agent review
     """
     # Extract PR metadata from search results

@@ -721,16 +720,16 @@ def calculate_review_stats_from_metadata(metadata_list):
     Returns a dictionary with comprehensive review metrics.
 
     Acceptance Rate is calculated as:
-
+        merged PRs / (merged PRs + rejected PRs) * 100
 
-
+    Merged PRs = PRs that were merged (pr_status='merged')
     Rejected PRs = PRs that were closed without merging (pr_status='closed')
     Pending PRs = PRs still open (pr_status='open') - excluded from acceptance rate
     """
     total_reviews = len(metadata_list)
 
-    # Count
-
+    # Count merged PRs (merged)
+    merged_prs = sum(1 for review_meta in metadata_list
                      if review_meta.get('pr_status') == 'merged')
 
     # Count rejected PRs (closed without merging)

@@ -742,13 +741,12 @@ def calculate_review_stats_from_metadata(metadata_list):
                       if review_meta.get('pr_status') == 'open')
 
     # Calculate acceptance rate (exclude pending PRs)
-    completed_prs =
-    acceptance_rate = (
+    completed_prs = merged_prs + rejected_prs
+    acceptance_rate = (merged_prs / completed_prs * 100) if completed_prs > 0 else 0
 
     return {
         'total_reviews': total_reviews,
-        '
-        'rejected_prs': rejected_prs,
+        'merged_prs': merged_prs,
         'pending_prs': pending_prs,
         'acceptance_rate': round(acceptance_rate, 2),
     }

@@ -767,8 +765,7 @@ def calculate_monthly_metrics_by_agent():
         agent_name: {
             'acceptance_rates': list of acceptance rates by month,
             'total_reviews': list of review counts by month,
-            '
-            'rejected_prs': list of rejected PR counts by month
+            'merged_prs': list of merged PR counts by month,
         }
     }
 }

@@ -820,14 +817,13 @@ def calculate_monthly_metrics_by_agent():
     for agent_name, month_dict in agent_month_data.items():
         acceptance_rates = []
         total_reviews_list = []
-
-        rejected_prs_list = []
+        merged_prs_list = []
 
         for month in months:
             reviews_in_month = month_dict.get(month, [])
 
-            # Count
-
+            # Count merged PRs (merged)
+            merged_count = sum(1 for review in reviews_in_month
                                if review.get('pr_status') == 'merged')
 
             # Count rejected PRs (closed without merging)

@@ -838,19 +834,17 @@ def calculate_monthly_metrics_by_agent():
             total_count = len(reviews_in_month)
 
             # Calculate acceptance rate (exclude pending PRs)
-            completed_count =
-            acceptance_rate = (
+            completed_count = merged_count + rejected_count
+            acceptance_rate = (merged_count / completed_count * 100) if completed_count > 0 else None
 
             acceptance_rates.append(acceptance_rate)
             total_reviews_list.append(total_count)
-
-            rejected_prs_list.append(rejected_count)
+            merged_prs_list.append(merged_count)
 
         result_data[agent_name] = {
             'acceptance_rates': acceptance_rates,
             'total_reviews': total_reviews_list,
-            '
-            'rejected_prs': rejected_prs_list
+            'merged_prs': merged_prs_list,
         }
 
     return {

@@ -1861,8 +1855,7 @@ def get_leaderboard_dataframe():
             data.get('agent_name', 'Unknown'),
             data.get('website', 'N/A'),
             data.get('total_reviews', 0),
-            data.get('
-            data.get('rejected_prs', 0),
+            data.get('merged_prs', 0),
             data.get('acceptance_rate', 0.0),
         ])
 

@@ -1871,7 +1864,7 @@ def get_leaderboard_dataframe():
     df = pd.DataFrame(rows, columns=column_names)
 
     # Ensure numeric types
-    numeric_cols = ["Total Reviews", "
+    numeric_cols = ["Total Reviews", "Merged PRs", "Acceptance Rate (%)"]
     for col in numeric_cols:
         if col in df.columns:
             df[col] = pd.to_numeric(df[col], errors='coerce').fillna(0)
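As a rough sketch of what `calculate_review_stats_from_metadata` now computes, assuming metadata entries shaped like `{'pr_status': ...}` as the docstring describes (this is a standalone illustration, not the repo's actual function):

```python
def review_stats(metadata_list):
    """Summarize review outcomes, mirroring the logic in the diff above."""
    # Bucket reviews by PR outcome based on the pr_status field
    merged = sum(1 for m in metadata_list if m.get('pr_status') == 'merged')
    rejected = sum(1 for m in metadata_list if m.get('pr_status') == 'closed')
    pending = sum(1 for m in metadata_list if m.get('pr_status') == 'open')

    # Pending PRs are excluded: only completed outcomes feed the rate
    completed = merged + rejected
    rate = (merged / completed * 100) if completed > 0 else 0

    return {
        'total_reviews': len(metadata_list),
        'merged_prs': merged,
        'rejected_prs': rejected,
        'pending_prs': pending,
        'acceptance_rate': round(rate, 2),
    }

sample = [{'pr_status': 'merged'}, {'pr_status': 'merged'},
          {'pr_status': 'closed'}, {'pr_status': 'open'}]
print(review_stats(sample))
```

With two merged, one rejected, and one pending review, the rate is 2/3 of completed outcomes, so the open PR changes total_reviews but not acceptance_rate.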
msr.py
CHANGED

@@ -449,10 +449,10 @@ def extract_review_metadata(pr):
 
     PR status:
     - pr_status: 'open', 'merged', or 'closed'
-    - pr_merged: True if PR was merged
+    - pr_merged: True if PR was merged, False otherwise
     - pr_closed_at: Date when PR was closed/merged (if applicable)
 
-
+    Merged PR = PR that was merged after agent review
     Rejected PR = PR that was closed without merging after agent review
     """
    # Extract PR metadata from search results

@@ -1050,16 +1050,16 @@ def calculate_review_stats_from_metadata(metadata_list):
     Returns a dictionary with comprehensive review metrics.
 
     Acceptance Rate is calculated as:
-
+        merged PRs / (merged PRs + rejected PRs) * 100
 
-
+    Merged PRs = PRs that were merged (pr_status='merged')
     Rejected PRs = PRs that were closed without merging (pr_status='closed')
     Pending PRs = PRs still open (pr_status='open') - excluded from acceptance rate
     """
     total_reviews = len(metadata_list)
 
-    # Count
-
+    # Count merged PRs (merged)
+    merged_prs = sum(1 for review_meta in metadata_list
                      if review_meta.get('pr_status') == 'merged')
 
     # Count rejected PRs (closed without merging)

@@ -1071,12 +1071,12 @@ def calculate_review_stats_from_metadata(metadata_list):
                       if review_meta.get('pr_status') == 'open')
 
     # Calculate acceptance rate (exclude pending PRs)
-    completed_prs =
-    acceptance_rate = (
+    completed_prs = merged_prs + rejected_prs
+    acceptance_rate = (merged_prs / completed_prs * 100) if completed_prs > 0 else 0
 
     return {
         'total_reviews': total_reviews,
-        '
+        'merged_prs': merged_prs,
         'rejected_prs': rejected_prs,
         'pending_prs': pending_prs,
         'acceptance_rate': round(acceptance_rate, 2),