Commit b7b0ad0 by VD10 · verified · 1 parent: 7c47838

Upload README.md with huggingface_hub

Files changed (1)
  1. README.md +38 -0
README.md CHANGED
@@ -89,6 +89,44 @@ Extractor (structured Aggregator
  - **126 synthetically generated known-bad** patches for validation
  - Features extracted for all examples
 
+ ## Evaluation Results (v1)
+
+ Evaluated 72 patches from SWE-bench Verified using Qwen2.5-Coder-32B-Instruct as the judge model:
+
+ ### Score Distribution
+ | Metric | Value |
+ |--------|-------|
+ | Mean MergeScore | **50.6/100** |
+ | Median MergeScore | **49.5/100** |
+ | Std Dev | 13.8 |
+ | Score range | 23.0 – 80.5 |
+
+ ### METR Alignment ✅
+ - **50% of test-passing patches scored below 50**, in line with the METR finding that ~50% of test-passing PRs are not merge-worthy
+ - Test-passing mean: 50.9; test-failing mean: 42.5
+ - Resolved patches scored 8.4 points higher on average than unresolved ones
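The three figures above can be reproduced from per-patch results with a few lines of Python. The sketch below is illustrative only: the record layout (`merge_score`, `tests_passed`), the function name, and the sample records are assumptions, not the project's actual data format or evaluation code.

```python
from statistics import mean


def summarize_by_test_outcome(records, threshold=50.0):
    """Split MergeScores by test outcome and summarize each group."""
    passing = [r["merge_score"] for r in records if r["tests_passed"]]
    failing = [r["merge_score"] for r in records if not r["tests_passed"]]
    # Count test-passing patches that the judge still scored below the threshold.
    below = sum(1 for score in passing if score < threshold)
    return {
        "share_passing_below_threshold": below / len(passing) if passing else None,
        "passing_mean": mean(passing) if passing else None,
        "failing_mean": mean(failing) if failing else None,
    }


# Made-up records for illustration; these are NOT the 72 evaluated patches.
example_records = [
    {"merge_score": 62.0, "tests_passed": True},
    {"merge_score": 41.5, "tests_passed": True},
    {"merge_score": 48.0, "tests_passed": True},
    {"merge_score": 55.5, "tests_passed": True},
    {"merge_score": 30.0, "tests_passed": False},
    {"merge_score": 45.0, "tests_passed": False},
]

summary = summarize_by_test_outcome(example_records)
print(summary)
```

Reporting the group sizes alongside the two means would make the 8.4-point gap easier to interpret, since a mean over very few test-failing patches is noisy.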
108
+
109
+ ### Per-Dimension Averages (0-10 scale)
110
+ | Dimension | Mean | Std |
111
+ |-----------|------|-----|
112
+ | Correctness | 5.8 | 1.9 |
113
+ | Completeness | 4.3 | 1.3 |
114
+ | Code Quality | 5.1 | 1.8 |
115
+ | Non-Regression Risk | 5.2 | 1.8 |
116
+ | Merge-Readiness | 4.5 | 1.7 |
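This excerpt does not show how the five dimension scores are combined into a MergeScore. As a rough sanity check, the sketch below assumes an unweighted mean rescaled to 0-100. That aggregation rule is an assumption made here for illustration, not the documented one.

```python
# Per-dimension means copied from the table above (0-10 scale).
dimension_means = {
    "Correctness": 5.8,
    "Completeness": 4.3,
    "Code Quality": 5.1,
    "Non-Regression Risk": 5.2,
    "Merge-Readiness": 4.5,
}

# ASSUMPTION: equal weights, rescaled from 0-10 to 0-100.
unweighted_score = 10 * sum(dimension_means.values()) / len(dimension_means)
print(round(unweighted_score, 1))
```

Under this assumption the result is close to, but not identical to, the reported mean MergeScore of 50.6. That suggests the real aggregation weights the dimensions unequally or includes other terms.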
+
+ ### Per-Agent Comparison
+ | Agent | Mean MergeScore | Patches |
+ |-------|----------------|---------|
+ | CoderForge (Qwen3-32B) | 49.9 | 52 |
+ | OpenHands+O1 | 52.5 | 20 |
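The per-agent rows can be cross-checked against the overall mean in the Score Distribution table by pooling them with patch counts as weights. This is a consistency check on the reported numbers only; the variable names are illustrative.

```python
# (mean MergeScore, number of patches) copied from the table above.
agents = {
    "CoderForge (Qwen3-32B)": (49.9, 52),
    "OpenHands+O1": (52.5, 20),
}

total_patches = sum(count for _, count in agents.values())
# Patch-weighted mean: each agent's mean counts in proportion to its patch count.
pooled_mean = sum(score * count for score, count in agents.values()) / total_patches

print(total_patches, round(pooled_mean, 1))
```

The counts sum to the 72 evaluated patches, and the pooled mean rounds to the reported overall mean of 50.6.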
+
+ ### Known-Bad Detection
+ In earlier testing, the judge correctly identified known-bad patterns:
+ - **noop patch** (just adds `pass`): 18.5/100
+ - **broad try/except** patches: flagged as low quality
+ - **hardcoded returns**: flagged as non-genuine fixes
+
  ## Quick Start
 
  ```python