Upload README.md with huggingface_hub
README.md
CHANGED
|
@@ -89,6 +89,44 @@ Extractor (structured Aggregator
|
|
| 89 |
- **126 synthetically generated known-bad** patches for validation
|
| 90 |
- Features extracted for all examples
|
| 91 |
| 92 |
## Quick Start
|
| 93 |
|
| 94 |
```python
|
|
|
|
| 89 |
- **126 synthetically generated known-bad** patches for validation
|
| 90 |
- Features extracted for all examples
|
| 91 |
|
| 92 |
+
## Evaluation Results (v1)
|
| 93 |
+
|
| 94 |
+
We evaluated 72 patches from SWE-bench Verified, using Qwen2.5-Coder-32B-Instruct as the judge model.
|
| 95 |
+
|
| 96 |
+
### Score Distribution
|
| 97 |
+
| Metric | Value |
|
| 98 |
+
|--------|-------|
|
| 99 |
+
| Mean MergeScore | **50.6/100** |
|
| 100 |
+
| Median MergeScore | **49.5/100** |
|
| 101 |
+
| Std Dev | 13.8 |
|
| 102 |
+
| Score range | 23.0 – 80.5 |
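The summary statistics in this table can be recomputed from a list of per-patch MergeScores. The snippet below is a minimal sketch: `summarize_scores` and the example scores are hypothetical, and it assumes the sample standard deviation (n - 1), which the README does not specify.

```python
import statistics


def summarize_scores(scores):
    """Summarize a list of MergeScores (0-100) the way the table does."""
    return {
        "mean": round(statistics.mean(scores), 1),
        "median": round(statistics.median(scores), 1),
        # Assumption: sample standard deviation (n - 1 in the denominator).
        "std_dev": round(statistics.stdev(scores), 1),
        "min": min(scores),
        "max": max(scores),
    }


# Made-up scores for illustration only, not the 72 evaluated patches.
example_scores = [23.0, 41.5, 49.5, 58.0, 80.5]
summary = summarize_scores(example_scores)
print(summary)
```

If the reported value uses the population standard deviation instead, swap `statistics.stdev` for `statistics.pstdev`.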
|
| 103 |
+
|
| 104 |
+
### METR Alignment ✅
|
| 105 |
+
- **50% of test-passing patches scored below 50**, in line with the METR finding that roughly 50% of test-passing PRs are not merge-worthy
|
| 106 |
+
- Test-passing mean: 50.9; test-failing mean: 42.5
|
| 107 |
+
- Resolved patches average 8.4 points higher than unresolved patches (50.9 vs. 42.5)
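The figures in this list can be derived from per-patch results along these lines. This is a sketch under assumptions: the record fields `score` and `tests_pass` and the function `alignment_summary` are hypothetical names, and the example records are invented. Only the threshold of 50 comes from the text above.

```python
from statistics import mean


def alignment_summary(results, threshold=50.0):
    """Compare MergeScores of test-passing and test-failing patches."""
    passing = [r["score"] for r in results if r["tests_pass"]]
    failing = [r["score"] for r in results if not r["tests_pass"]]
    below = sum(1 for s in passing if s < threshold)
    return {
        # Share of test-passing patches the judge scores below the threshold.
        "passing_below_threshold": below / len(passing),
        "passing_mean": mean(passing),
        "failing_mean": mean(failing),
    }


# Invented records for illustration only.
example = [
    {"score": 62.0, "tests_pass": True},
    {"score": 44.0, "tests_pass": True},
    {"score": 38.0, "tests_pass": False},
    {"score": 47.0, "tests_pass": False},
]
summary = alignment_summary(example)
print(summary)
```

The function expects at least one patch in each group; an empty group would raise an error rather than return a misleading zero.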
|
| 108 |
+
|
| 109 |
+
### Per-Dimension Averages (0-10 scale)
|
| 110 |
+
| Dimension | Mean | Std |
|
| 111 |
+
|-----------|------|-----|
|
| 112 |
+
| Correctness | 5.8 | 1.9 |
|
| 113 |
+
| Completeness | 4.3 | 1.3 |
|
| 114 |
+
| Code Quality | 5.1 | 1.8 |
|
| 115 |
+
| Non-Regression Risk | 5.2 | 1.8 |
|
| 116 |
+
| Merge-Readiness | 4.5 | 1.7 |
|
| 117 |
+
|
| 118 |
+
### Per-Agent Comparison
|
| 119 |
+
| Agent | Mean MergeScore | Patches |
|
| 120 |
+
|-------|----------------|---------|
|
| 121 |
+
| CoderForge (Qwen3-32B) | 49.9 | 52 |
|
| 122 |
+
| OpenHands+O1 | 52.5 | 20 |
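As a consistency check, the per-agent means weighted by patch count should reproduce the overall mean MergeScore. The sketch below uses only the numbers from the table above; the variable names are illustrative.

```python
# (mean MergeScore, number of patches) per agent, copied from the table.
agents = {
    "CoderForge (Qwen3-32B)": (49.9, 52),
    "OpenHands+O1": (52.5, 20),
}

total_patches = sum(n for _, n in agents.values())
# Weight each agent's mean by how many patches it contributed.
pooled_mean = sum(m * n for m, n in agents.values()) / total_patches

print(total_patches, round(pooled_mean, 1))
```

The pooled value covers all 72 patches and rounds to the 50.6 reported under Score Distribution.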
|
| 123 |
+
|
| 124 |
+
### Known-Bad Detection
|
| 125 |
+
In earlier testing, the judge correctly identified known-bad patterns:
|
| 126 |
+
- **no-op patch** (only adds `pass`): scored 18.5/100
|
| 127 |
+
- **broad try/except** patches: flagged as low quality
|
| 128 |
+
- **hardcoded returns**: flagged as non-genuine fixes
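To make the three patterns concrete, here is a small static check written with Python's `ast` module. It is not how the judge works (the judge is an LLM); it is a separate, simplified heuristic, and `known_bad_flags` and its flag names are hypothetical.

```python
import ast


def known_bad_flags(source):
    """Return heuristic flags for a snippet of patched Python code."""
    flags = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            body = node.body
            # No-op: the function body consists only of `pass`.
            if all(isinstance(stmt, ast.Pass) for stmt in body):
                flags.add("noop")
            # Hardcoded return: the whole body is `return <literal>`.
            if (
                len(body) == 1
                and isinstance(body[0], ast.Return)
                and isinstance(body[0].value, ast.Constant)
            ):
                flags.add("hardcoded_return")
        if isinstance(node, ast.ExceptHandler):
            # Broad handler: bare `except:` or `except Exception/BaseException:`.
            broad = node.type is None or (
                isinstance(node.type, ast.Name)
                and node.type.id in {"Exception", "BaseException"}
            )
            # Only flag handlers that silently swallow the error.
            if broad and all(isinstance(stmt, ast.Pass) for stmt in node.body):
                flags.add("broad_except")
    return flags


swallowing = "def f(x):\n    try:\n        return g(x)\n    except Exception:\n        pass\n"
print(known_bad_flags("def f():\n    pass\n"))
print(known_bad_flags("def f(x):\n    return 42\n"))
print(known_bad_flags(swallowing))
```

A check like this only sees syntax, so it misses bad patches that look structurally normal and can flag legitimate code such as a function that is meant to return a constant.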
|
| 129 |
+
|
| 130 |
## Quick Start
|
| 131 |
|
| 132 |
```python
|