VD10/PatchJudge

@@ -89,6 +89,44 @@ Extractor       (structured      Aggregator
 - **126 synthetically generated known-bad** patches for validation
 - Features extracted for all examples
+## Evaluation Results (v1)
+We evaluated 72 patches from SWE-bench Verified, using Qwen2.5-Coder-32B-Instruct as the judge model.
+### Score Distribution
+| Metric | Value |
+|--------|-------|
+| Mean MergeScore | **50.6/100** |
+| Median MergeScore | **49.5/100** |
+| Std Dev | 13.8 |
+| Score range | 23.0 – 80.5 |
+### METR Alignment ✅
+- **50% of test-passing patches scored below 50**, in line with the METR finding that roughly 50% of test-passing PRs are not merge-worthy
+- Test-passing mean: 50.9; test-failing mean: 42.5
+- Resolved (test-passing) patches therefore scored 8.4 points higher on average than unresolved ones
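The two headline numbers in this subsection (the share of test-passing patches below 50, and the per-outcome means) can be computed from per-patch results roughly as follows. This is a minimal sketch: the `records` list, its values, and the function names `below_threshold_rate` and `mean_by_outcome` are hypothetical and are not taken from the PatchJudge code or from the evaluation data.

```python
from statistics import mean

# Hypothetical per-patch results: (merge_score, tests_passed).
# Illustrative values only, NOT the 72 evaluated patches.
records = [
    (62.0, True),
    (48.5, True),
    (35.0, False),
    (71.5, True),
    (44.0, True),
    (50.0, False),
]


def below_threshold_rate(records, threshold=50.0):
    """Fraction of test-passing patches scoring strictly below `threshold`."""
    passing = [score for score, passed in records if passed]
    if not passing:
        raise ValueError("no test-passing patches in records")
    return sum(score < threshold for score in passing) / len(passing)


def mean_by_outcome(records):
    """Return (mean score of test-passing, mean score of test-failing)."""
    passing = [score for score, passed in records if passed]
    failing = [score for score, passed in records if not passed]
    return mean(passing), mean(failing)


rate = below_threshold_rate(records)
pass_mean, fail_mean = mean_by_outcome(records)
print(f"below-50 rate among test-passing: {rate:.0%}")
print(f"test-passing mean: {pass_mean}, test-failing mean: {fail_mean}")
```

The threshold comparison is strict (`<`), so a patch scoring exactly 50 would not count as below 50; whether the reported figure uses the same convention is an assumption.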
+### Per-Dimension Averages (0-10 scale)
+| Dimension | Mean | Std |
+|-----------|------|-----|
+| Correctness | 5.8 | 1.9 |
+| Completeness | 4.3 | 1.3 |
+| Code Quality | 5.1 | 1.8 |
+| Non-Regression Risk | 5.2 | 1.8 |
+| Merge-Readiness | 4.5 | 1.7 |
+### Per-Agent Comparison
+| Agent | Mean MergeScore | Patches |
+|-------|----------------|---------|
+| CoderForge (Qwen3-32B) | 49.9 | 52 |
+| OpenHands+O1 | 52.5 | 20 |
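As a consistency check, the per-agent rows can be pooled into a patch-weighted mean and compared with the overall figures reported earlier (72 patches, mean MergeScore 50.6). The sketch below uses only the numbers in the table; the variable names are illustrative.

```python
# Rows from the Per-Agent Comparison table: (mean MergeScore, number of patches).
per_agent = {
    "CoderForge (Qwen3-32B)": (49.9, 52),
    "OpenHands+O1": (52.5, 20),
}

# Total number of patches across agents.
total_patches = sum(count for _, count in per_agent.values())

# Patch-weighted mean: each agent's mean counts in proportion to its patch count.
pooled_mean = sum(m * count for m, count in per_agent.values()) / total_patches

print(total_patches)          # 72
print(round(pooled_mean, 1))  # 50.6
```

Both values agree with the Score Distribution table, so the per-agent breakdown covers the same 72 patches.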
+### Known-Bad Detection
+In earlier testing, the judge correctly identified known-bad patterns:
+- **noop patch** (just adds `pass`): 18.5/100
+- **broad try/except** patches: flagged as low quality
+- **hardcoded returns**: flagged as non-genuine fixes
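For illustration only, the three patterns listed above can also be approximated with a static check over Python source. This is not how the judge works (the judge is an LLM, per the evaluation setup above); it is a hypothetical pre-filter sketch built on the standard-library `ast` module, and the function name `find_suspicious_patterns` and the finding labels are assumptions introduced here.

```python
import ast


def find_suspicious_patterns(source: str) -> list[str]:
    """Return labels for known-bad patterns found in Python source (heuristic)."""
    findings = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            body = node.body
            # Ignore a leading docstring when judging what the function does.
            if (body and isinstance(body[0], ast.Expr)
                    and isinstance(body[0].value, ast.Constant)
                    and isinstance(body[0].value.value, str)):
                body = body[1:]
            if all(isinstance(stmt, ast.Pass) for stmt in body):
                # Function body is only `pass` (or only a docstring).
                findings.append(f"noop:{node.name}")
            elif (len(body) == 1 and isinstance(body[0], ast.Return)
                  and isinstance(body[0].value, ast.Constant)):
                # Function does nothing except return a literal.
                findings.append(f"hardcoded-return:{node.name}")
        elif isinstance(node, ast.ExceptHandler):
            # Bare `except:` or `except Exception/BaseException:` ...
            catches_all = node.type is None or (
                isinstance(node.type, ast.Name)
                and node.type.id in {"Exception", "BaseException"}
            )
            # ... whose handler silently discards the error.
            swallows = all(isinstance(stmt, ast.Pass) for stmt in node.body)
            if catches_all and swallows:
                findings.append("broad-except")
    return findings


sample = '''
def fix_bug(x):
    pass

def compute(x):
    return 42

def load(path):
    try:
        return open(path).read()
    except Exception:
        pass
'''

findings = find_suspicious_patterns(sample)
print(findings)
```

A check like this only sees syntax, so it will miss subtler variants (for example a hardcoded value assigned to a variable first) and can flag legitimate code such as constant-returning stubs; it would complement, not replace, a model-based judgment.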
 ## Quick Start
 ```python