| ====================================================================== |
| PatchJudge Validation Report |
| ====================================================================== |
|
|
| 📊 Dataset: 160 examples |
|
|
| 📈 Score Distribution: |
| Mean: 22.8 |
| Median: 0.0 |
| Std: 26.9 |
|
|
| Score Histogram (patches per bin): |
| 0-10: ████████████████████████████████████████████████████████████████████████████████████████ (88) |
| 10-20: (0) |
| 20-30: ████████ (8) |
| 30-40: ██████ (6) |
| 40-50: ███████████████████████ (23) |
| 50-60: ███████████████ (15) |
| 60-70: ██████████████ (14) |
| 70-80: █████ (5) |
| 80-90: █ (1) |
| 90-100: (0) |
|
|
| 🎯 METR Alignment: |
| Test-passing patches scoring below 50.0: 65.0% |
| ⚠️ Too harsh: too many test-passing patches are scored as not merge-worthy |
|
|
| 🔀 Resolved vs Unresolved Separation: |
| Mean score (resolved): 35.6 |
| Mean score (unresolved): 1.4 |
| Separation: +34.2 |
| Correlation: 1.000 |
|
|
| 🚨 Known-Bad Pattern Detection: |
| Detected: 50/50 (100.0%) |
| ✅ Good detection rate |
|
|
| 📐 Per-Dimension Scores: |
| correctness: mean=2.7 std=3.1 range=0-9 |
| completeness: mean=2.0 std=2.3 range=0-6 |
| code_quality: mean=2.4 std=2.9 range=0-9 |
| non_regression_risk: mean=2.4 std=2.9 range=0-9 |
| merge_readiness: mean=1.7 std=2.2 range=0-8 |
|
|
| 🏴 Most Common Flags: |
| 9x partial_fix |
| 8x missing_edge_cases |
| 7x Limited edge case coverage |
| 6x not_production_ready |
| 6x missing_edge_case_handling |
| 5x Style violations |
| 5x minimal_test_coverage |
| 4x style_violations |
| 4x Fundamentally flawed approach |
| 4x Poor code quality |
|
|
| ⭐ Top 3 Patches: |
| 80.5 django__django-11066 (CoderForge-Qwen3-32B, PASS) |
| 75.0 django__django-11999 (CoderForge-Qwen3-32B, PASS) |
| 74.0 django__django-12143 (CoderForge-Qwen3-32B, PASS) |
|
|
| 💀 Bottom 3 Patches: |
| 0.0 pydata__xarray-3151 (OpenHands-O1-reasoning-high, FAIL) |
| 0.0 django__django-14792 (OpenHands-O1-reasoning-high, FAIL) |
| 0.0 django__django-11848 (OpenHands-O1-reasoning-high, FAIL) |
|
|
| ====================================================================== |
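
The summary figures above can be cross-checked against each other with a little arithmetic. The sketch below is not part of PatchJudge; it only re-derives numbers from values printed in the report. It assumes that the histogram bins are half-open (a score of exactly 50.0 falls in the 50-60 bin), that "test-passing" and "resolved" refer to the same set of patches, and that the reported means are rounded to one decimal place. The variable names are hypothetical, and the resolved count it produces is an estimate implied by the rounded means, not a figure stated in the report.

```python
# Values copied from the report above.
bin_counts = [88, 0, 8, 6, 23, 15, 14, 5, 1, 0]  # bins 0-10 ... 90-100
overall_mean = 22.8
mean_resolved = 35.6
mean_unresolved = 1.4
frac_passing_below_50 = 0.65

# The bin counts should add up to the dataset size.
total = sum(bin_counts)  # 160

# Separation is the difference between the two group means.
separation = mean_resolved - mean_unresolved

# The overall mean is a weighted average of the two group means:
#   overall_mean = (r * mean_resolved + (total - r) * mean_unresolved) / total
# Solving for r gives an ESTIMATE of the number of resolved patches
# (assumption: the reported means are rounded, so r is only approximate).
resolved_est = round((overall_mean - mean_unresolved) * total / separation)
unresolved_est = total - resolved_est

# Patches scoring below 50 (assumption: half-open bins, first five bins).
below_50 = sum(bin_counts[:5])  # 125

# Split the below-50 patches between the two groups.
passing_below_50 = round(frac_passing_below_50 * resolved_est)
unresolved_below_50 = below_50 - passing_below_50

print(f"total patches:            {total}")
print(f"separation:               {separation:+.1f}")
print(f"estimated resolved:       {resolved_est}")
print(f"estimated unresolved:     {unresolved_est}")
print(f"patches below 50:         {below_50}")
print(f"test-passing below 50:    {passing_below_50}")
print(f"unresolved below 50:      {unresolved_below_50}")
```

Under these assumptions, the estimated number of unresolved patches scoring below 50 equals the estimated number of unresolved patches, which would be consistent with every unresolved patch scoring below 50.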