Commit · 40ab31f
Parent(s): bd428dc
Add detailed model performance reasoning across all benchmark documentation

- ARCHITECTURE_BLUEPRINT.md +8 -0
- README.md +7 -6
- benchmark_comparison.md +10 -0
- code-review-env/README.md +6 -0
ARCHITECTURE_BLUEPRINT.md
CHANGED
@@ -242,6 +242,14 @@ Features:
| `meta-llama/llama-3.3-70b-instruct` | 🌐 OpenRouter | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | 🌐 OpenRouter | 0.800 | 0.556 | 0.500 | **0.619** | 97% |

+### 🧠 Performance Analysis: Why Models Succeed or Fail
+Our deterministic grading environment captures architectural strengths and weaknesses that are not visible in standard multiple-choice tests:
+
+- 🥇 **DeepSeek-V3:** Dominated thanks to superior **confidence calibration** and **semantic reasoning**. When faced with the adversarial "Red Herring" (`try...except: pass` inside a backoff loop), it correctly reports confidence below 80%, bypassing the trap without a severe penalty. It uses multi-step logic to deduce *why* code is conceptually flawed (the Semantic "Why" metric), earning full F1 credit.
+- 🥈 **Qwen-2.5-72B:** Highly capable at identifying localized syntax and security errors in the Easy and Medium environments, but it suffered in the Hard task due to **limitations in long-context, cross-file repository reasoning**: it failed to accurately trace `_KEY_MATERIAL` usage across distinct, interdependent Python files.
+- 🥉 **Llama-3.3-70B:** Suffered mathematically from **overconfidence syndrome**. The environment heavily penalizes false positives submitted with >80% confidence, and Llama consistently flagged secure, valid code lines as "Critical Vulnerabilities" at 95%+ confidence, collapsing its F1 score. It often fell for the adversarial comment injections.
+- 💀 **Smaller/Local Models:** Failed primarily due to **JSON schema violations** (outputting conversational text instead of strict operations) or hitting token limits during extraction.
+

---

## 8. Testing Infrastructure
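The adversarial "Red Herring" the analysis refers to is only described in passing. A minimal sketch of the pattern, with hypothetical function and client names, might look like this:

```python
import time

def fetch_with_backoff(client, url, retries=3):
    """Retry a request with exponential backoff.

    The bare `except: pass` below is the "red herring" trap: it looks
    like a harmless retry guard, but it silently swallows *every*
    exception (auth failures, bad URLs, programming errors), so the
    loop burns all retries even on unrecoverable errors and finally
    returns None with no diagnostic at all.
    """
    for attempt in range(retries):
        try:
            return client.get(url)
        except:  # noqa: E722 -- the conceptual flaw under test
            pass
        time.sleep(2 ** attempt)  # backoff: 1s, 2s, 4s, ...
    return None
```

A reviewer who only pattern-matches "retry loop with backoff" passes this as idiomatic; spotting the flaw requires reasoning about *which* exceptions the loop should actually absorb.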
README.md
CHANGED
@@ -118,12 +118,13 @@ Final stress test verification leveraging OpenRouter API failover.
| `qwen/qwen-2.5-72b-instruct` | 🌐 **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
| `mistralai/mistral-small-3.1-24b` | 🌐 **OpenRouter** | 0.001 | 0.001 | 0.999 | **0.334** | 100% |

-
-
-
-DeepSeek
-
-
+### 🧠 Performance Analysis: Why Models Succeed or Fail
+Our deterministic grading environment reveals behaviors that standard multiple-choice benchmarks do not capture:
+
+- 🥇 **DeepSeek-V3 (The Winner):** Dominated thanks to superior **confidence calibration** and **semantic reasoning**. Unlike the other models, DeepSeek does not just guess: when faced with the adversarial "Red Herring" (`try...except: pass` inside a backoff loop), its confidence drops, allowing it to bypass the trap entirely. It uses multi-step logic to deduce *why* code is conceptually flawed rather than merely syntactically incorrect.
+- 🥈 **Qwen-2.5-72B:** Highly capable at identifying localized syntax and logic errors in the Easy and Medium environments, but it suffered in the Hard task, demonstrating **limitations in long-context, cross-file reasoning**: it often failed to track how keys generated in `config_loader.py` were insecurely consumed in `crypto_service.py`.
+- 🥉 **Llama-3.3-70B (The Overconfident Guesser):** Suffered mathematically from **overconfidence syndrome**. The environment heavily penalizes false positives submitted with >80% confidence, and Llama consistently flagged secure, verified code blocks as "Critical Vulnerabilities" at 95% confidence, causing its F1 score to collapse. It could not distinguish real bugs from the adversarial comment injections.
+- 💀 **Smaller/Local Models (Mixtral, Meta-Llama-8B, Gemma):** Generally failed due to **JSON parsing collapse** (outputting conversational text or reasoning tags instead of strict operation schemas) or hitting timeout limits when scanning larger code blocks.

See [`FINDINGS_PAPER.md`](./FINDINGS_PAPER.md) for full analysis.
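The >80% confidence penalty on false positives is stated but never shown; the actual scoring code is not part of this commit. A toy grader illustrating the rule, where the 2x weighting factor and the data shapes are assumptions, could be:

```python
def grade(findings, ground_truth):
    """Score findings [(line, confidence), ...] against a set of buggy lines.

    Sketch of the described penalty: a false positive reported with
    > 80% confidence counts double against precision, so an
    overconfident guesser's F1 collapses while a calibrated model
    that hedges below the threshold is only mildly penalized.
    """
    tp = fp_weight = 0.0
    flagged = set()
    for line, confidence in findings:
        flagged.add(line)
        if line in ground_truth:
            tp += 1
        else:
            # overconfident false positives weighted 2x (assumed factor)
            fp_weight += 2.0 if confidence > 0.8 else 1.0
    fn = len(ground_truth - flagged)
    precision = tp / (tp + fp_weight) if tp + fp_weight else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```

Under this rule, a model that flags one extra line at 50% confidence outscores one that flags the same line (and more) at 95%, even when both find every real bug.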
benchmark_comparison.md
CHANGED
@@ -48,3 +48,13 @@ Throughout the ascending environments, score clamping was mathematically refined
| `meta-llama/llama-3.3-70b-instruct` | 🌐 **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | 🌐 **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
| `mistralai/mistral-small-3.1-24b` | 🌐 **OpenRouter** | 0.001 | 0.001 | 0.999 | **0.334** | 100% |
+
+<br>
+
+### 🧠 Performance Analysis: Why Models Succeed or Fail
+Our deterministic grading environment reveals behaviors that standard multiple-choice benchmarks do not capture:
+
+- 🥇 **DeepSeek-V3 (The Winner):** Dominated thanks to superior **confidence calibration** and **semantic reasoning**. Unlike the other models, DeepSeek does not just guess: when faced with the adversarial "Red Herring" (`try...except: pass` inside a backoff loop), its confidence drops, allowing it to bypass the trap entirely. It uses multi-step logic to deduce *why* code is conceptually flawed rather than merely syntactically incorrect.
+- 🥈 **Qwen-2.5-72B:** Highly capable at identifying localized syntax and logic errors in the Easy and Medium environments, but it suffered in the Hard task, demonstrating **limitations in long-context, cross-file reasoning**: it often failed to track how keys generated in `config_loader.py` were insecurely consumed in `crypto_service.py`.
+- 🥉 **Llama-3.3-70B (The Overconfident Guesser):** Suffered mathematically from **overconfidence syndrome**. The environment heavily penalizes false positives submitted with >80% confidence, and Llama consistently flagged secure, verified code blocks as "Critical Vulnerabilities" at 95% confidence, causing its F1 score to collapse. It could not distinguish real bugs from the adversarial comment injections.
+- 💀 **Smaller/Local Models (Mixtral, Meta-Llama-8B, Gemma):** Generally failed due to **JSON parsing collapse** (outputting conversational text or reasoning tags instead of strict operation schemas) or hitting timeout limits when scanning larger code blocks.
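The "strict operation schemas" that smaller models fail to produce are not spelled out in this diff. A hypothetical validator, with illustrative field names, shows why conversational output scores zero:

```python
import json

# Hypothetical shape of one review "operation"; the real schema is not
# shown in the commit, so these field names are illustrative only.
REQUIRED_FIELDS = {"line": int, "severity": str, "confidence": float}

def parse_operations(raw: str):
    """Parse a model response into a strict list of operation dicts.

    Smaller models often wrap or replace the JSON payload with
    conversational text ("Sure! Here are the bugs I found...") or
    reasoning tags; anything that is not a JSON array of well-typed
    objects is rejected outright, which is the "JSON parsing collapse"
    failure mode described above.
    """
    try:
        ops = json.loads(raw)
    except json.JSONDecodeError:
        return None  # conversational text, reasoning tags, etc.
    if not isinstance(ops, list):
        return None
    for op in ops:
        if not isinstance(op, dict):
            return None
        for field, typ in REQUIRED_FIELDS.items():
            if not isinstance(op.get(field), typ):
                return None
    return ops
```

Rejecting the whole response on any violation (rather than salvaging partial JSON) is a design choice that keeps grading deterministic at the cost of harshness toward weaker models.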
code-review-env/README.md
CHANGED
@@ -54,6 +54,12 @@ For a complete breakdown, refer to `benchmark_comparison.md` in the repository r
| `meta-llama/llama-3.3-70b-instruct` | 🌐 OpenRouter | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | 🌐 OpenRouter | **0.619** | 97% |

+### 🧠 Performance Analysis: Why Models Succeed or Fail
+- **DeepSeek-V3:** Excels thanks to strong **confidence calibration**: it ignores the adversarial red-herring traps and accurately links vulnerabilities across multiple files.
+- **Qwen-2.5-72B:** Strong at localized syntax checking but weak at the long-context, cross-file reasoning needed to track variables between modules.
+- **Llama-3.3-70B:** Severely punished by the F1 grader for **overconfidence syndrome**: flagging false positives with 95% certainty.
+- **Small Models:** Primarily fail due to JSON parsing collapse or timeout limits while analyzing large source files.
+

## Tests

```bash
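The cross-file vulnerability the Hard task tests is named but not shown; `config_loader.py` and `crypto_service.py` are not part of this diff. An illustrative reconstruction of the *kind* of two-file flaw described, with invented contents, might be:

```python
# --- config_loader.py (illustrative sketch, not the real file) ---
import hashlib
import time

def load_key_material() -> bytes:
    # Flaw 1: key material derived from the wall clock, not a CSPRNG.
    # An attacker who knows the rough start time can brute-force it.
    _KEY_MATERIAL = str(int(time.time())).encode()
    return _KEY_MATERIAL

# --- crypto_service.py (illustrative sketch, not the real file) ---
def derive_cipher_key(material: bytes) -> bytes:
    # Flaw 2: a single unsalted MD5 pass over already-weak material.
    # Neither function looks alarming in isolation; the vulnerability
    # only becomes clear when both files are read together, which is
    # exactly the cross-file tracing the Hard environment demands.
    return hashlib.md5(material).digest()
```

A model that reviews each file independently can miss both flaws; linking the predictable source to the weak derivation is what separates the top scorers here.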