Commit bd428dc · Parent(s): 9e79ae0

Update docs with latest HF Native and OpenRouter benchmark scores

Files changed:

- ARCHITECTURE_BLUEPRINT.md (+17 -0)
- README.md (+26 -12)
- code-review-env/README.md (+19 -0)
ARCHITECTURE_BLUEPRINT.md (CHANGED)

@@ -225,6 +225,23 @@ Features:

- **Rate limit cooling**: 15-second pause between models to respect API quotas
- **Timeout protection**: 300-second subprocess timeout per model run
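The two protections above combine naturally in a runner loop. A minimal sketch, assuming a hypothetical `run_benchmark.py` entry point (the 15 s cooldown and 300 s timeout are the documented values):

```python
import subprocess
import sys
import time

COOLDOWN_S = 15    # rate-limit cooling between models
TIMEOUT_S = 300    # per-model subprocess timeout

def run_benchmarks(models, cooldown=COOLDOWN_S, timeout=TIMEOUT_S):
    """Run each model's benchmark in its own subprocess, pausing between runs."""
    results = {}
    for i, model in enumerate(models):
        try:
            proc = subprocess.run(
                [sys.executable, "run_benchmark.py", "--model", model],
                capture_output=True, text=True, timeout=timeout,
            )
            results[model] = proc.returncode
        except subprocess.TimeoutExpired:
            results[model] = "timeout"  # timeout protection: record and move on
        if i < len(models) - 1:
            time.sleep(cooldown)  # rate-limit cooling before the next model
    return results
```

Running each model in a subprocess means one hung provider call cannot stall the whole sweep.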
### Benchmark Results Validation (Latest)

**Hugging Face Native (Serverless Production)**

| Model | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | HuggingFace | 0.667 | **0.999** | 0.564 | **0.743** | 97% |
| `Qwen/Qwen2.5-72B-Instruct` | HuggingFace | 0.200 | 0.588 | 0.286 | **0.358** | 95% |
| `meta-llama/Meta-Llama-3-8B-Instruct` | HuggingFace | 0.429 | 0.001 | 0.001 | **0.144** | 96% |

**OpenRouter (Stress Test Verification)**

| Model | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | OpenRouter | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
| `openai/gpt-4o-mini` | OpenRouter | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
| `meta-llama/llama-3.3-70b-instruct` | OpenRouter | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | OpenRouter | 0.800 | 0.556 | 0.500 | **0.619** | 97% |

---

## 8. Testing Infrastructure
README.md (CHANGED)

@@ -92,24 +92,38 @@ All scores deterministic and reproducible.

---

## Baseline Scores (Latest Results)

Includes Telemetric Confidence Scoring.

### HUGGING FACE NATIVE SERVERLESS (Final Production Phase)

Native inference parsing was verified directly against `https://router.huggingface.co/v1`. DeepSeek-V3 led the native test group, identifying every web vulnerability in the medium test environment and hitting the 0.999 scoring ceiling.

| Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | **HuggingFace** | 0.667 | **0.999** | 0.564 | **0.743** | 97% |
| `Qwen/Qwen2.5-72B-Instruct` | **HuggingFace** | 0.200 | 0.588 | 0.286 | **0.358** | 95% |
| `meta-llama/Meta-Llama-3-8B-Instruct` | **HuggingFace** | 0.429 | 0.001 | 0.001 | **0.144** | 96% |
| `meta-llama/Llama-3.3-70B-Instruct` | Rate Limited | - | - | - | **-** | - |
| `mistralai/Mixtral-8x7B-Instruct-v0.1` | Model Unsupported | - | - | - | **-** | - |
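The native endpoint above (`https://router.huggingface.co/v1`) is OpenAI-compatible, so a plain chat-completions payload is enough. A minimal sketch, where the `build_review_request` helper and prompt wording are illustrative rather than the environment's actual code, and the HTTP POST itself is omitted:

```python
import json
import os

# OpenAI-compatible chat-completions route on the HF router (see text above).
ENDPOINT = "https://router.huggingface.co/v1/chat/completions"

def build_review_request(model, code_snippet):
    """Assemble an OpenAI-style payload asking `model` to review `code_snippet`."""
    return {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a security code reviewer."},
            {"role": "user", "content": f"Find vulnerabilities:\n{code_snippet}"},
        ],
        "temperature": 0.0,  # deterministic runs, matching the reproducibility note
    }

# A bearer token from HF_TOKEN would authenticate the POST to ENDPOINT.
headers = {"Authorization": f"Bearer {os.environ.get('HF_TOKEN', '')}"}
body = json.dumps(build_review_request("deepseek-ai/DeepSeek-V3", "eval(input())"))
```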
### POST-SUBMISSION OPENROUTER BENCHMARKS

Final stress-test verification using OpenRouter API failover.

| Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
| `openai/gpt-4o-mini` | **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
| `meta-llama/llama-3.3-70b-instruct` | **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
| `mistralai/mistral-small-3.1-24b` | **OpenRouter** | 0.001 | 0.001 | 0.999 | **0.334** | 100% |

**Key findings:**

- No model achieves 0.999 consistently on hard tasks: the environment genuinely challenges frontier models.
- False positives are heavily penalized by the scoring.
- DeepSeek scored highest overall by self-reporting the most accurate high-confidence answers.
- Llama-3 confidently hallucinated vulnerabilities in secure code and was heavily penalized for it.
- See `benchmark_comparison.md` for the raw confidence-metric breakdown.

See [`FINDINGS_PAPER.md`](./FINDINGS_PAPER.md) for full analysis.
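The Avg F1 column in the tables above averages per-difficulty F1 scores, and the false-positive penalty in the findings falls directly out of F1's precision term. A minimal sketch with hypothetical counts and the standard formula:

```python
def f1(tp, fp, fn):
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

# Same 3 real findings; hallucinating 3 extra ones drops F1 from 0.857 to 0.600.
clean = f1(tp=3, fp=0, fn=1)
noisy = f1(tp=3, fp=3, fn=1)
```

This is why a model that confidently reports nonexistent vulnerabilities scores worse than one that reports fewer but correct findings.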
code-review-env/README.md (CHANGED)

@@ -35,6 +35,25 @@ tests/ # Pytest suite (70 tests)

Features: schema normalization, line clamping, early-stop on complete findings, deterministic fallback on provider errors, telemetric confidence calibration tracking, red herring traps, adversarial injection hooks.
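Of the features above, line clamping and schema normalization are easy to picture. A minimal sketch, where the finding shape and the `normalize_finding` helper are hypothetical rather than the environment's actual schema:

```python
def clamp_line(line, num_lines):
    """Clamp a model-reported line number into the file's valid range [1, num_lines]."""
    return max(1, min(int(line), num_lines))

def normalize_finding(finding, num_lines):
    """Coerce a raw model finding into a fixed schema with a clamped line number."""
    return {
        "type": str(finding.get("type", "unknown")).lower(),
        "line": clamp_line(finding.get("line", 1), num_lines),
        "confidence": float(finding.get("confidence", 0.0)),
    }
```

Clamping keeps out-of-range line numbers from crashing the scorer while still letting them count as inaccurate findings.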
## Benchmark Results (Latest)

For a complete breakdown, refer to `benchmark_comparison.md` in the repository root.

**Hugging Face Native (Production Phase):**

| Model | Environment | Avg F1 | Avg Conf |
|---|---|---|---|
| `deepseek-ai/DeepSeek-V3` | HuggingFace | **0.743** | 97% |
| `Qwen/Qwen2.5-72B-Instruct` | HuggingFace | **0.358** | 95% |
| `meta-llama/Meta-Llama-3-8B-Instruct` | HuggingFace | **0.144** | 96% |

**OpenRouter (Final Validation):**

| Model | Environment | Avg F1 | Avg Conf |
|---|---|---|---|
| `deepseek-ai/DeepSeek-V3` | OpenRouter | **0.712** | 92% |
| `openai/gpt-4o-mini` | OpenRouter | **0.694** | 90% |
| `meta-llama/llama-3.3-70b-instruct` | OpenRouter | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | OpenRouter | **0.619** | 97% |

## Tests

```bash