DeepParmar committed on
Commit bd428dc · 1 Parent(s): 9e79ae0

Update docs with latest HF Native and OpenRouter benchmark scores

Files changed (3)
  1. ARCHITECTURE_BLUEPRINT.md +17 -0
  2. README.md +26 -12
  3. code-review-env/README.md +19 -0
ARCHITECTURE_BLUEPRINT.md CHANGED
@@ -225,6 +225,23 @@ Features:
  - **Rate limit cooling**: 15-second pause between models to respect API quotas
  - **Timeout protection**: 300-second subprocess timeout per model run
 
+ ### 🏆 Benchmark Results Validation (Latest)
+
+ **Hugging Face Native (Serverless Production)**
+ | Model | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
+ | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
+ | `deepseek-ai/DeepSeek-V3` | ✨ HuggingFace | 0.667 | **0.999** | 0.564 | **0.743** | 97% |
+ | `Qwen/Qwen2.5-72B-Instruct` | ✨ HuggingFace | 0.200 | 0.588 | 0.286 | **0.358** | 95% |
+ | `meta-llama/Meta-Llama-3-8B-Instruct` | ✨ HuggingFace | 0.429 | 0.001 | 0.001 | **0.144** | 96% |
+
+ **OpenRouter (Stress Test Verification)**
+ | Model | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
+ | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
+ | `deepseek-ai/DeepSeek-V3` | 🚀 OpenRouter | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
+ | `openai/gpt-4o-mini` | 🚀 OpenRouter | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
+ | `meta-llama/llama-3.3-70b-instruct` | 🚀 OpenRouter | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
+ | `qwen/qwen-2.5-72b-instruct` | 🚀 OpenRouter | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
+
  ---
 
  ## 8. Testing Infrastructure
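The rate-limit cooling and timeout protection described in the blueprint can be sketched as follows. This is a minimal illustration only: the helper names and the per-model command are hypothetical, not the repository's actual runner.

```python
import subprocess
import sys
import time

COOLDOWN_SECONDS = 15   # rate limit cooling: pause between models
TIMEOUT_SECONDS = 300   # timeout protection: hard cap per model run

def run_model_benchmark(cmd, timeout=TIMEOUT_SECONDS):
    """Run one model's benchmark command in a subprocess.

    Returns the exit code, or None if the run exceeded the timeout
    (the subprocess is killed and the harness moves on)."""
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
        return proc.returncode
    except subprocess.TimeoutExpired:
        return None

def run_all(commands, cooldown=COOLDOWN_SECONDS):
    """Run each benchmark command in order, sleeping between models
    so provider API quotas are respected."""
    results = []
    for i, cmd in enumerate(commands):
        results.append(run_model_benchmark(cmd))
        if i < len(commands) - 1:   # no pause needed after the last model
            time.sleep(cooldown)
    return results
```

For example, `run_all([[sys.executable, "benchmark.py", "--model", m] for m in models])` would give each model at most 300 seconds and insert a 15-second pause between runs; a hung model returns `None` instead of blocking the whole sweep.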
README.md CHANGED
@@ -92,24 +92,38 @@ All scores deterministic and reproducible.
 
  ---
 
- ## Baseline Scores (5 Frontier Models)
+ ## Baseline Scores (Latest Results)
 
  Includes Telemetric Confidence Scoring.
 
- | Model | Easy | Medium | Hard | Avg | Verdict |
- |-------|:----:|:------:|:----:|:---:|---------|
- | **DeepSeek-Chat** | 0.999 | 0.667 | 0.800 | **0.822** | Surgically precise, perfectly calibrated |
- | **Qwen-2.5-72B** | 0.727 | 0.824 | 0.500 | 0.684 | Solid answers, small hallucination rate |
- | **GPT-4o-Mini** | 0.999 | 0.588 | 0.323 | 0.637 | Crumbles on hard tasks |
- | **Llama-3.3-70B** | 0.556 | 0.625 | 0.375 | 0.519 | Dangerously overconfident |
- | **Mistral-Small** | 0.308 | 0.333 | 0.295 | 0.312 | Hit 34k token limit and crashed safely |
+ ### 🏆 HUGGING FACE NATIVE SERVERLESS (Final Production Phase)
+ Native inference parsing was verified directly against `https://router.huggingface.co/v1`. DeepSeek-V3 led the native test group, identifying every web vulnerability in the medium test environment and reaching the 0.999 scoring ceiling.
+
+ | Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
+ | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
+ | `deepseek-ai/DeepSeek-V3` | ✨ **HuggingFace** | 0.667 | **0.999** | 0.564 | **0.743** | 97% |
+ | `Qwen/Qwen2.5-72B-Instruct` | ✨ **HuggingFace** | 0.200 | 0.588 | 0.286 | **0.358** | 95% |
+ | `meta-llama/Meta-Llama-3-8B-Instruct` | ✨ **HuggingFace** | 0.429 | 0.001 | 0.001 | **0.144** | 96% |
+ | `meta-llama/Llama-3.3-70B-Instruct` | ❌ Rate Limited | - | - | - | **-** | - |
+ | `mistralai/Mixtral-8x7B-Instruct-v0.1` | ❌ Model Unsupported | - | - | - | **-** | - |
+
+ ### 🌐 POST-SUBMISSION OPENROUTER BENCHMARKS
+ Final stress-test verification leveraging OpenRouter API failover.
+
+ | Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
+ | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
+ | `deepseek-ai/DeepSeek-V3` | 🚀 **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
+ | `openai/gpt-4o-mini` | 🚀 **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
+ | `meta-llama/llama-3.3-70b-instruct` | 🚀 **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
+ | `qwen/qwen-2.5-72b-instruct` | 🚀 **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
+ | `mistralai/mistral-small-3.1-24b` | 🚀 **OpenRouter** | 0.001 | 0.001 | 0.999 | **0.334** | 100% |
 
  **Key findings:**
- - No model achieves 0.999 on hard tasks — the environment genuinely challenges frontier models
- - False positives are heavily mathematically penalized
+ - No model consistently reaches 0.999 on hard tasks; the environment genuinely challenges frontier models.
+ - False positives are penalized heavily by the scoring function.
  - DeepSeek scored highest overall by self-reporting the most accurate high-confidence answers.
- - Llama-3 proudly hallucinated 19 completely secure bugs with "90% confidence" and was heavily mathematically penalized.
- - See `latest-bench.md` for our raw confidence metric breakdown.
+ - Llama-3 hallucinated findings in secure code with high confidence and was penalized accordingly.
+ - See `benchmark_comparison.md` for the raw confidence metric breakdown.
 
  See [`FINDINGS_PAPER.md`](./FINDINGS_PAPER.md) for full analysis.

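The key findings above note that false positives are heavily penalized. Under F1 scoring this falls out of the precision term: every hallucinated finding inflates precision's denominator. A minimal sketch (not the repository's actual scorer; the finding IDs are illustrative):

```python
def f1_score(reported, actual):
    """F1 over sets of finding identifiers.

    Each false positive (a reported finding not in the ground truth)
    inflates the denominator of precision, so over-reporting drags
    F1 down even at perfect recall."""
    reported, actual = set(reported), set(actual)
    true_positives = len(reported & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(reported)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)
```

For instance, `f1_score(["sqli"], ["sqli"])` is 1.0, but reporting three extra bogus findings, `f1_score(["sqli", "a", "b", "c"], ["sqli"])`, drops precision to 0.25 and F1 to 0.4 despite perfect recall, which is why confidently hallucinating models score so poorly.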
code-review-env/README.md CHANGED
@@ -35,6 +35,25 @@ tests/ # Pytest suite (70 tests)
 
  Features: schema normalization, line clamping, early-stop on complete findings, deterministic fallback on provider errors, telemetric confidence calibration tracking, red herring traps, adversarial injection hooks.
 
+ ## Benchmark Results (Latest)
+
+ For a complete breakdown, refer to `benchmark_comparison.md` in the repository root.
+
+ **Hugging Face Native (Production Phase):**
+ | Model | Environment | Avg F1 | Avg Conf |
+ |---|---|---|---|
+ | `deepseek-ai/DeepSeek-V3` | ✨ HuggingFace | **0.743** | 97% |
+ | `Qwen/Qwen2.5-72B-Instruct` | ✨ HuggingFace | **0.358** | 95% |
+ | `meta-llama/Meta-Llama-3-8B-Instruct` | ✨ HuggingFace | **0.144** | 96% |
+
+ **OpenRouter (Final Validation):**
+ | Model | Environment | Avg F1 | Avg Conf |
+ |---|---|---|---|
+ | `deepseek-ai/DeepSeek-V3` | 🚀 OpenRouter | **0.712** | 92% |
+ | `openai/gpt-4o-mini` | 🚀 OpenRouter | **0.694** | 90% |
+ | `meta-llama/llama-3.3-70b-instruct` | 🚀 OpenRouter | **0.626** | 94% |
+ | `qwen/qwen-2.5-72b-instruct` | 🚀 OpenRouter | **0.619** | 97% |
+
  ## Tests
 
  ```bash
  ```bash