Commit · 48ab79c
Parent(s): 3129333
Update master record with massive confidence table and exact native module names
all_record.txt CHANGED +39 -11
|
@@ -1,11 +1,40 @@
|
|
| 1 |
=================================================================
|
| 2 |
-
CODE REVIEW OPENENV -
|
| 3 |
=================================================================
|
| 4 |
-
This document contains ONLY the latest, fully validated multi-model executions across OpenRouter and Native Hugging Face networks.
|
| 5 |
|
| 6 |
-
|
| 7 |
-
|
| 8 |
-
---
| 9 |
=======================================================================
|
| 10 |
CODE REVIEW OPENENV - FINAL FULL OPEN-ROUTER BENCHMARK RUN
|
| 11 |
=======================================================================
|
|
@@ -15,20 +44,20 @@ Target LLM Gateway: https://openrouter.ai/api/v1 (Open Router API)
|
|
| 15 |
|
| 16 |
|
| 17 |
=======================================================================
|
| 18 |
-
--- RUNNING ELITE EXTREME TEST ON MODEL: DeepSeek-V3 (
|
| 19 |
--- Execution Time: 66.95s ---
|
| 20 |
--- STDOUT (Environment Steps & Actions) ---
|
| 21 |
-
[START] task=easy env=code-review-env model=
|
| 22 |
[STEP] step=1 action={"operation":"add_comment","line_number":25,"severity":"minor","category":"bug","message":"The condition 'if include = delta > 0:' uses assignment operator '=' instead of comparison operator '=='. This will always assign the result of 'delta > 0' to 'include' and evaluate to True if 'delta > 0' is True, which is likely unintended.","confidence":95} reward=0.30 done=false error=null
|
| 23 |
[STEP] step=2 action={"operation":"add_comment","line_number":18,"severity":"major","category":"bug","message":"Potential IndexError: The loop iterates up to len(items), but accesses items[i + 1], which will raise an IndexError on the last iteration. Consider looping up to len(items) - 1 instead.","confidence":90} reward=0.30 done=false error=null
|
| 24 |
[STEP] step=3 action={"operation":"done"} reward=0.75 done=true error=null
|
| 25 |
[END] success=true steps=3 score=0.750 rewards=0.30,0.30,0.75
|
| 26 |
-
[START] task=medium env=code-review-env model=
|
| 27 |
[STEP] step=1 action={"operation":"add_comment","line_number":20,"severity":"major","category":"security","message":"Hardcoding sensitive information like API keys in source code is a security risk. This should be stored securely, such as in environment variables or a secrets manager.","confidence":100} reward=0.30 done=false error=null
|
| 28 |
[STEP] step=2 action={"operation":"add_comment","line_number":21,"severity":"critical","category":"security","message":"The SQL query is vulnerable to SQL injection. Use parameterized queries or prepared statements to prevent this security risk.","confidence":100} reward=0.30 done=false error=null
|
| 29 |
[STEP] step=3 action={"operation":"done"} reward=0.67 done=true error=null
|
| 30 |
[END] success=true steps=3 score=0.667 rewards=0.30,0.30,0.67
|
| 31 |
-
[START] task=hard env=code-review-env model=
|
| 32 |
[STEP] step=1 action={"operation":"add_comment","line_number":35,"severity":"critical","category":"security","message":"Using ECB mode for AES encryption is insecure as it does not provide semantic security. Consider using a more secure mode like GCM or CBC with proper initialization vectors.","confidence":95} reward=0.30 done=false error=null
|
| 33 |
[STEP] step=2 action={"operation":"add_comment","line_number":30,"severity":"critical","category":"security","message":"Using yaml.load() without specifying a safe loader is dangerous and can lead to arbitrary code execution. Use yaml.safe_load() instead.","confidence":90} reward=0.35 done=false error=null
|
| 34 |
[STEP] step=3 action={"operation":"add_comment","line_number":47,"severity":"critical","category":"bug","message":"The '_SESSION_CACHE' dictionary is accessed without any synchronization mechanism, which can lead to race conditions in a concurrent environment. Consider using a thread-safe data structure or synchronization primitives like asyncio.Lock.","confidence":85} reward=0.30 done=false error=null
|
|
@@ -230,8 +259,7 @@ Target LLM Gateway: https://openrouter.ai/api/v1 (Open Router API)
|
|
| 230 |
|
| 231 |
|
| 232 |
|
| 233 |
-
[
|
| 234 |
-
--- Telemetry & Benchmark Logs (hf_api_test.txt) ---
|
| 235 |
=======================================================================
|
| 236 |
CODE REVIEW OPENENV - NATIVE HUGGING FACE ROUTER INFERENCE BENCHMARK
|
| 237 |
=======================================================================
|
|
|
|
| 1 |
=================================================================
|
| 2 |
+
CODE REVIEW OPENENV - ULTIMATE MASTER BENCHMARK COMPILATION
|
| 3 |
=================================================================
|
|
|
|
| 4 |
|
+### COMPREHENSIVE PERFORMANCE TABLE (Oldest to Latest)
+| Exact Model ID (No Manual Labels) | Iteration Tag | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Confidence |
+|-----------------------------------|---------------|---------|-----------|---------|------------|----------------|
+| qwen/qwen-2.5-72b-instruct | [Old Baseline] | 0.727 | 0.824 | 0.5 | **0.684** | 95% |
+| deepseek/deepseek-chat | [Old Baseline] | 0.999 | 0.667 | 0.8 | **0.822** | 96% |
+| meta-llama/llama-3.3-70b-instruct | [Old Baseline] | 0.556 | 0.625 | 0.375 | **0.519** | 94% |
+| openai/gpt-4o-mini | [Old Concurrency] | 0.667 | 0.588 | 0.308 | **0.521** | 90% |
+| deepseek/deepseek-chat | [Old Concurrency] | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
+| qwen/qwen-2.5-72b-instruct | [Old Concurrency] | 0.667 | 0.625 | 0.5 | **0.597** | 99% |
+| meta-llama/llama-3.1-70b-instruct | [Old Concurrency] | 0.833 | 0.636 | 0.545 | **0.671** | 96% |
+| deepseek/deepseek-chat | [Old Live OpenRouter] | 0.6 | 0.667 | 0.5 | **0.589** | 94% |
+| qwen/qwen-2.5-72b-instruct | [Old Live OpenRouter] | 0.5 | 0.588 | 0.5 | **0.529** | 98% |
+| openai/gpt-4o-mini | [Old Live OpenRouter] | 0.6 | 0.667 | 0.324 | **0.530** | 90% |
+| meta-llama/llama-3.3-70b-instruct | [Old Live OpenRouter] | 0.999 | 0.625 | 0.545 | **0.723** | 95% |
+| mistralai/mistral-small-3.1-24b-instruct | [Old Live OpenRouter] | 0.1 | 0.333 | 0.999 | **0.477** | 100% |
+| deepseek-ai/DeepSeek-V3 | ✅ [Latest HuggingFace NATIVE] | 0.667 | 0.999 | 0.564 | **0.743** | 97% |
+| Qwen/Qwen2.5-72B-Instruct | ✅ [Latest HuggingFace NATIVE] | 0.2 | 0.588 | 0.286 | **0.358** | 95% |
+| meta-llama/Llama-3.3-70B-Instruct | ✅ [Latest HuggingFace NATIVE] | 0.001 | 0.001 | 0.001 | **0.001** | N/A |
+| mistralai/Mixtral-8x7B-Instruct-v0.1 | ✅ [Latest HuggingFace NATIVE] | 0.001 | 0.001 | 0.001 | **0.001** | N/A |
+| meta-llama/Meta-Llama-3-8B-Instruct | ✅ [Latest HuggingFace NATIVE] | 0.429 | 0.001 | 0.001 | **0.144** | 96% |
+| deepseek/deepseek-chat | ✅ [Latest OpenRouter] | 0.75 | 0.667 | 0.72 | **0.712** | 92% |
+| qwen/qwen-2.5-72b-instruct | ✅ [Latest OpenRouter] | 0.8 | 0.556 | 0.5 | **0.619** | 97% |
+| openai/gpt-4o-mini | ✅ [Latest OpenRouter] | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
+| meta-llama/llama-3.3-70b-instruct | ✅ [Latest OpenRouter] | 0.5 | 0.833 | 0.545 | **0.626** | 94% |
+| mistralai/mistral-small-3.1-24b-instruct | ✅ [Latest OpenRouter] | 0.001 | 0.001 | 0.999 | **0.334** | 100% |
|
| 30 |
+
|
| 31 |
+
---
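The Avg F1 column in the performance table appears to be the unweighted mean of the Easy, Medium, and Hard F1 scores, rounded to three decimals. The file does not state this rule; it is inferred from the numbers. A minimal sketch that checks a few rows copied from the table (the helper name `avg_f1` is hypothetical):

```python
def avg_f1(easy, medium, hard):
    """Unweighted mean of the three per-task F1 scores (assumed rule)."""
    return (easy + medium + hard) / 3

# (model id, easy, medium, hard, reported average), copied from the table
rows = [
    ("deepseek/deepseek-chat [Old Baseline]", 0.999, 0.667, 0.8, 0.822),
    ("deepseek-ai/DeepSeek-V3 [Latest HuggingFace NATIVE]", 0.667, 0.999, 0.564, 0.743),
    ("mistralai/mistral-small-3.1-24b-instruct [Latest OpenRouter]", 0.001, 0.001, 0.999, 0.334),
]

# A row is consistent if the recomputed mean is within rounding distance
# of the value reported in the table.
consistent = [abs(avg_f1(e, m, h) - reported) < 1e-3 for _, e, m, h, reported in rows]
print(consistent)
```

A row that fails this check would point to a transcription error in the table or to a different averaging rule.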
|
| 32 |
+
|
| 33 |
+
### EXCLUSIVE RAW TERMINAL LOGS (LATEST EXECUTIONS ONLY)
|
| 34 |
+
Below are the unedited, raw STDOUT/STDERR logs exclusively for the LATEST HuggingFace Native API and OpenRouter API benchmarks. Older raw logs have been stripped directly per user instruction.
|
| 35 |
+
|
| 36 |
+
|
| 37 |
+
--- [[[ LOG ARCHIVE: new benchmarks.txt ]]] ---
|
| 38 |
=======================================================================
|
| 39 |
CODE REVIEW OPENENV - FINAL FULL OPEN-ROUTER BENCHMARK RUN
|
| 40 |
=======================================================================
|
|
|
|
| 44 |
|
| 45 |
|
| 46 |
=======================================================================
|
| 47 |
+
--- RUNNING ELITE EXTREME TEST ON MODEL: DeepSeek-V3 (deepseek/deepseek-chat) ---
|
| 48 |
--- Execution Time: 66.95s ---
|
| 49 |
--- STDOUT (Environment Steps & Actions) ---
|
| 50 |
+
[START] task=easy env=code-review-env model=deepseek/deepseek-chat
|
| 51 |
[STEP] step=1 action={"operation":"add_comment","line_number":25,"severity":"minor","category":"bug","message":"The condition 'if include = delta > 0:' uses assignment operator '=' instead of comparison operator '=='. This will always assign the result of 'delta > 0' to 'include' and evaluate to True if 'delta > 0' is True, which is likely unintended.","confidence":95} reward=0.30 done=false error=null
|
| 52 |
[STEP] step=2 action={"operation":"add_comment","line_number":18,"severity":"major","category":"bug","message":"Potential IndexError: The loop iterates up to len(items), but accesses items[i + 1], which will raise an IndexError on the last iteration. Consider looping up to len(items) - 1 instead.","confidence":90} reward=0.30 done=false error=null
|
| 53 |
[STEP] step=3 action={"operation":"done"} reward=0.75 done=true error=null
|
| 54 |
[END] success=true steps=3 score=0.750 rewards=0.30,0.30,0.75
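The `[STEP]` lines in this log follow a regular `key=value` layout with a JSON object in the `action` field, so they can be read back mechanically. The following is a minimal sketch of such a parser, assuming every step line has exactly the fields seen above in this order; the names `STEP_RE` and `parse_step` are hypothetical and not part of the benchmark code.

```python
import json
import re

# Layout assumed from the log: step, action (JSON), reward, done, error.
STEP_RE = re.compile(
    r"^\[STEP\] step=(?P<step>\d+) action=(?P<action>\{.*\}) "
    r"reward=(?P<reward>-?\d+(?:\.\d+)?) done=(?P<done>true|false) "
    r"error=(?P<error>.*)$"
)

def parse_step(line):
    """Return the fields of one [STEP] line as a dict, or None if it does not match."""
    m = STEP_RE.match(line.strip())
    if m is None:
        return None
    return {
        "step": int(m["step"]),
        "action": json.loads(m["action"]),
        "reward": float(m["reward"]),
        "done": m["done"] == "true",
        "error": None if m["error"] == "null" else m["error"],
    }

example = '[STEP] step=3 action={"operation":"done"} reward=0.75 done=true error=null'
parsed = parse_step(example)
print(parsed)
```

In the episodes shown here, the `score` on the `[END]` line equals the reward of the final `done` step, so the parsed rewards are enough to cross-check the summary line.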
|
| 55 |
+
[START] task=medium env=code-review-env model=deepseek/deepseek-chat
|
| 56 |
[STEP] step=1 action={"operation":"add_comment","line_number":20,"severity":"major","category":"security","message":"Hardcoding sensitive information like API keys in source code is a security risk. This should be stored securely, such as in environment variables or a secrets manager.","confidence":100} reward=0.30 done=false error=null
|
| 57 |
[STEP] step=2 action={"operation":"add_comment","line_number":21,"severity":"critical","category":"security","message":"The SQL query is vulnerable to SQL injection. Use parameterized queries or prepared statements to prevent this security risk.","confidence":100} reward=0.30 done=false error=null
|
| 58 |
[STEP] step=3 action={"operation":"done"} reward=0.67 done=true error=null
|
| 59 |
[END] success=true steps=3 score=0.667 rewards=0.30,0.30,0.67
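The step 2 comment in the medium task recommends parameterized queries against SQL injection. The code under review is not shown in this log, so the following is only an illustrative sketch using the standard-library `sqlite3` module; the `users` table and the `find_user` helper are hypothetical.

```python
import sqlite3

def find_user(conn, username):
    # The "?" placeholder makes the driver bind the value separately,
    # so user input is never spliced into the SQL text.
    cur = conn.execute("SELECT id, name FROM users WHERE name = ?", (username,))
    return cur.fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO users (name) VALUES (?)", [("alice",), ("bob",)])

normal = find_user(conn, "alice")
injected = find_user(conn, "alice' OR '1'='1")
print(normal)
print(injected)
```

With string concatenation the second call would match every row; with a bound parameter the whole input is compared as a literal name and matches nothing.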
|
| 60 |
+
[START] task=hard env=code-review-env model=deepseek/deepseek-chat
|
| 61 |
[STEP] step=1 action={"operation":"add_comment","line_number":35,"severity":"critical","category":"security","message":"Using ECB mode for AES encryption is insecure as it does not provide semantic security. Consider using a more secure mode like GCM or CBC with proper initialization vectors.","confidence":95} reward=0.30 done=false error=null
|
| 62 |
[STEP] step=2 action={"operation":"add_comment","line_number":30,"severity":"critical","category":"security","message":"Using yaml.load() without specifying a safe loader is dangerous and can lead to arbitrary code execution. Use yaml.safe_load() instead.","confidence":90} reward=0.35 done=false error=null
|
| 63 |
[STEP] step=3 action={"operation":"add_comment","line_number":47,"severity":"critical","category":"bug","message":"The '_SESSION_CACHE' dictionary is accessed without any synchronization mechanism, which can lead to race conditions in a concurrent environment. Consider using a thread-safe data structure or synchronization primitives like asyncio.Lock.","confidence":85} reward=0.30 done=false error=null
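The step 3 comment suggests guarding `_SESSION_CACHE` with a primitive such as `asyncio.Lock`. The reviewed code is not included in this log, so the sketch below only illustrates the general pattern: a check-then-set on a shared dictionary with an `await` in between, serialized by a lock. The function name, the cache layout, and the `created` list are all hypothetical.

```python
import asyncio

async def get_or_create_session(cache, lock, user_id, created):
    # Holding the lock across the check and the write prevents two
    # coroutines from both seeing a miss and both creating a session.
    async with lock:
        if user_id not in cache:
            await asyncio.sleep(0)  # yield point standing in for real async work
            cache[user_id] = {"user_id": user_id}
            created.append(user_id)
        return cache[user_id]

async def main():
    cache, created = {}, []
    lock = asyncio.Lock()  # created inside the running event loop
    await asyncio.gather(
        *(get_or_create_session(cache, lock, "u1", created) for _ in range(5))
    )
    return created

created = asyncio.run(main())
print(created)
```

Without the lock, each of the five coroutines could pass the membership check before any of them writes, and the session would be created several times.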
|
|
|
|
| 259 |
|
| 260 |
|
| 261 |
|
| 262 |
+
--- [[[ LOG ARCHIVE: hf_api_test.txt ]]] ---
|
|
|
|
| 263 |
=======================================================================
|
| 264 |
CODE REVIEW OPENENV - NATIVE HUGGING FACE ROUTER INFERENCE BENCHMARK
|
| 265 |
=======================================================================
|