Spaces: DeepParmar/code-review

DeepParmar committed 16 days ago

Commit: 9e79ae0

Parent: 4e7c1df

final commit v2

Files changed (2)

.gitignore +3 -1
benchmark_comparison.md +4 -4

.gitignore CHANGED

@@ -30,4 +30,6 @@ latest-bench.md
 # Temporary test runners
 prompts/
-AUDIT_RESULTS.md
+AUDIT_RESULTS.md
+final_checklist.md
+REQUIREMENTS_CHECKLIST.md

benchmark_comparison.md CHANGED

@@ -7,15 +7,15 @@ Throughout the ascending environments, score clamping was mathematically refined
 ### 🥇 MASTER HISTORICAL BENCHMARK RESULTS
 | Exact Model ID (No Manual Labels) | Phase | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
 | :-------------------------------- | :---- | :------ | :-------- | :------ | :--------- | :-------- |
-| `deepseek/deepseek-chat` | 🕒 *Old Baseline* | 0.999 | 0.667 | 0.800 | **0.822** | 96% |
+| `deepseek-ai/DeepSeek-V3` | 🕒 *Old Baseline* | 0.999 | 0.667 | 0.800 | **0.822** | 96% |
 | `qwen/qwen-2.5-72b-instruct` | 🕒 *Old Baseline* | 0.727 | 0.824 | 0.500 | **0.684** | 95% |
 | `meta-llama/llama-3.3-70b-instruct`| 🕒 *Old Baseline* | 0.556 | 0.625 | 0.375 | **0.519** | 94% |
-| `deepseek/deepseek-chat` | 🕒 *Old Concurrency* | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
+| `deepseek-ai/DeepSeek-V3` | 🕒 *Old Concurrency* | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
 | `meta-llama/llama-3.1-70b-instruct`| 🕒 *Old Concurrency* | 0.833 | 0.636 | 0.545 | **0.671** | 96% |
 | `qwen/qwen-2.5-72b-instruct` | 🕒 *Old Concurrency* | 0.667 | 0.625 | 0.500 | **0.597** | 99% |
 | `openai/gpt-4o-mini` | 🕒 *Old Concurrency* | 0.667 | 0.588 | 0.308 | **0.521** | 90% |
 | `meta-llama/llama-3.3-70b-instruct`| 🕒 *Live OpenRouter* | 0.999 | 0.625 | 0.545 | **0.723** | 95% |
-| `deepseek/deepseek-chat` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.500 | **0.589** | 94% |
+| `deepseek-ai/DeepSeek-V3` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.500 | **0.589** | 94% |
 | `openai/gpt-4o-mini` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.324 | **0.530** | 90% |
 | `qwen/qwen-2.5-72b-instruct` | 🕒 *Live OpenRouter* | 0.500 | 0.588 | 0.500 | **0.529** | 98% |
 | `mistralai/mistral-small-3.1-24b` | 🕒 *Live OpenRouter* | 0.100 | 0.333 | 0.999 | **0.477** | 100% |
@@ -43,7 +43,7 @@ Throughout the ascending environments, score clamping was mathematically refined
 | Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
 | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
-| `deepseek/deepseek-chat` | 🚀 **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
+| `deepseek-ai/DeepSeek-V3` | 🚀 **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
 | `openai/gpt-4o-mini` | 🚀 **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
 | `meta-llama/llama-3.3-70b-instruct` | 🚀 **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
 | `qwen/qwen-2.5-72b-instruct` | 🚀 **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
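
The Avg F1 column in the tables above appears to be the unweighted mean of the Easy, Medium, and Hard F1 scores, rounded to three decimals. The repository's own scoring code is not shown in this diff, so the sketch below is only a consistency check under that assumption. The row values are copied from the tables, while the names `ROWS`, `unweighted_mean`, and `find_mismatches` are hypothetical helpers introduced for illustration.

```python
from typing import List, Tuple

# (model, phase, easy_f1, medium_f1, hard_f1, reported_avg_f1)
Row = Tuple[str, str, float, float, float, float]

# A sample of rows copied from the benchmark tables in this commit.
ROWS: List[Row] = [
    ("deepseek-ai/DeepSeek-V3", "Old Baseline", 0.999, 0.667, 0.800, 0.822),
    ("deepseek-ai/DeepSeek-V3", "Old Concurrency", 0.999, 0.667, 0.621, 0.762),
    ("deepseek-ai/DeepSeek-V3", "Live OpenRouter", 0.600, 0.667, 0.500, 0.589),
    ("deepseek-ai/DeepSeek-V3", "OpenRouter", 0.750, 0.667, 0.720, 0.712),
    ("openai/gpt-4o-mini", "OpenRouter", 0.833, 0.667, 0.581, 0.694),
    ("mistralai/mistral-small-3.1-24b", "Live OpenRouter", 0.100, 0.333, 0.999, 0.477),
]


def unweighted_mean(easy: float, medium: float, hard: float) -> float:
    """Assumed aggregation: plain arithmetic mean of the three difficulty tiers."""
    return (easy + medium + hard) / 3.0


def find_mismatches(rows: List[Row], tol: float = 0.001) -> List[Row]:
    """Return the rows whose reported average differs from the mean by more than tol."""
    return [
        row
        for row in rows
        if abs(unweighted_mean(row[2], row[3], row[4]) - row[5]) > tol
    ]


if __name__ == "__main__":
    bad = find_mismatches(ROWS)
    print("all sampled rows are consistent" if not bad else bad)
```

The tolerance of 0.001 absorbs the three-decimal rounding in the table. A row that failed the check would suggest either a transcription error or a different (for example, weighted) aggregation than the one assumed here.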