Commit 9e79ae0
Parent(s): 4e7c1df
final commit v2
- .gitignore +3 -1
- benchmark_comparison.md +4 -4
.gitignore CHANGED
@@ -30,4 +30,6 @@ latest-bench.md
 
 # Temporary test runners
 prompts/
-AUDIT_RESULTS.md
+AUDIT_RESULTS.md
+final_checklist.md
+REQUIREMENTS_CHECKLIST.md
benchmark_comparison.md CHANGED
@@ -7,15 +7,15 @@ Throughout the ascending environments, score clamping was mathematically refined
 ### MASTER HISTORICAL BENCHMARK RESULTS
 | Exact Model ID (No Manual Labels) | Phase | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
 | :-------------------------------- | :---- | :------ | :-------- | :------ | :--------- | :-------- |
-| `deepseek/
+| `deepseek-ai/DeepSeek-V3` | *Old Baseline* | 0.999 | 0.667 | 0.800 | **0.822** | 96% |
 | `qwen/qwen-2.5-72b-instruct` | *Old Baseline* | 0.727 | 0.824 | 0.500 | **0.684** | 95% |
 | `meta-llama/llama-3.3-70b-instruct`| *Old Baseline* | 0.556 | 0.625 | 0.375 | **0.519** | 94% |
-| `deepseek/
+| `deepseek-ai/DeepSeek-V3` | *Old Concurrency* | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
 | `meta-llama/llama-3.1-70b-instruct`| *Old Concurrency* | 0.833 | 0.636 | 0.545 | **0.671** | 96% |
 | `qwen/qwen-2.5-72b-instruct` | *Old Concurrency* | 0.667 | 0.625 | 0.500 | **0.597** | 99% |
 | `openai/gpt-4o-mini` | *Old Concurrency* | 0.667 | 0.588 | 0.308 | **0.521** | 90% |
 | `meta-llama/llama-3.3-70b-instruct`| *Live OpenRouter* | 0.999 | 0.625 | 0.545 | **0.723** | 95% |
-| `deepseek/
+| `deepseek-ai/DeepSeek-V3` | *Live OpenRouter* | 0.600 | 0.667 | 0.500 | **0.589** | 94% |
 | `openai/gpt-4o-mini` | *Live OpenRouter* | 0.600 | 0.667 | 0.324 | **0.530** | 90% |
 | `qwen/qwen-2.5-72b-instruct` | *Live OpenRouter* | 0.500 | 0.588 | 0.500 | **0.529** | 98% |
 | `mistralai/mistral-small-3.1-24b` | *Live OpenRouter* | 0.100 | 0.333 | 0.999 | **0.477** | 100% |
@@ -43,7 +43,7 @@ Throughout the ascending environments, score clamping was mathematically refined
 
 | Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
 | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
-| `deepseek/
+| `deepseek-ai/DeepSeek-V3` | **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
 | `openai/gpt-4o-mini` | **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
 | `meta-llama/llama-3.3-70b-instruct` | **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
 | `qwen/qwen-2.5-72b-instruct` | **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
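
In the rows added by this commit, the Avg F1 column appears to be the unweighted mean of the Easy, Medium, and Hard F1 scores, rounded to three decimals. The source does not state this, so treat it as an assumption. Below is a minimal Python sketch of a check a reviewer could run against the new `deepseek-ai/DeepSeek-V3` rows. The names `ROWS`, `avg_f1`, and `mismatches` are hypothetical and not part of the repository.

```python
from statistics import mean

# Rows copied from the added (+) lines of the diff:
# (model, phase, easy_f1, medium_f1, hard_f1, reported_avg_f1)
ROWS = [
    ("deepseek-ai/DeepSeek-V3", "Old Baseline", 0.999, 0.667, 0.800, 0.822),
    ("deepseek-ai/DeepSeek-V3", "Old Concurrency", 0.999, 0.667, 0.621, 0.762),
    ("deepseek-ai/DeepSeek-V3", "Live OpenRouter", 0.600, 0.667, 0.500, 0.589),
    ("deepseek-ai/DeepSeek-V3", "OpenRouter", 0.750, 0.667, 0.720, 0.712),
]


def avg_f1(easy, medium, hard):
    """Unweighted mean of the three tier scores, rounded to 3 decimals.

    Assumption: this is how the Avg F1 column is derived.
    """
    return round(mean([easy, medium, hard]), 3)


def mismatches(rows, tol=0.0005):
    """Return rows whose reported average is more than `tol` from the mean."""
    bad = []
    for model, phase, easy, medium, hard, reported in rows:
        # Compare with a tolerance so float noise and rounding do not
        # produce false alarms.
        if abs(mean([easy, medium, hard]) - reported) > tol + 1e-9:
            bad.append((model, phase, reported))
    return bad


if __name__ == "__main__":
    for row in ROWS:
        print(row[0], row[1], avg_f1(*row[2:5]), "reported:", row[5])
    print("mismatches:", mismatches(ROWS))
```

An empty `mismatches` list means each reported average agrees with the mean of its three tier scores to within rounding.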