DeepParmar committed on
Commit 9e79ae0 · 1 Parent(s): 4e7c1df

final commit v2

Files changed (2)
  1. .gitignore +3 -1
  2. benchmark_comparison.md +4 -4
.gitignore CHANGED
@@ -30,4 +30,6 @@ latest-bench.md
  
  # Temporary test runners
  prompts/
- AUDIT_RESULTS.md
+ AUDIT_RESULTS.md
+ final_checklist.md
+ REQUIREMENTS_CHECKLIST.md
benchmark_comparison.md CHANGED
@@ -7,15 +7,15 @@ Throughout the ascending environments, score clamping was mathematically refined
  ### 🥇 MASTER HISTORICAL BENCHMARK RESULTS
  | Exact Model ID (No Manual Labels) | Phase | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
  | :-------------------------------- | :---- | :------ | :-------- | :------ | :--------- | :-------- |
- | `deepseek/deepseek-chat` | 🕒 *Old Baseline* | 0.999 | 0.667 | 0.800 | **0.822** | 96% |
+ | `deepseek-ai/DeepSeek-V3` | 🕒 *Old Baseline* | 0.999 | 0.667 | 0.800 | **0.822** | 96% |
  | `qwen/qwen-2.5-72b-instruct` | 🕒 *Old Baseline* | 0.727 | 0.824 | 0.500 | **0.684** | 95% |
  | `meta-llama/llama-3.3-70b-instruct`| 🕒 *Old Baseline* | 0.556 | 0.625 | 0.375 | **0.519** | 94% |
- | `deepseek/deepseek-chat` | 🕒 *Old Concurrency* | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
+ | `deepseek-ai/DeepSeek-V3` | 🕒 *Old Concurrency* | 0.999 | 0.667 | 0.621 | **0.762** | 90% |
  | `meta-llama/llama-3.1-70b-instruct`| 🕒 *Old Concurrency* | 0.833 | 0.636 | 0.545 | **0.671** | 96% |
  | `qwen/qwen-2.5-72b-instruct` | 🕒 *Old Concurrency* | 0.667 | 0.625 | 0.500 | **0.597** | 99% |
  | `openai/gpt-4o-mini` | 🕒 *Old Concurrency* | 0.667 | 0.588 | 0.308 | **0.521** | 90% |
  | `meta-llama/llama-3.3-70b-instruct`| 🕒 *Live OpenRouter* | 0.999 | 0.625 | 0.545 | **0.723** | 95% |
- | `deepseek/deepseek-chat` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.500 | **0.589** | 94% |
+ | `deepseek-ai/DeepSeek-V3` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.500 | **0.589** | 94% |
  | `openai/gpt-4o-mini` | 🕒 *Live OpenRouter* | 0.600 | 0.667 | 0.324 | **0.530** | 90% |
  | `qwen/qwen-2.5-72b-instruct` | 🕒 *Live OpenRouter* | 0.500 | 0.588 | 0.500 | **0.529** | 98% |
  | `mistralai/mistral-small-3.1-24b` | 🕒 *Live OpenRouter* | 0.100 | 0.333 | 0.999 | **0.477** | 100% |
@@ -43,7 +43,7 @@ Throughout the ascending environments, score clamping was mathematically refined
  
  | Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
  | :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
- | `deepseek/deepseek-chat` | 🚀 **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
+ | `deepseek-ai/DeepSeek-V3` | 🚀 **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
  | `openai/gpt-4o-mini` | 🚀 **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
  | `meta-llama/llama-3.3-70b-instruct` | 🚀 **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
  | `qwen/qwen-2.5-72b-instruct` | 🚀 **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
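
Reviewer note: this commit only renames the DeepSeek model identifier; the scores in the four renamed rows are unchanged. As a quick sanity check on those rows, the sketch below recomputes Avg F1 from the three per-tier scores. It assumes Avg F1 is the unweighted mean of Easy, Medium, and Hard F1 rounded to three decimals. That assumption matches the numbers shown in the tables, but the benchmark's own scoring code is not part of this diff, and the `avg_f1` helper and row labels are hypothetical names used only for illustration.

```python
from statistics import fmean


def avg_f1(easy: float, medium: float, hard: float) -> float:
    """Unweighted mean of the per-tier F1 scores, rounded to 3 decimals.

    Assumption: this mirrors how the Avg F1 column is derived; the real
    scoring code is not shown in this commit.
    """
    return round(fmean([easy, medium, hard]), 3)


# (easy, medium, hard, reported avg) copied from the renamed rows in this diff.
rows = {
    "deepseek-ai/DeepSeek-V3 | Old Baseline": (0.999, 0.667, 0.800, 0.822),
    "deepseek-ai/DeepSeek-V3 | Old Concurrency": (0.999, 0.667, 0.621, 0.762),
    "deepseek-ai/DeepSeek-V3 | Live OpenRouter": (0.600, 0.667, 0.500, 0.589),
    "deepseek-ai/DeepSeek-V3 | OpenRouter": (0.750, 0.667, 0.720, 0.712),
}

for label, (easy, medium, hard, reported) in rows.items():
    computed = avg_f1(easy, medium, hard)
    # Tolerate a difference of one unit in the last displayed digit.
    status = "ok" if abs(computed - reported) < 0.0015 else "MISMATCH"
    print(f"{label}: computed={computed:.3f} reported={reported:.3f} {status}")
```

A mismatch would suggest that a row was edited by hand or that the average is weighted differently than assumed here.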