| Run | Easy | Intermediate | Hard |
|---|---|---|---|
| V1, without filtering | 94/100 (94.00%) | 52/100 (52.00%) | 29/100 (29.00%) |
| V2, with filtering | 93/100 (93.00%) | 67/100 (67.00%) | 25/100 (25.00%) |
| V2, without filtering (new data added) | 88/100 (88.00%) | 71/100 (71.00%) | 28/100 (28.00%) |

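
Each percentage above is the number of correct answers divided by the number of questions at that difficulty level. A minimal sketch of that arithmetic follows; the function name `accuracy_by_level` is hypothetical and is not taken from the evaluation scripts.

```python
def accuracy_by_level(totals, correct):
    """Return the percentage of correct answers for each difficulty level."""
    return {level: 100.0 * correct[level] / totals[level] for level in totals}


# Counts for the V1 run, copied from the results above.
totals = {"easy": 100, "intermediate": 100, "hard": 100}
correct = {"easy": 94, "intermediate": 52, "hard": 29}

percentages = accuracy_by_level(totals, correct)
print(", ".join(f"{level}: {pct:.2f}%" for level, pct in percentages.items()))
# prints: easy: 94.00%, intermediate: 52.00%, hard: 29.00%
```
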
Without context (inferenceV2.py results):

| Config | Easy | Intermediate | Hard | Std Dev | Balanced Ranking |
|---|---|---|---|---|---|
| temp1.1_qwen3-14B_finetuned.json | 88% | 64% | 44% | 18.23 | 🥇 Most balanced |
| temp1.0_qwen3-14B_finetuned.json | 86% | 66% | 42% | 18.71 | 🥈 |
| temp0.7_qwen3-14B_finetuned.json | 92% | 68% | 28% | 26.42 | 🥉 |
| temp0.5_qwen3-14B_finetuned.json | 92% | 62% | 30% | 25.50 | 4th |
| temp0.3_qwen3-14B_finetuned.json | 94% | 54% | 22% | 30.06 | 5th |
| temp0.1_qwen3-14B_finetuned.json | 90% | 62% | 22% | 28.12 | 6th |
| temp0.3_qwen3-14B_base.json | 94% | 46% | 8% | 38.14 | 7th |
| temp1.0_qwen3-14B_base.json | 96% | 52% | 8% | 39.44 | 8th |
| temp0.5_qwen3-14B_base.json | 96% | 48% | 6% | 41.45 | 9th |
| temp0.1_qwen3-14B_base.json | 96% | 46% | 6% | 41.76 | 10th |
| temp0.7_qwen3-14B_base.json | 94% | 38% | 6% | 43.39 | 11th |
| temp1.1_qwen3-14B_base.json | 94.44% | 44.44% | 5.56% | 39.96 | 12th |
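
The "balanced" ranking above orders configurations by how little their accuracy varies across the three difficulty levels. The notes do not say which estimator produced the Std Dev column, so the sketch below assumes the population standard deviation over the rounded percentages; its output is not guaranteed to match the column exactly. The helper name `spread` is hypothetical.

```python
from statistics import pstdev


def spread(easy, intermediate, hard):
    """Population standard deviation of the three per-level accuracies.

    Assumption: the source does not state whether the population or the
    sample estimator was used, so this is one plausible reading.
    """
    return pstdev([easy, intermediate, hard])


# (easy, intermediate, hard) accuracies in percent, copied from two table rows.
rows = {
    "temp0.7_qwen3-14B_finetuned": (92, 68, 28),
    "temp0.7_qwen3-14B_base": (94, 38, 6),
}

# A lower spread means a more balanced configuration.
ranked = sorted(rows, key=lambda name: spread(*rows[name]))
for rank, name in enumerate(ranked, start=1):
    print(rank, name, f"{spread(*rows[name]):.2f}")
```

Ranking by spread alone ignores absolute accuracy, so a configuration that is uniformly poor would still rank as balanced; the accuracy columns should be read alongside it.
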

With context (inferenceV3.py results):

| Model/Temp | Easy | Intermediate | Hard | Average Accuracy |
|---|---|---|---|---|
| temp1.1_qwen3-14B_finetuned_with_defs | 74.00% | 70.00% | 46.00% | 63.33% |
| temp1.0_qwen3-14B_finetuned_with_defs | 88.00% | 66.00% | 44.00% | 66.00% |
| temp0.7_qwen3-14B_finetuned_with_defs | 94.00% | 74.00% | 32.00% | 66.67% |
| temp0.5_qwen3-14B_finetuned_with_defs | 86.00% | 76.00% | 24.00% | 62.00% |
| temp0.3_qwen3-14B_finetuned_with_defs | 86.00% | 70.00% | 24.00% | 60.00% |
| temp0.1_qwen3-14B_finetuned_with_defs | 90.00% | 64.00% | 28.00% | 60.67% |
| temp1.1_qwen3-14B_base_with_defs | 96.00% | 50.00% | 14.00% | 53.33% |
| temp1.0_qwen3-14B_base_with_defs | 96.00% | 58.00% | 12.00% | 55.33% |
| temp0.7_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.5_qwen3-14B_base_with_defs | 95.56% | 62.22% | 6.67% | 54.82% |
| temp0.3_qwen3-14B_base_with_defs | 96.00% | 58.00% | 10.00% | 54.67% |
| temp0.1_qwen3-14B_base_with_defs | 96.00% | 58.00% | 8.00% | 54.00% |
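
The Average Accuracy column is consistent with an unweighted mean of the three per-level accuracies. A minimal sketch of that calculation, using three rows from the table; the function name `average_accuracy` is hypothetical.

```python
def average_accuracy(easy, intermediate, hard):
    """Unweighted mean of the three per-level accuracies, in percent."""
    return (easy + intermediate + hard) / 3


# (easy, intermediate, hard) accuracies in percent, copied from the table.
rows = {
    "temp1.0_qwen3-14B_finetuned_with_defs": (88.0, 66.0, 44.0),
    "temp0.7_qwen3-14B_finetuned_with_defs": (94.0, 74.0, 32.0),
    "temp1.0_qwen3-14B_base_with_defs": (96.0, 58.0, 12.0),
}

for name, scores in rows.items():
    print(name, f"{average_accuracy(*scores):.2f}%")

# Configuration with the highest average among the rows listed here.
best = max(rows, key=lambda name: average_accuracy(*rows[name]))
print("best:", best)
```

Because every level carries the same weight, a gain on easy questions offsets an equal loss on hard ones; the per-level columns remain the better guide when hard-question accuracy matters most.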