# qwen-2.5-14b

We instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 on a mixed-language dataset. The tables below compare the resulting models against their original checkpoints and other LLMs on English and Hindi benchmarks.
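
For reference, a checkpoint fine-tuned this way can be loaded like any other causal LM from the Hub. Below is a minimal usage sketch with `transformers`; the repo id is a placeholder, substitute the actual model path for this checkpoint.

```python
# Minimal usage sketch with Hugging Face transformers.
# NOTE: "DrishtiSharma/qwen-2.5-14b" is a placeholder repo id; replace it with
# the actual model path for this checkpoint.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "DrishtiSharma/qwen-2.5-14b"  # placeholder, adjust as needed
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

# Example mixed-language prompt: "What is the capital of India?" in Hindi.
messages = [{"role": "user", "content": "भारत की राजधानी क्या है?"}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=64)
# Decode only the newly generated tokens.
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```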

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average* | MMLU-Pro | GPQA | MuSR | BBH | MATH |
|---|---|---|---|---|---|---|---|---|---|---|---|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.04 | 22.95 | 62.23 | 23.70 | 31.32 | 22.66 | 25.34 | 42.72 | 41.12 | 2.95 |
| Airavata-7B | 25.09 | 30.47 | 25.31 | 62.17 | 33.20 | 35.25 | 16.35 | 27.43 | 37.57 | 36.00 | 13.60 |
| sarvam-1-2B | 30.03 | 33.25 | 62.17 | 42.80 | 27.90 | 39.23 | - | - | - | - | - |
| Nemotron-4-Mini-Hindi-Instruct | 55.80 | 71.63 | 62.11 | 68.10 | 43.20 | 60.17 | 25.95 | 30.87 | 41.53 | 40.11 | 2.04 |
| Llama-3-Nanda-10B-Chat | 65.36 | 80.64 | 82.29 | 67.60 | 50.61 | 69.30 | - | - | - | - | - |
| Krutrim-2-12b-instruct | 67.32 | 81.10 | 84.74 | 76.30 | 56.10 | 73.11 | - | - | - | - | - |
| aya-expanse-8b | 74.06 | 87.08 | 86.45 | 83.30 | 56.89 | 77.56 | 30.04 | 30.29 | 37.17 | 49.42 | 7.02 |
| aya-expanse-32B | 85.41 | 95.08 | 90.43 | 89.80 | 69.71 | 86.08 | 41.30 | 32.55 | 38.62 | 56.29 | 13.37 |
| Our Qwen Model (14b) | 90.61 | 94.82 | 88.53 | 90.70 | 75.00 | 87.93 | 52.63 | 36.24 | 44.84 | 64.97 | 25.08 |
| Our Phi Model (14b) | 97.39 | 92.24 | 87.65 | 87.40 | 75.59 | 88.05 | 52.39 | 39.77 | 49.07 | 66.97 | 23.11 |

Table 1: Scores (two decimal places) of our Qwen-2.5-14B and Phi-4 models and other LLMs on several English benchmarks. *Average is taken over ARC-C, ARC-E, BoolQ, CMCQ, and MMLU.

| Model | ARC-C | ARC-E | BoolQ | CMCQ | MMLU | Average |
|---|---|---|---|---|---|---|
| AryaBhatta-GemmaUltra-8.5B | 22.70 | 25.08 | 22.95 | 62.17 | 23.80 | 31.34 |
| Airavata-7B | 22.87 | 25.13 | 23.28 | 62.17 | 33.20 | 33.33 |
| sarvam-1-2B | 32.76 | 35.06 | 62.16 | 47.10 | 24.22 | 40.26 |
| Llama-3-Nanda-10B-Chat | 45.99 | 60.56 | 71.96 | 54.70 | 36.35 | 53.91 |
| Nemotron-4-Mini-Hindi-4B-Instruct | 50.68 | 63.72 | 68.74 | 51.30 | 37.18 | 54.32 |
| Krutrim-2-12b-instruct | 56.83 | 70.66 | 78.86 | 64.10 | 46.51 | 63.39 |
| aya-expanse-8b | 57.42 | 72.90 | 80.42 | 69.00 | 43.39 | 64.63 |
| aya-expanse-32B | 73.29 | 85.48 | 87.73 | 79.70 | 56.96 | 76.63 |
| Our Qwen Model (14b) | 74.06 | 81.23 | 84.07 | 78.20 | 53.85 | 74.82 |
| Our Phi Model (14b) | 81.74 | 89.06 | 86.02 | 78.70 | 56.39 | 78.38 |

Table 2: Scores (two decimal places) of our Qwen-2.5-14B and Phi-4 models and other LLMs on several Hindi benchmarks.

| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|---|---|---|---|---|---|---|---|
| ARC-Easy | En | 95.45 | 94.82 | 🔻 0.63 | 97.31 | 97.39 | 🔼 0.08 |
| ARC-Easy | Hi | 78.49 | 81.23 | 🔼 2.74 | 86.87 | 89.06 | 🔼 2.19 |
| ARC-Challenge | En | 90.87 | 90.61 | 🔻 0.26 | 92.41 | 92.24 | 🔻 0.17 |
| ARC-Challenge | Hi | 69.62 | 74.06 | 🔼 4.44 | 79.18 | 81.74 | 🔼 2.56 |
| BoolQ | En | 86.09 | 88.53 | 🔼 2.44 | 86.30 | 87.65 | 🔼 1.35 |
| BoolQ | Hi | 78.89 | 84.07 | 🔼 5.18 | 82.72 | 86.02 | 🔼 3.30 |
| Context-MCQ | En | 91.20 | 90.70 | 🔻 0.50 | 86.30 | 87.40 | 🔼 1.10 |
| Context-MCQ | Hi | 77.40 | 78.20 | 🔼 0.80 | 75.70 | 78.70 | 🔼 3.00 |
| MMLU | En | 74.37 | 75.00 | 🔼 0.63 | 74.67 | 75.59 | 🔼 0.92 |
| MMLU | Hi | 52.16 | 53.85 | 🔼 1.69 | 53.24 | 56.39 | 🔼 3.15 |
| Average | En | 87.60 | 87.93 | 🔼 0.33 | 87.40 | 88.05 | 🔼 0.65 |
| Average | Hi | 71.31 | 74.82 | 🔼 3.51 | 75.54 | 78.38 | 🔼 2.84 |
| Overall | En+Hi | 79.46 | 81.38 | 🔼 1.92 | 81.47 | 83.22 | 🔼 1.75 |

Table 3: Performance of our fine-tuned models compared to the original Qwen-2.5-14B-Instruct and Phi-4 on each benchmark (evaluated via log-likelihoods).
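
For context, the log-likelihood protocol scores each multiple-choice option by the total log-probability the model assigns to that answer text given the question, and takes the highest-scoring option as the prediction. The sketch below illustrates that idea; it is not the exact evaluation code used for these numbers, and it ignores boundary tokenization effects.

```python
# Illustrative log-likelihood multiple-choice scoring (simplified sketch).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Qwen/Qwen2.5-14B-Instruct"  # any causal LM works for the sketch
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)

def choice_loglikelihood(prompt: str, choice: str) -> float:
    """Sum of log p(choice tokens | prompt) under the model."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    full_ids = tokenizer(prompt + choice, return_tensors="pt").input_ids.to(model.device)
    with torch.no_grad():
        logits = model(full_ids).logits
    # Logits at position i predict the token at position i + 1.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    n_prompt = prompt_ids.shape[-1]
    target = full_ids[0, n_prompt:]  # tokens belonging to the answer continuation
    return (
        log_probs[n_prompt - 1:, :]
        .gather(-1, target.unsqueeze(-1))
        .sum()
        .item()
    )

question = "Question: The Sun rises in the ...\nAnswer:"
choices = [" east", " west", " north", " south"]
best = max(choices, key=lambda c: choice_loglikelihood(question, c))
print(best)
```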

| Benchmark | Lang | Qwen-2.5-14B-Instruct | Our Qwen | Change | Phi-4 | Our Phi | Change |
|---|---|---|---|---|---|---|---|
| MMLU-Pro | En | 49.04 | 52.63 | 🔼 3.59 | 53.78 | 52.39 | 🔻 1.39 |
| MATH hard | En | 0.00 | 25.08 | 🔷 N/A | 12.31 | 23.11 | 🔼 10.80 |
| GPQA | En | 32.21 | 36.24 | 🔼 4.03 | 33.72 | 39.77 | 🔼 6.05 |
| MuSR | En | 40.87 | 44.84 | 🔼 3.97 | 41.01 | 49.07 | 🔼 8.06 |
| BigBench-Hard | En | 63.74 | 64.97 | 🔼 1.23 | 68.60 | 66.97 | 🔻 1.63 |
| Average | En | 37.17 | 44.75 | 🔼 7.58 | 41.88 | 46.26 | 🔼 4.38 |

Table 4: Performance of our fine-tuned models compared to the original Qwen-2.5-14B-Instruct and Phi-4 on each benchmark (evaluated via the eval harness).
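
A rough reproduction sketch for Table 4 using EleutherAI's lm-evaluation-harness programmatic API is shown below. It assumes the Open LLM Leaderboard v2 task groupings; the task names, batch size, and repo id are assumptions and may need adjustment for your harness version.

```python
# Rough reproduction sketch with lm-evaluation-harness (pip install lm-eval).
# Task names follow the Open LLM Leaderboard v2 groupings and may differ across
# harness versions; "DrishtiSharma/qwen-2.5-14b" is a placeholder repo id.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=DrishtiSharma/qwen-2.5-14b,dtype=bfloat16",
    tasks=[
        "leaderboard_mmlu_pro",
        "leaderboard_gpqa",
        "leaderboard_musr",
        "leaderboard_bbh",
        "leaderboard_math_hard",
    ],
    batch_size=8,
)
print(results["results"])
```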