LiyuanLucasLiu committed
Commit e87098c
1 Parent(s): c4cfe3e

Update README.md
added phi3.5 result

README.md CHANGED
@@ -69,29 +69,30 @@ To understand the capabilities, we compare GRIN MoE with a set of models over a

### Popular Benchmarks

-
- | | GRIN MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
- |---------------|-----------|---------|---------|--------|--------|--------|-------|
- | MMLU | 79.4 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
- | HellaSwag | 83.7 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
- | ANLI | 60.6 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
- | GSM-8K | 90.4 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
+ Note that a different version of this model, produced with mid-training and post-training that emphasize long context and multilingual ability, has been released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct.
+
+ | | GRIN MoE (16x3.8B) | Phi-3.5-MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
+ |---------------|-----------|---------|---------|---------|--------|--------|--------|-------|
+ | MMLU | 79.4 | 78.9 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
+ | HellaSwag | 83.7 | 83.8 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
+ | ANLI | 60.6 | 59.8 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
+ | GSM-8K | 90.4 | 88.7 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
| MedQA | 70.4 | 62.2 | 67.9 | 60.5 | 78.5 | 63.4 | 88.9 |
- | AGIEval | 48.2 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
- | TriviaQA | 73.9 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
- | Arc-C | 92.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
- | Arc-E | 98.0 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
- | PIQA | 89.0 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
- | SociQA | 79.5 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
- | BigBench-Hard | 81.4 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
- | WinoGrande | 81.4 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
- | OpenBookQA | 89.8 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
- | BoolQ | 83.4 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
- | CommonSenseQA | 81.8 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
- | TruthfulQA | 74.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
- | HumanEval | 74.4 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
- | MBPP | 80.3 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
- | Average |
+ | AGIEval | 48.2 | 50.3 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
+ | TriviaQA | 73.9 | 71.6 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
+ | Arc-C | 92.0 | 91.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
+ | Arc-E | 98.0 | 97.1 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
+ | PIQA | 89.0 | 88.6 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
+ | SociQA | 79.5 | 78.0 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
+ | BigBench-Hard | 81.4 | 79.1 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
+ | WinoGrande | 81.4 | 81.3 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
+ | OpenBookQA | 89.8 | 89.6 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
+ | BoolQ | 83.4 | 84.5 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
+ | CommonSenseQA | 81.8 | 83.5 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
+ | TruthfulQA | 74.5 | 77.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
+ | HumanEval | 74.4 | 70.7 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
+ | MBPP | 80.3 | 80.8 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
+ | Average | 79.6 | 79.2 | 69.6 | 76.2 | 69.4 | 82.8 | 75.2 | 85.7 |

### Livebench
Performance on LiveBench-2024-07-25. Models are ranked by their average score (AVG). *Baseline results are referenced from the official benchmark.
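As a quick sanity check on the newly filled Average row, the GRIN MoE column can be recomputed from the 19 benchmark scores listed in the updated table. The short Python sketch below assumes the Average row is a plain unweighted mean of the listed scores (an assumption, not something stated in the commit) and uses no external dependencies.

```python
# Recompute the GRIN MoE "Average" entry as the unweighted mean of the
# 19 per-benchmark scores from the updated table (assumption: the
# Average row is a plain mean of these values).
grin_moe_scores = {
    "MMLU": 79.4, "HellaSwag": 83.7, "ANLI": 60.6, "GSM-8K": 90.4,
    "MedQA": 70.4, "AGIEval": 48.2, "TriviaQA": 73.9, "Arc-C": 92.0,
    "Arc-E": 98.0, "PIQA": 89.0, "SociQA": 79.5, "BigBench-Hard": 81.4,
    "WinoGrande": 81.4, "OpenBookQA": 89.8, "BoolQ": 83.4,
    "CommonSenseQA": 81.8, "TruthfulQA": 74.5, "HumanEval": 74.4,
    "MBPP": 80.3,
}

mean_score = sum(grin_moe_scores.values()) / len(grin_moe_scores)
print(f"GRIN MoE mean over {len(grin_moe_scores)} benchmarks: {mean_score:.1f}")
# Expected output: GRIN MoE mean over 19 benchmarks: 79.6
```

Run as-is, it prints 79.6, matching the Average entry in the table above.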