LiyuanLucasLiu committed
Commit e87098c
1 Parent(s): c4cfe3e

Update README.md
added phi3.5 result

README.md CHANGED
@@ -69,29 +69,30 @@ To understand the capabilities, we compare GRIN MoE with a set of models over a

### Popular Benchmarks

-
- | | GRIN MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
- |---------------|-----------|---------|---------|--------|--------|--------|-------|
- | MMLU | 79.4 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
- | HellaSwag | 83.7 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
- | ANLI | 60.6 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
- | GSM-8K | 90.4 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
+ Note that a different version of this model, produced with mid-training and post-training that emphasize long context and multilingual ability, has been released at https://huggingface.co/microsoft/Phi-3.5-MoE-instruct.
+
+ | | GRIN MoE (16x3.8B) | Phi-3.5-MoE (16x3.8B) | Mixtral (8x7B) | Mixtral (8x22B) | Llama3 (8B) | Llama3 (70B) | GPT3.5 | GPT4o |
+ |---------------|-----------|---------|---------|---------|--------|--------|--------|-------|
+ | MMLU | 79.4 | 78.9 | 70.5 | 76.2 | 66.5 | 80.2 | 71.4 | 86.9 |
+ | HellaSwag | 83.7 | 83.8 | 70.4 | 79.0 | 71.1 | 82.6 | 78.8 | 91.7 |
+ | ANLI | 60.6 | 59.8 | 55.2 | 65.2 | 57.3 | 68.3 | 58.1 | 75.7 |
+ | GSM-8K | 90.4 | 88.7 | 64.7 | 83.8 | 77.4 | 93.5 | 78.1 | 93.8 |
| MedQA | 70.4 | 62.2 | 67.9 | 60.5 | 78.5 | 63.4 | 88.9 |
- | AGIEval | 48.2 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
- | TriviaQA | 73.9 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
- | Arc-C | 92.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
- | Arc-E | 98.0 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
- | PIQA | 89.0 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
- | SociQA | 79.5 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
- | BigBench-Hard | 81.4 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
- | WinoGrande | 81.4 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
- | OpenBookQA | 89.8 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
- | BoolQ | 83.4 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
- | CommonSenseQA | 81.8 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
- | TruthfulQA | 74.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
- | HumanEval | 74.4 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
- | MBPP | 80.3 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
- | Average |
+ | AGIEval | 48.2 | 50.3 | 45.2 | 54.0 | 42.0 | 56.9 | 48.4 | 37.6 |
+ | TriviaQA | 73.9 | 71.6 | 78.5 | 82.2 | 67.7 | 84.5 | 85.8 | 66.0 |
+ | Arc-C | 92.0 | 91.0 | 87.3 | 91.3 | 82.8 | 93.0 | 87.4 | 97.0 |
+ | Arc-E | 98.0 | 97.1 | 95.6 | 96.9 | 93.4 | 98.2 | 96.3 | 99.0 |
+ | PIQA | 89.0 | 88.6 | 86.0 | 85.0 | 75.7 | 85.3 | 86.6 | 92.9 |
+ | SociQA | 79.5 | 78.0 | 75.9 | 78.2 | 73.9 | 81.1 | 68.3 | 81.4 |
+ | BigBench-Hard | 81.4 | 79.1 | 69.7 | 81.8 | 51.5 | 80.2 | 68.3 | 81.2 |
+ | WinoGrande | 81.4 | 81.3 | 62.0 | 75.3 | 65.0 | 83.3 | 68.8 | 89.3 |
+ | OpenBookQA | 89.8 | 89.6 | 85.8 | 88.6 | 82.6 | 91.8 | 86.0 | 95.2 |
+ | BoolQ | 83.4 | 84.5 | 77.6 | 82.7 | 80.9 | 89.1 | 79.1 | 90.6 |
+ | CommonSenseQA | 81.8 | 83.5 | 78.1 | 82.0 | 79.0 | 84.4 | 79.6 | 88.5 |
+ | TruthfulQA | 74.5 | 77.5 | 60.1 | 67.4 | 63.2 | 81.9 | 85.8 | 85.6 |
+ | HumanEval | 74.4 | 70.7 | 37.8 | 39.6 | 60.4 | 78.7 | 62.2 | 92.1 |
+ | MBPP | 80.3 | 80.8 | 60.2 | 70.7 | 67.7 | 81.3 | 77.8 | 90.4 |
+ | Average | 79.6 | 79.2 | 69.6 | 76.2 | 69.4 | 82.8 | 75.2 | 85.7 |

### Livebench
Performance on LiveBench-2024-07-25. Models are ranked by their average score (AVG). *Baseline results are referenced from the official benchmark.
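As a quick sanity check on the newly filled Average row, the GRIN MoE column can be recomputed from the 19 benchmark scores listed in the updated table. The short Python sketch below assumes the Average row is a plain unweighted mean of the listed scores (an assumption, not something stated in the commit) and uses no external dependencies.

```python
# Recompute the GRIN MoE "Average" entry as the unweighted mean of the
# 19 per-benchmark scores from the updated table (assumption: the
# Average row is a plain mean of these values).
grin_moe_scores = {
    "MMLU": 79.4, "HellaSwag": 83.7, "ANLI": 60.6, "GSM-8K": 90.4,
    "MedQA": 70.4, "AGIEval": 48.2, "TriviaQA": 73.9, "Arc-C": 92.0,
    "Arc-E": 98.0, "PIQA": 89.0, "SociQA": 79.5, "BigBench-Hard": 81.4,
    "WinoGrande": 81.4, "OpenBookQA": 89.8, "BoolQ": 83.4,
    "CommonSenseQA": 81.8, "TruthfulQA": 74.5, "HumanEval": 74.4,
    "MBPP": 80.3,
}

mean_score = sum(grin_moe_scores.values()) / len(grin_moe_scores)
print(f"GRIN MoE mean over {len(grin_moe_scores)} benchmarks: {mean_score:.1f}")
# Expected output: GRIN MoE mean over 19 benchmarks: 79.6
```

Run as-is, it prints 79.6, matching the Average entry in the table above.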