@@ -41,19 +41,19 @@ Eurux-8x22B-NCA is SFT and [NCA](https://arxiv.org/abs/2402.05369) fine-tuned fr
 It achieves superb reasoning performance as well as excellent chat & instruction-following capabilities.
 ## Evaluation
-We conducted overall coding, math, reasoning, knowledge, instruction-following and chat benchmarking. Results are shown below:
-| Models/Benchmarks|   Coding  |       |          |   Math  |       |           | Reasoning | Knowledge | Ins-Following |   Chat   |
-|-----------------|:---------:|:-----:|:--------:|:-------:|:-----:|:---------:|:---------:|:---------:|:-------------:|:--------:|
-|                 | HumanEval |  MBPP | LeetCode | GSMPLUS |  MATH | TheoremQA | BBH (CoT) |    MMLU   |     IFEval    | MT-Bench |
-| GPT-3.5-Turbo   |   76.8    | 82.5  |   23.3   |  61.2   | 37.8  |   35.6    |   70.1    |   70.0    |     56.6      |   7.94   |
-| GPT-4           |   85.4    | 83.5  |   41.8   |  85.6   | 69.7  |   52.4    |   86.7    |   86.4    |     79.7      |   8.96   |
-| Mixtral-8x7B-Ins|   50.6    | 50.1  |   5.6    |  49.6   | 25.9  |   20.4    |   73.5    |   70.3    |     48.8      |   8.30   |
-| DS-LM-67B-Chat  |   70.7    | 65.7  |   20.0   |  65.0   | 41.0  |   17.9    |   78.9    |   72.3    |     52.7      |     -    |
-| QWen-1.5-72B    |   71.3    | 56.9  |   15.6   |  65.4   | 43.4  |   18.5    |   78.0    |   72.9    |     53.4      |   8.61   |
-| Eurus-70b-NCA   |   79.3    | 71.9  |   33.3   |  62.8   | 41.7  |   32.6    |   80.0    |   59.4    |     49.2      |   7.54   |
-| Eurux-8x22b-KTO |   71.3    | 68.9  |   29.4   |  68.3   | 48.4  |   35.3    |   83.6    |   75.9    |     67.1      |   8.58   |
-| Eurux-8x22b-NCA |   75.0    | 69.7  |   35.0   |  68.1   | 49.0  |   35.5    |   83.5    |   75.6    |     67.1      |   8.46   |
+We benchmarked the model on coding, math, reasoning, knowledge, instruction following, and chat. Results are shown below, with the best scores among open-source models in **bold**:
+| Models/Benchmarks |   Coding  |           |           |    Math   |           |           | Reasoning | Knowledge | Ins-Following |    Chat   |
+|-------------------|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:---------:|:-------------:|:---------:|
+|                   | HumanEval |    MBPP   |  LeetCode |  GSMPLUS  |    MATH   | TheoremQA | BBH (CoT) |    MMLU   |     IFEval    |  MT-Bench |
+| GPT-3.5-Turbo     |   76.8    |   82.5    |   23.3    |   61.2    |   37.8    |   35.6    |   70.1    |   70.0    |     56.6      |   7.94    |
+| GPT-4             |   85.4    |   83.5    |   41.8    |   85.6    |   69.7    |   52.4    |   86.7    |   86.4    |     79.7      |   8.96    |
+| Mixtral-8x7B-Ins  |   50.6    |   50.1    |    5.6    |   49.6    |   25.9    |   20.4    |   73.5    |   70.3    |     48.8      |   8.30    |
+| DS-LM-67B-Chat    |   70.7    |   65.7    |   20.0    |   65.0    |   41.0    |   17.9    |   78.9    |   72.3    |     52.7      |   8.35    |
+| QWen-1.5-72B      |   71.3    |   56.9    |   15.6    |   65.4    |   43.4    |   18.5    |   78.0    |   72.9    |     53.4      | **8.61**  |
+| Eurus-70b-NCA     | **79.3**  | **71.9**  |   33.3    |   62.8    |   41.7    |   32.6    |   80.0    |   59.4    |     49.2      |   7.54    |
+| Eurux-8x22b-KTO   |   71.3    |   68.9    |   29.4    | **68.3**  |   48.4    |   35.3    | **83.6**  | **75.9**  |   **67.1**    |   8.58    |
+| Eurux-8x22b-NCA   |   75.0    |   69.7    | **35.0**  |   68.1    | **49.0**  | **35.5**  |   83.5    |   75.6    |   **67.1**    |   8.46    |
 ## Usage