yangapku committed
Commit 63f5577 · 1 Parent(s): 9b22fd2

update readme

Files changed (1):
  1. README.md +57 -61

README.md CHANGED
@@ -281,25 +281,22 @@ Note: Due to rounding errors caused by hardware and framework, differences in re

 #### C-Eval

- 在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的zero-shot准确率
-
- We demonstrate the zero-shot accuracy of Qwen-1.8B-Chat on C-Eval validation set
-
- | Model | Avg. Acc. |
- |:------------------------:|:---------:|
- | **Qwen-7B-Chat** | 54.2 |
- | InternLM-7B-Chat | 53.2 |
- | **Qwen-1.8B-Chat** | 55.6 |
- | ChatGLM2-6B-Chat | 50.7 |
- | Baichuan-13B-Chat | 50.4 |
- | Chinese-Alpaca-Plus-13B | 43.3 |
- | Chinese-Alpaca-2-7B | 41.3 |
- | LLaMA2-13B-Chat | 40.6 |
- | LLaMA2-7B-Chat | 31.9 |
- | OpenLLaMA-Chinese-3B | 24.4 |
- | Firefly-Bloom-1B4 | 23.6 |
- | OpenBuddy-3B | 23.5 |
- | RedPajama-INCITE-Chat-3B | 18.3 |

 C-Eval测试集上,Qwen-1.8B-Chat模型的zero-shot准确率结果如下:

@@ -307,35 +304,35 @@ The zero-shot accuracy of Qwen-1.8B-Chat on C-Eval testing set is provided below

 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
 | :---------------------: | :------: | :--: | :-------------: | :--------: | :----: |
- | **Qwen-7B-Chat** | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
- | Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
- | ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
- | **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |

 ### 英文评测(English Evaluation)

 #### MMLU

- [MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的zero-shot准确率如下,效果同样在同类对齐模型中同样表现较优。

- The zero-shot accuracy of Qwen-1.8B-Chat on MMLU is provided below.
 The performance of Qwen-1.8B-Chat is still among the top of human-aligned models of comparable size.

- | Model | Avg. Acc. |
- |:------------------------:|:---------:|
- | **Qwen-7B-Chat** | 53.9 |
- | ChatGLM2-12B-Chat | 52.1 |
- | Baichuan-13B-Chat | 52.1 |
- | InternLM-7B-Chat | 50.8 |
- | LLaMA2-7B-Chat | 47.0 |
- | ChatGLM2-6B-Chat | 45.5 |
- | **Qwen-1.8B-Chat** | 43.3 |
- | OpenLLaMA-Chinese-3B | 25.7 |
- | OpenBuddy-3B | 25.5 |
- | RedPajama-INCITE-Chat-3B | 25.5 |
- | Firefly-Bloom-1B4 | 23.8 |

 ### 代码评测(Coding Evaluation)

@@ -345,16 +342,16 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/

 | Model | Pass@1 |
 |:------------------------:|:------:|
- | **Qwen-7B-Chat** | 24.4 |
- | LLaMA2-13B-Chat | 18.9 |
- | Baichuan-13B-Chat | 16.5 |
- | InternLM-7B-Chat | 14.0 |
- | LLaMA2-7B-Chat | 12.2 |
- | **Qwen-1.8B-Chat** | 26.2 |
- | OpenBuddy-3B | 10.4 |
- | RedPajama-INCITE-Chat-3B | 6.1 |
- | OpenLLaMA-Chinese-3B | 4.9 |
 | Firefly-Bloom-1B4 | 0.6 |

 ### 数学评测(Mathematics Evaluation)

@@ -362,20 +359,19 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/

 The accuracy of Qwen-1.8B-Chat on GSM8K is shown below.

- | Model | Zero-shot Acc. | 4-shot Acc. |
- |:------------------------:|:--------------:|:-----------:|
- | **Qwen-7B-Chat** | 41.1 | 43.5 |
- | ChatGLM2-12B-Chat | - | 38.1 |
- | Baichuan-13B-Chat | - | 36.3 |
- | InternLM-7B-Chat | 32.6 | 34.5 |
- | LLaMA2-13B-Chat | 29.4 | 36.7 |
- | **Qwen-1.8B-Chat** | 33.7 | 30.2 |
- | LLaMA2-7B-Chat | 20.4 | 28.2 |
- | ChatGLM2-6B-Chat | - | 28.0 |
- | OpenBuddy-3B | 10.6 | 12.6 |
- | OpenLLaMA-Chinese-3B | 2.6 | 3.0 |
- | RedPajama-INCITE-Chat-3B | 2.5 | 2.5 |
- | Firefly-Bloom-1B4 | 2.4 | 1.8 |

 ## 评测复现(Reproduction)
 
 #### C-Eval

+ 在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的准确率
+
+ We report the accuracy of Qwen-1.8B-Chat on the C-Eval validation set.
+
+ | Model | Acc. |
+ |:--------------------------------:|:---------:|
+ | RedPajama-INCITE-Chat-3B | 18.3 |
+ | OpenBuddy-3B | 23.5 |
+ | Firefly-Bloom-1B4 | 23.6 |
+ | OpenLLaMA-Chinese-3B | 24.4 |
+ | LLaMA2-7B-Chat | 31.9 |
+ | ChatGLM2-6B-Chat | 52.6 |
+ | InternLM-7B-Chat | 53.6 |
+ | **Qwen-1.8B-Chat (0-shot)** | 55.6 |
+ | **Qwen-7B-Chat (0-shot)** | 59.7 |
+ | **Qwen-7B-Chat (5-shot)** | 59.3 |

 C-Eval测试集上,Qwen-1.8B-Chat模型的zero-shot准确率结果如下:

 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
 | :---------------------: | :------: | :--: | :-------------: | :--------: | :----: |
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
+ | ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+ | Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
+ | **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
+ | **Qwen-7B-Chat** | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |

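The (0-shot) and (5-shot) labels above refer to how many solved exemplar questions are prepended to the prompt. A minimal sketch of how such k-shot multiple-choice prompts are typically built (illustrative only; the function names and prompt template are assumptions, not the official eval harness):

```python
# Illustrative k-shot prompt construction for multiple-choice benchmarks
# such as C-Eval/MMLU. The exact template is an assumption; official
# evaluation scripts may format prompts differently.
def format_example(question, choices, answer=""):
    """Render one question with lettered choices; leave the answer blank
    for the question the model must complete."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in sorted(choices.items())]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(exemplars, question, choices):
    """k-shot prompt: k solved exemplars, then the unanswered question.
    An empty exemplar list gives the zero-shot setting."""
    parts = [format_example(q, c, a) for q, c, a in exemplars]
    parts.append(format_example(question, choices))
    return "\n\n".join(parts)
```

Accuracy is then the fraction of questions whose predicted choice letter matches the reference answer.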
 ### 英文评测(English Evaluation)

 #### MMLU

+ [MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的准确率如下,效果在同类对齐模型中同样表现较优。

+ The accuracy of Qwen-1.8B-Chat on MMLU is provided below.
 The performance of Qwen-1.8B-Chat is still among the top of human-aligned models of comparable size.

+ | Model | Acc. |
+ |:--------------------------------:|:---------:|
+ | Firefly-Bloom-1B4 | 23.8 |
+ | OpenBuddy-3B | 25.5 |
+ | RedPajama-INCITE-Chat-3B | 25.5 |
+ | OpenLLaMA-Chinese-3B | 25.7 |
+ | ChatGLM2-6B-Chat | 46.0 |
+ | LLaMA2-7B-Chat | 46.2 |
+ | InternLM-7B-Chat | 51.1 |
+ | Baichuan2-7B-Chat | 52.9 |
+ | **Qwen-1.8B-Chat (0-shot)** | 43.3 |
+ | **Qwen-7B-Chat (0-shot)** | 55.8 |
+ | **Qwen-7B-Chat (5-shot)** | 57.0 |

 ### 代码评测(Coding Evaluation)

 | Model | Pass@1 |
 |:------------------------:|:------:|
 | Firefly-Bloom-1B4 | 0.6 |
+ | OpenLLaMA-Chinese-3B | 4.9 |
+ | RedPajama-INCITE-Chat-3B | 6.1 |
+ | OpenBuddy-3B | 10.4 |
+ | ChatGLM2-6B-Chat | 11.0 |
+ | LLaMA2-7B-Chat | 12.2 |
+ | Baichuan2-7B-Chat | 13.4 |
+ | InternLM-7B-Chat | 14.6 |
+ | **Qwen-1.8B-Chat** | 26.2 |
+ | **Qwen-7B-Chat** | 37.2 |

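Pass@1 here is the fraction of HumanEval problems whose generated completion passes the unit tests. For reference, the general unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) can be sketched as follows; with a single completion per problem, pass@1 reduces to a plain pass rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples passing the tests,
    k = evaluation budget. Returns the estimated probability that at
    least one of k sampled completions passes."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 3, 1)` is 0.3: with 3 of 10 samples passing, a single draw passes 30% of the time.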
356
  ### 数学评测(Mathematics Evaluation)
357
 
 
359
 
360
  The accuracy of Qwen-1.8B-Chat on GSM8K is shown below
361
 
362
+ | Model | Acc. |
363
+ |:------------------------------------:|:--------:|
364
+ | Firefly-Bloom-1B4 | 2.4 |
365
+ | RedPajama-INCITE-Chat-3B | 2.5 |
366
+ | OpenLLaMA-Chinese-3B | 3.0 |
367
+ | OpenBuddy-3B | 12.6 |
368
+ | LLaMA2-7B-Chat | 26.3 |
369
+ | ChatGLM2-6B-Chat | 28.8 |
370
+ | Baichuan2-7B-Chat | 32.8 |
371
+ | InternLM-7B-Chat | 33.0 |
372
+ | **Qwen-1.8B-Chat (0-shot)** | 33.7 |
373
+ | **Qwen-7B-Chat (0-shot)** | 50.3 |
374
+ | **Qwen-7B-Chat (8-shot)** | 54.1 |
 
375
 
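GSM8K accuracy is scored by exact match on the final numeric answer. A common scoring heuristic (an assumption for illustration, not necessarily the script behind these numbers) extracts the last number from the model's generation and compares it numerically to the reference:

```python
import re

# Heuristic GSM8K scoring: the model's final answer is taken to be the
# last number appearing in its generation. This is a common convention,
# assumed here for illustration.
_NUM = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in `text` as a string, or None if absent.
    Thousands separators like '1,234' are normalized first."""
    matches = _NUM.findall(text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_accuracy(generations, references):
    """Fraction of generations whose final number equals the reference."""
    correct = 0
    for gen, ref in zip(generations, references):
        ans = extract_final_number(gen)
        if ans is not None and float(ans) == float(ref):
            correct += 1
    return correct / len(references)
```

Zero-shot and 8-shot runs differ only in the prompt (no exemplars vs. eight worked solutions); the scoring above is the same for both.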
 ## 评测复现(Reproduction)