yangapku committed
Commit 63f5577 · 1 Parent(s): 9b22fd2

update readme

Files changed (1):
  1. README.md +57 -61

README.md CHANGED
@@ -281,25 +281,22 @@ Note: Due to rounding errors caused by hardware and framework, differences in re

 #### C-Eval

- 在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的zero-shot准确率
-
- We demonstrate the zero-shot accuracy of Qwen-1.8B-Chat on C-Eval validation set
-
- | Model | Avg. Acc. |
- |:------------------------:|:---------:|
- | **Qwen-7B-Chat** | 54.2 |
- | InternLM-7B-Chat | 53.2 |
- | **Qwen-1.8B-Chat** | 55.6 |
- | ChatGLM2-6B-Chat | 50.7 |
- | Baichuan-13B-Chat | 50.4 |
- | Chinese-Alpaca-Plus-13B | 43.3 |
- | Chinese-Alpaca-2-7B | 41.3 |
- | LLaMA2-13B-Chat | 40.6 |
- | LLaMA2-7B-Chat | 31.9 |
- | OpenLLaMA-Chinese-3B | 24.4 |
- | Firefly-Bloom-1B4 | 23.6 |
- | OpenBuddy-3B | 23.5 |
- | RedPajama-INCITE-Chat-3B | 18.3 |

 C-Eval测试集上,Qwen-1.8B-Chat模型的zero-shot准确率结果如下:

@@ -307,35 +304,35 @@ The zero-shot accuracy of Qwen-1.8B-Chat on C-Eval testing set is provided below

 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
 | :---------------------: | :------: | :--: | :-------------: | :--------: | :----: |
- | **Qwen-7B-Chat** | 54.6 | 47.8 | 67.6 | 59.3 | 50.6 |
- | Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
- | ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
- | **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |

 ### 英文评测(English Evaluation)

 #### MMLU

- [MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的zero-shot准确率如下,效果同样在同类对齐模型中同样表现较优。

- The zero-shot accuracy of Qwen-1.8B-Chat on MMLU is provided below.
 The performance of Qwen-1.8B-Chat is still among the top of human-aligned models of comparable size.

- | Model | Avg. Acc. |
- |:------------------------:|:---------:|
- | **Qwen-7B-Chat** | 53.9 |
- | ChatGLM2-12B-Chat | 52.1 |
- | Baichuan-13B-Chat | 52.1 |
- | InternLM-7B-Chat | 50.8 |
- | LLaMA2-7B-Chat | 47.0 |
- | ChatGLM2-6B-Chat | 45.5 |
- | **Qwen-1.8B-Chat** | 43.3 |
- | OpenLLaMA-Chinese-3B | 25.7 |
- | OpenBuddy-3B | 25.5 |
- | RedPajama-INCITE-Chat-3B | 25.5 |
- | Firefly-Bloom-1B4 | 23.8 |

 ### 代码评测(Coding Evaluation)

@@ -345,16 +342,16 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/

 | Model | Pass@1 |
 |:------------------------:|:------:|
- | **Qwen-7B-Chat** | 24.4 |
- | LLaMA2-13B-Chat | 18.9 |
- | Baichuan-13B-Chat | 16.5 |
- | InternLM-7B-Chat | 14.0 |
- | LLaMA2-7B-Chat | 12.2 |
- | **Qwen-1.8B-Chat** | 26.2 |
- | OpenBuddy-3B | 10.4 |
- | RedPajama-INCITE-Chat-3B | 6.1 |
- | OpenLLaMA-Chinese-3B | 4.9 |
 | Firefly-Bloom-1B4 | 0.6 |

 ### 数学评测(Mathematics Evaluation)

@@ -362,20 +359,19 @@ The zero-shot Pass@1 of Qwen-1.8B-Chat on [HumanEval](https://github.com/openai/

 The accuracy of Qwen-1.8B-Chat on GSM8K is shown below.

- | Model | Zero-shot Acc. | 4-shot Acc. |
- |:------------------------:|:--------------:|:-----------:|
- | **Qwen-7B-Chat** | 41.1 | 43.5 |
- | ChatGLM2-12B-Chat | - | 38.1 |
- | Baichuan-13B-Chat | - | 36.3 |
- | InternLM-7B-Chat | 32.6 | 34.5 |
- | LLaMA2-13B-Chat | 29.4 | 36.7 |
- | **Qwen-1.8B-Chat** | 33.7 | 30.2 |
- | LLaMA2-7B-Chat | 20.4 | 28.2 |
- | ChatGLM2-6B-Chat | - | 28.0 |
- | OpenBuddy-3B | 10.6 | 12.6 |
- | OpenLLaMA-Chinese-3B | 2.6 | 3.0 |
- | RedPajama-INCITE-Chat-3B | 2.5 | 2.5 |
- | Firefly-Bloom-1B4 | 2.4 | 1.8 |

 ## 评测复现(Reproduction)
 
 #### C-Eval

+ 在[C-Eval](https://arxiv.org/abs/2305.08322)验证集上,我们评价了Qwen-1.8B-Chat模型的准确率
+
+ We report the accuracy of Qwen-1.8B-Chat on the C-Eval validation set.
+
+ | Model | Acc. |
+ |:--------------------------------:|:---------:|
+ | RedPajama-INCITE-Chat-3B | 18.3 |
+ | OpenBuddy-3B | 23.5 |
+ | Firefly-Bloom-1B4 | 23.6 |
+ | OpenLLaMA-Chinese-3B | 24.4 |
+ | LLaMA2-7B-Chat | 31.9 |
+ | ChatGLM2-6B-Chat | 52.6 |
+ | InternLM-7B-Chat | 53.6 |
+ | **Qwen-1.8B-Chat (0-shot)** | 55.6 |
+ | **Qwen-7B-Chat (0-shot)** | 59.7 |
+ | **Qwen-7B-Chat (5-shot)** | 59.3 |

 C-Eval测试集上,Qwen-1.8B-Chat模型的zero-shot准确率结果如下:

 | Model | Avg. | STEM | Social Sciences | Humanities | Others |
 | :---------------------: | :------: | :--: | :-------------: | :--------: | :----: |
 | Chinese-Alpaca-Plus-13B | 41.5 | 36.6 | 49.7 | 43.1 | 41.2 |
 | Chinese-Alpaca-2-7B | 40.3 | - | - | - | - |
+ | ChatGLM2-6B-Chat | 50.1 | 46.4 | 60.4 | 50.6 | 46.9 |
+ | Baichuan-13B-Chat | 51.5 | 43.7 | 64.6 | 56.2 | 49.2 |
+ | **Qwen-1.8B-Chat** | 53.8 | 48.4 | 68.0 | 56.5 | 48.3 |
+ | **Qwen-7B-Chat** | 58.6 | 53.3 | 72.1 | 62.8 | 52.0 |

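The (0-shot) and (5-shot) labels above refer to how many solved exemplar questions are prepended to the prompt. A minimal sketch of how such k-shot multiple-choice prompts are typically built (illustrative only; the function names and prompt template are assumptions, not the official eval harness):

```python
# Illustrative k-shot prompt construction for multiple-choice benchmarks
# such as C-Eval/MMLU. The exact template is an assumption; official
# evaluation scripts may format prompts differently.
def format_example(question, choices, answer=""):
    """Render one question with lettered choices; leave the answer blank
    for the question the model must complete."""
    lines = [question]
    lines += [f"{label}. {text}" for label, text in sorted(choices.items())]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)

def build_prompt(exemplars, question, choices):
    """k-shot prompt: k solved exemplars, then the unanswered question.
    An empty exemplar list gives the zero-shot setting."""
    parts = [format_example(q, c, a) for q, c, a in exemplars]
    parts.append(format_example(question, choices))
    return "\n\n".join(parts)
```

Accuracy is then the fraction of questions whose predicted choice letter matches the reference answer.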
 ### 英文评测(English Evaluation)

 #### MMLU

+ [MMLU](https://arxiv.org/abs/2009.03300)评测集上,Qwen-1.8B-Chat模型的准确率如下,效果在同类对齐模型中同样表现较优。

+ The accuracy of Qwen-1.8B-Chat on MMLU is provided below.
 The performance of Qwen-1.8B-Chat is still among the top of human-aligned models of comparable size.

+ | Model | Acc. |
+ |:--------------------------------:|:---------:|
+ | Firefly-Bloom-1B4 | 23.8 |
+ | OpenBuddy-3B | 25.5 |
+ | RedPajama-INCITE-Chat-3B | 25.5 |
+ | OpenLLaMA-Chinese-3B | 25.7 |
+ | ChatGLM2-6B-Chat | 46.0 |
+ | LLaMA2-7B-Chat | 46.2 |
+ | InternLM-7B-Chat | 51.1 |
+ | Baichuan2-7B-Chat | 52.9 |
+ | **Qwen-1.8B-Chat (0-shot)** | 43.3 |
+ | **Qwen-7B-Chat (0-shot)** | 55.8 |
+ | **Qwen-7B-Chat (5-shot)** | 57.0 |

 ### 代码评测(Coding Evaluation)

 | Model | Pass@1 |
 |:------------------------:|:------:|
 | Firefly-Bloom-1B4 | 0.6 |
+ | OpenLLaMA-Chinese-3B | 4.9 |
+ | RedPajama-INCITE-Chat-3B | 6.1 |
+ | OpenBuddy-3B | 10.4 |
+ | ChatGLM2-6B-Chat | 11.0 |
+ | LLaMA2-7B-Chat | 12.2 |
+ | Baichuan2-7B-Chat | 13.4 |
+ | InternLM-7B-Chat | 14.6 |
+ | **Qwen-1.8B-Chat** | 26.2 |
+ | **Qwen-7B-Chat** | 37.2 |

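Pass@1 here is the fraction of HumanEval problems whose generated completion passes the unit tests. For reference, the general unbiased pass@k estimator from the HumanEval paper (Chen et al., 2021) can be sketched as follows; with a single completion per problem, pass@1 reduces to a plain pass rate.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator from the HumanEval paper:
    n = samples generated per problem, c = samples passing the tests,
    k = evaluation budget. Returns the estimated probability that at
    least one of k sampled completions passes."""
    if n - c < k:
        return 1.0  # too few failing samples to fill a k-sample draw
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, `pass_at_k(10, 3, 1)` is 0.3: with 3 of 10 samples passing, a single draw passes 30% of the time.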
356
  ### 数学评测(Mathematics Evaluation)
357
 
 
359
 
360
  The accuracy of Qwen-1.8B-Chat on GSM8K is shown below
361
 
362
+ | Model | Acc. |
363
+ |:------------------------------------:|:--------:|
364
+ | Firefly-Bloom-1B4 | 2.4 |
365
+ | RedPajama-INCITE-Chat-3B | 2.5 |
366
+ | OpenLLaMA-Chinese-3B | 3.0 |
367
+ | OpenBuddy-3B | 12.6 |
368
+ | LLaMA2-7B-Chat | 26.3 |
369
+ | ChatGLM2-6B-Chat | 28.8 |
370
+ | Baichuan2-7B-Chat | 32.8 |
371
+ | InternLM-7B-Chat | 33.0 |
372
+ | **Qwen-1.8B-Chat (0-shot)** | 33.7 |
373
+ | **Qwen-7B-Chat (0-shot)** | 50.3 |
374
+ | **Qwen-7B-Chat (8-shot)** | 54.1 |
 
375
 
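GSM8K accuracy is scored by exact match on the final numeric answer. A common scoring heuristic (an assumption for illustration, not necessarily the script behind these numbers) extracts the last number from the model's generation and compares it numerically to the reference:

```python
import re

# Heuristic GSM8K scoring: the model's final answer is taken to be the
# last number appearing in its generation. This is a common convention,
# assumed here for illustration.
_NUM = re.compile(r"-?\d+(?:\.\d+)?")

def extract_final_number(text):
    """Return the last number in `text` as a string, or None if absent.
    Thousands separators like '1,234' are normalized first."""
    matches = _NUM.findall(text.replace(",", ""))
    return matches[-1] if matches else None

def gsm8k_accuracy(generations, references):
    """Fraction of generations whose final number equals the reference."""
    correct = 0
    for gen, ref in zip(generations, references):
        ans = extract_final_number(gen)
        if ans is not None and float(ans) == float(ref):
            correct += 1
    return correct / len(references)
```

Zero-shot and 8-shot runs differ only in the prompt (no exemplars vs. eight worked solutions); the scoring above is the same for both.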
 ## 评测复现(Reproduction)