underspirit committed
Commit
d69da39
1 Parent(s): 5605492

Update README.md

Files changed (1): README.md (+25 −25)
README.md CHANGED
@@ -18,7 +18,7 @@ inference: false
 **XVERSE-MoE-A4.2B** is a multilingual large language model (LLM) independently developed by Shenzhen Yuanxiang Technology. It uses a Mixture-of-Experts (MoE) architecture with 25.8 billion total parameters, of which 4.2 billion are actually activated. The model released here is the base model **XVERSE-MoE-A4.2B**. Its main features are:
 
 - **Model architecture**: XVERSE-MoE-A4.2B uses a decoder-only Transformer architecture and extends the dense model's FFN layers into expert layers. Unlike traditional MoE, where each expert is the same size as a standard FFN (e.g. Mixtral 8x7B), it uses finer-grained experts, each 1/4 the size of a standard FFN, and defines two kinds of experts, shared (Shared Expert) and non-shared (Non-shared Expert): shared experts are always activated during computation, while non-shared experts are selectively activated through a router.
-- **Training data**: the model was fully trained on 3.2 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Finely tuned sampling ratios for the different data types yield excellent performance in Chinese and English while also accounting for other languages; training samples are 8K tokens long.
+- **Training data**: the model was fully trained on 2.7 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Finely tuned sampling ratios for the different data types yield excellent performance in Chinese and English while also accounting for other languages; training samples are 8K tokens long.
 - **Training framework**: the expert-routing and weight-computation logic unique to MoE models was deeply customized and optimized into a set of efficient fused operators that improve computational efficiency. To address the high memory footprint and communication volume of MoE models, computation, communication, and CPU offloading are overlapped, raising overall throughput.
 
 The model size, architecture, and learning rate of **XVERSE-MoE-A4.2B** are as follows:
@@ -45,18 +45,18 @@ The models sizes, architectures and learning rate of **XVERSE-MoE-A4.2B** are sh
 
 To comprehensively evaluate the model's performance, we ran thorough tests on a series of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover the model's abilities in multiple areas, specifically Chinese question answering, English question answering, language comprehension, commonsense question answering, logical reasoning, mathematical problem solving, and coding. The results are as follows:
 
-| Dataset                  | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
-| ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
-| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
-| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
-| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
-| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
-| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
-| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
-| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
-| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
-| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
-| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
+| Dataset                  | XVERSE-MoE-A4.2B | XVERSE-13B-2 | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B | DeepSeek-MoE-16B |
+| ------------------------ | :--------------: | :----------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: | :--------------: |
+| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 | 40.6 |
+| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 | 42.5 |
+| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 | 29.1 |
+| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 | 45.0 |
+| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 | 31.7 |
+| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 | 61.9 |
+| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 | 54.8 |
+| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 | 80.2 |
+| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 | 18.8 |
+| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 | 26.8 |
 
 > <sup>1: Only single-answer multiple-choice questions were tested, i.e. fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions were excluded.</sup>
 
@@ -67,18 +67,18 @@ The models sizes, architectures and learning rate of **XVERSE-MoE-A4.2B** are sh
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, common sense questioning, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:
 
-| Dataset                  | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
-| ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
-| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
-| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
-| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
-| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
-| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
-| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
-| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
-| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
-| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
-| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
+| Dataset                  | XVERSE-MoE-A4.2B | XVERSE-13B-2 | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B | DeepSeek-MoE-16B |
+| ------------------------ | :--------------: | :----------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: | :--------------: |
+| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 | 40.6 |
+| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 | 42.5 |
+| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 | 29.1 |
+| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 | 45.0 |
+| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 | 31.7 |
+| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 | 61.9 |
+| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 | 54.8 |
+| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 | 80.2 |
+| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 | 18.8 |
+| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 | 26.8 |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
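The shared/non-shared expert scheme described in the model-architecture bullet can be sketched in a few lines. The following is a minimal NumPy illustration, not XVERSE's actual implementation: every dimension, the number of experts, and the top-k value are invented for the example. It shows the two structural points the README makes: each expert's hidden size is 1/4 of the standard FFN size, shared experts run on every token, and a softmax router selects which non-shared experts are activated per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
HIDDEN, FFN, N_SHARED, N_ROUTED, TOP_K = 8, 32, 2, 6, 2
EXPERT_DIM = FFN // 4  # each expert is 1/4 of a standard FFN


def make_expert():
    """A tiny two-layer FFN expert (random weights, ReLU activation)."""
    w1 = rng.standard_normal((HIDDEN, EXPERT_DIM)) * 0.1
    w2 = rng.standard_normal((EXPERT_DIM, HIDDEN)) * 0.1
    return w1, w2


def expert_fwd(x, expert):
    w1, w2 = expert
    return np.maximum(x @ w1, 0.0) @ w2


shared = [make_expert() for _ in range(N_SHARED)]   # always active
routed = [make_expert() for _ in range(N_ROUTED)]   # router-selected
w_router = rng.standard_normal((HIDDEN, N_ROUTED)) * 0.1


def moe_layer(x):
    """x: (tokens, HIDDEN) -> (tokens, HIDDEN)."""
    # Shared experts contribute for every token unconditionally.
    out = sum(expert_fwd(x, e) for e in shared)

    # Router: softmax over non-shared experts, keep the top-k per token.
    logits = x @ w_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(probs, axis=-1)[:, -TOP_K:]

    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()  # renormalize over the selected experts
        for g, idx in zip(gates, topk[t]):
            out[t] += g * expert_fwd(x[t : t + 1], routed[idx])[0]
    return out


tokens = rng.standard_normal((4, HIDDEN))
y = moe_layer(tokens)
print(y.shape)  # → (4, 8)
```

Per token this activates all `N_SHARED` shared experts plus `TOP_K` of the `N_ROUTED` non-shared experts, which is how a fine-grained MoE keeps the activated parameter count (4.2B here) far below the total (25.8B).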