underspirit committed
Commit
d69da39
1 Parent(s): 5605492

Update README.md

Files changed (1): README.md (+25 −25)
README.md CHANGED
@@ -18,7 +18,7 @@ inference: false
 **XVERSE-MoE-A4.2B** is a multilingual large language model (LLM) independently developed by Shenzhen Yuanxiang Technology. It uses a Mixture-of-Experts (MoE) architecture with 25.8 billion total parameters, of which 4.2 billion are actually activated. The model released here is the base model **XVERSE-MoE-A4.2B**. Its main features are:
 
 - **Model architecture**: XVERSE-MoE-A4.2B uses a decoder-only Transformer architecture and extends the dense model's FFN layers into expert layers. Unlike traditional MoE, where each expert is the same size as a standard FFN (e.g. Mixtral 8x7B), it uses finer-grained experts, each 1/4 the size of a standard FFN, and defines two kinds of experts, shared (Shared Expert) and non-shared (Non-shared Expert): shared experts are always activated during computation, while non-shared experts are selectively activated through a router.
-- **Training data**: the model was fully trained on 3.2 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Finely tuned sampling ratios for the different data types yield excellent performance in Chinese and English while also accounting for other languages; training samples are 8K tokens long.
+- **Training data**: the model was fully trained on 2.7 trillion tokens of high-quality, diverse data covering more than 40 languages, including Chinese, English, Russian, and Spanish. Finely tuned sampling ratios for the different data types yield excellent performance in Chinese and English while also accounting for other languages; training samples are 8K tokens long.
 - **Training framework**: the expert-routing and weight-computation logic unique to MoE models was deeply customized and optimized into a set of efficient fused operators that improve computational efficiency. To address the high memory footprint and communication volume of MoE models, computation, communication, and CPU offloading are overlapped, raising overall throughput.
 
 The model size, architecture, and learning rate of **XVERSE-MoE-A4.2B** are as follows:
@@ -45,18 +45,18 @@ The models sizes, architectures and learning rate of **XVERSE-MoE-A4.2B** are sh
 
 To comprehensively evaluate the model's performance, we ran thorough tests on a series of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K, and HumanEval. These evaluations cover the model's abilities in multiple areas, specifically Chinese question answering, English question answering, language comprehension, commonsense question answering, logical reasoning, mathematical problem solving, and coding. The results are as follows:
 
-| Dataset                  | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
-| ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
-| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
-| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
-| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
-| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
-| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
-| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
-| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
-| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
-| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
-| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
+| Dataset                  | XVERSE-MoE-A4.2B | XVERSE-13B-2 | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B | DeepSeek-MoE-16B |
+| ------------------------ | :--------------: | :----------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: | :--------------: |
+| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 | 40.6 |
+| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 | 42.5 |
+| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 | 29.1 |
+| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 | 45.0 |
+| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 | 31.7 |
+| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 | 61.9 |
+| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 | 54.8 |
+| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 | 80.2 |
+| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 | 18.8 |
+| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 | 26.8 |
 
 > <sup>1: Only single-answer multiple-choice questions were tested, i.e. fill-in-the-blank, open-ended, and multiple-answer multiple-choice questions were excluded.</sup>
 
@@ -67,18 +67,18 @@ The models sizes, architectures and learning rate of **XVERSE-MoE-A4.2B** are sh
 
 To comprehensively assess the performance of the model, we conducted extensive testing across a range of standard datasets, including C-Eval, CMMLU, Gaokao-Bench, MMLU, AGIEval, RACE-M, CommonSenseQA, PIQA, GSM8K and HumanEval. These evaluations spanned multiple capabilities of the model, specifically including Chinese question answering, English question answering, language comprehension, common sense questioning, logical reasoning, mathematical problem-solving, and coding ability. The results of the evaluations are as follows:
 
-| Dataset                  | XVERSE-MoE-A4.2B-2.7T | XVERSE-13B-2-2.7T | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B |
-| ------------------------ | :-------------------: | :---------------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: |
-| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 |
-| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 |
-| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 |
-| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 |
-| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 |
-| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 |
-| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 |
-| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 |
-| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 |
-| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 |
+| Dataset                  | XVERSE-MoE-A4.2B | XVERSE-13B-2 | Baichuan2-13B | Llama2-13B | Llama1-65B | XVERSE-7B | DeepSeek-7B | Mistral-7B | Gemma-7B | DeepSeek-MoE-16B |
+| ------------------------ | :--------------: | :----------: | :-----------: | :--------: | :--------: | :-------: | :---------: | :--------: | :------: | :--------------: |
+| C-Eval                   | 60.5 | 62.0 | 58.1 | 35.6 | 38.8 | 57.1 | 45.0 | 45.1 | 50.0 | 40.6 |
+| CMMLU                    | 64.5 | 65.4 | 62.0 | 38.4 | 40.6 | 61.3 | 47.2 | 44.9 | 50.5 | 42.5 |
+| Gaokao-Bench<sup>1</sup> | 60.3 | 65.3 | 54.3 | 35.4 | 38.9 | 61.7 | 35.4 | 40.2 | 42.3 | 29.1 |
+| MMLU                     | 60.2 | 60.0 | 59.2 | 54.8 | 63.4 | 56.6 | 48.2 | 62.5 | 64.3 | 45.0 |
+| AGIEval<sup>1</sup>      | 48.0 | 52.4 | 48.2 | 33.4 | 42.4 | 46.9 | 26.4 | 41.2 | 41.7 | 31.7 |
+| RACE-M                   | 75.4 | 82.4 | 68.9 | 63.0 | 67.9 | 79.0 | 63.2 | 67.5 | 80.2 | 61.9 |
+| CommonSenseQA            | 70.0 | 68.0 | 65.6 | 67.3 | 74.0 | 64.1 | 56.4 | 68.8 | 74.0 | 54.8 |
+| PIQA                     | 81.4 | 79.8 | 78.5 | 80.5 | 82.8 | 76.7 | 79.2 | 82.2 | 81.2 | 80.2 |
+| GSM8K                    | 51.2 | 52.7 | 52.7 | 28.7 | 50.9 | 19.3 | 17.4 | 35.4 | 46.4 | 18.8 |
+| HumanEval                | 29.9 | 32.3 | 17.1 | 18.3 | 23.7 | 10.4 | 26.2 | 26.2 | 32.3 | 26.8 |
 
 > <sup>1: Tests are conducted only on single-answer multiple-choice questions, thus excluding fill-in-the-blanks, open-ended questions, and multiple-answer multiple-choice questions.</sup>
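The shared/non-shared expert scheme described in the model-architecture bullet can be sketched in a few lines. The following is a minimal NumPy illustration, not XVERSE's actual implementation: every dimension, the number of experts, and the top-k value are invented for the example. It shows the two structural points the README makes: each expert's hidden size is 1/4 of the standard FFN size, shared experts run on every token, and a softmax router selects which non-shared experts are activated per token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only.
HIDDEN, FFN, N_SHARED, N_ROUTED, TOP_K = 8, 32, 2, 6, 2
EXPERT_DIM = FFN // 4  # each expert is 1/4 of a standard FFN


def make_expert():
    """A tiny two-layer FFN expert (random weights, ReLU activation)."""
    w1 = rng.standard_normal((HIDDEN, EXPERT_DIM)) * 0.1
    w2 = rng.standard_normal((EXPERT_DIM, HIDDEN)) * 0.1
    return w1, w2


def expert_fwd(x, expert):
    w1, w2 = expert
    return np.maximum(x @ w1, 0.0) @ w2


shared = [make_expert() for _ in range(N_SHARED)]   # always active
routed = [make_expert() for _ in range(N_ROUTED)]   # router-selected
w_router = rng.standard_normal((HIDDEN, N_ROUTED)) * 0.1


def moe_layer(x):
    """x: (tokens, HIDDEN) -> (tokens, HIDDEN)."""
    # Shared experts contribute for every token unconditionally.
    out = sum(expert_fwd(x, e) for e in shared)

    # Router: softmax over non-shared experts, keep the top-k per token.
    logits = x @ w_router
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    topk = np.argsort(probs, axis=-1)[:, -TOP_K:]

    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()  # renormalize over the selected experts
        for g, idx in zip(gates, topk[t]):
            out[t] += g * expert_fwd(x[t : t + 1], routed[idx])[0]
    return out


tokens = rng.standard_normal((4, HIDDEN))
y = moe_layer(tokens)
print(y.shape)  # → (4, 8)
```

Per token this activates all `N_SHARED` shared experts plus `TOP_K` of the `N_ROUTED` non-shared experts, which is how a fine-grained MoE keeps the activated parameter count (4.2B here) far below the total (25.8B).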