xverse
/

XVERSE-65B

Text Generation

Transformers

PyTorch

xverse

custom_code

Model card Files Files and versions Community

underspirit commited on Nov 6, 2023

Commit

7f1b739

1 Parent(s): f2b711f

Update README.md

Browse files

Files changed (1) hide show

README.md +2 -2

README.md CHANGED Viewed

@@ -14,7 +14,7 @@ inference: false
 - **模型结构**：XVERSE-65B 使用主流 Decoder-only 的标准 Transformer 网络结构，支持 16K 的上下文长度（Context Length），能满足更长的多轮对话、知识问答与摘要等需求，模型应用场景更广泛。
 - **训练数据**：构建了 2.6 万亿 token 的高质量、多样化的数据对模型进行充分训练，包含中、英、俄、西等 40 多种语言，通过精细化设置不同类型数据的采样比例，使得中英两种语言表现优异，也能兼顾其他语言效果。
 - **分词**：基于 BPE（Byte-Pair Encoding）算法，使用上百 GB 语料训练了一个词表大小为 100,534 的分词器，能够同时支持多语言，而无需额外扩展词表。
-- **训练框架**：自主研发多项关键技术，包括高效算子、显存优化、并行调度策略、数据-计算-通信重叠、平台和框架协同等，让训练效率更高，模型稳定性强，在千卡集群上的峰值算力利用率位居业界前列。
 ## Model Introduction
@@ -23,7 +23,7 @@ inference: false
 - **Model Structure**: XVERSE-65B uses the mainstream Decoder-only Transformer network structure, supports 16k context length, which can meet the need of longer multi-round dialogues, knowledge question-answering, and summarization. This makes the model more versatile in application scenarios.
 - **Training Data**: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 2.6 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages.
 - **Tokenization**: Based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary size of 100,534 has been trained using hundreds of gigabytes of language data. This tokenizer is capable of supporting multilingual without the need for additional vocabulary expansion.
-- **Training Framework**: Several key technologies have also been independently developed, including efficient operators, memory optimization, parallel scheduling strategies, overlap of data-computation-communication, and synergy between platforms and frameworks. These advancements enhance training efficiency and model stability. With these technologies, the peak computational power utilization rate on a thousand-card cluster ranks at the forefront of the industry.
 ## 评测结果

 - **模型结构**：XVERSE-65B 使用主流 Decoder-only 的标准 Transformer 网络结构，支持 16K 的上下文长度（Context Length），能满足更长的多轮对话、知识问答与摘要等需求，模型应用场景更广泛。
 - **训练数据**：构建了 2.6 万亿 token 的高质量、多样化的数据对模型进行充分训练，包含中、英、俄、西等 40 多种语言，通过精细化设置不同类型数据的采样比例，使得中英两种语言表现优异，也能兼顾其他语言效果。
 - **分词**：基于 BPE（Byte-Pair Encoding）算法，使用上百 GB 语料训练了一个词表大小为 100,534 的分词器，能够同时支持多语言，而无需额外扩展词表。
+- **训练框架**：训练中采用 FlashAttention2 加速计算，3D 并行基础上采用虚拟流水线（virtual pipeline）技术，降低较长流水线和 16k 上下文窗口产生的过高气泡率，在千卡集群的峰值算力利用率达到业界前列。同时通过集群基础设施运营、资源调度、训练框架和调度平台协同等持续优化，打造出高稳定、低中断、强容错的训练系统，将每周有效训练率提升至 98.6%。
 ## Model Introduction
 - **Model Structure**: XVERSE-65B uses the mainstream Decoder-only Transformer network structure, supports 16k context length, which can meet the need of longer multi-round dialogues, knowledge question-answering, and summarization. This makes the model more versatile in application scenarios.
 - **Training Data**: The model has been thoroughly trained on a diversified and high-quality dataset consisting of 2.6 trillion of tokens, including more than 40 languages such as Chinese, English, Russian, and Spanish. The sampling ratio of different types of data is finely set, which makes the performance of Chinese and English excellent, and also takes into account the effect of other languages.
 - **Tokenization**: Based on the BPE (Byte-Pair Encoding) algorithm, a tokenizer with a vocabulary size of 100,534 has been trained using hundreds of gigabytes of language data. This tokenizer is capable of supporting multilingual without the need for additional vocabulary expansion.
+- **Training Framework**: The training utilizes FlashAttention2 for accelerated computation, and on top of 3D parallelism, virtual pipeline technology is applied to reduce the excessive bubble rate caused by longer pipelines and 16k context windows. This achieves a peak computational efficiency within the industry-leading range in the petaflop-scale cluster. Concurrently, through continuous optimization of cluster infrastructure operations, resource scheduling, training frameworks, and the scheduling platform, a highly stable, low-interruption, and robust fault-tolerant training system has been developed, enhancing the effective weekly training rate to 98.6%.
 ## 评测结果