xverse/XVERSE-13B-256K


pom committed on Jan 15, 2024

Commit 94d4537 (1 parent: 972bed9)

update


README.md +12 -11


@@ -8,7 +8,7 @@ inference: false
 # XVERSE-13B-256K
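 ## Update Information
-**[2024/01/16]** Released the long-sequence chat model **XVERSE-13B-256K**, which supports a window length of up to 256K, roughly 250,000 characters of input, and can assist with tasks such as literature summarization and report analysis.
 **[2023/11/06]** Released new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-Chat-2** chat model. Compared with the original versions, the new models are trained more thoroughly (from 1.4T to 3.2T), improve substantially across all capabilities, and add tool-calling ability.
 **[2023/09/26]** Released the 7B-size [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) chat model, which can be deployed and run on a single consumer-grade GPU while remaining high-performance, fully open source, and free for commercial use.
 **[2023/08/22]** Released the instruction-tuned XVERSE-13B-Chat chat model.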
@@ -23,30 +23,30 @@ inference: false
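 ## Model Introduction
-**XVERSE-13B-256K** is the version of the [**XVERSE-13B**](https://huggingface.co/xverse/XVERSE-13B) model obtained through ABF plus continued pre-training, followed by NTK plus SFT fine-tuning.
 **XVERSE-13B-256K** is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology. The main techniques applied are as follows:
-- **ABF**: ABF stands for Adjusted Base Frequency, which changes the frequency of the RoPE (Rotary Position Embedding) positional encoding from 10000 to 500000. This seemingly small change substantially slows the decay of attention to earlier parts of the sequence, so later positions can better draw on information from the whole sequence.
-- **Continued pre-training**: Starting from XVERSE-13B, continued pre-training on 32K long sequences is performed using 20% of the pre-training data. Continuing pre-training with a small amount of long-sequence data, rather than pre-training on long sequences from scratch, greatly reduces the amount of pre-training required.
-- **NTK**: NTK stands for Neural Tangent Kernel, a tool for understanding and analyzing the behavior of deep neural networks. RoPE with NTK can dynamically interpolate RoPE's frequencies: resolution is preserved (high frequencies) while the frequency domain is scaled (low frequencies), which achieves interpolation in position space.
-- **SFT data**: We built our own long-sequence data covering single-document QA, multi-document QA, summarization, code completion, and other types, with sequence lengths ranging from 32K to 256K.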
 ## Model Introduction
-**XVERSE-13B-256K** is the long-sequence version of model [**XVERSE-13B**](https://huggingface.co/xverse/XVERSE-13B),
 updated by **Continual-Pre-Training** based on **ABF** and **supervised fine-tuning** based on **NTK**.
 **XVERSE-13B-256K** is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. Below are the main practical techniques:
 - **ABF**: Adjusted Base Frequency changes the base frequency of Rotary Position Embedding (RoPE) from 10,000 to 500,000.
-- **Continual-Pre-Training**: Based on XVERSE-13B, 32K long sequence continuation pre-training is conducted using 20% of the pre-training data. This approach significantly reduces the training volume for pre-training by utilizing a small amount of long sequence data for continuation pre-training instead of starting from scratch with long sequence pre-training.
 - **NTK**: Neural Tangent Kernel is a tool used for understanding and analyzing the behavior of deep neural networks. RoPE, employing NTK, enables dynamic interpolation of its frequencies. This involves scaling in the frequency domain while maintaining resolution, thereby achieving spatial interpolation in the positional domain.
 - **Data for SFT**: We autonomously construct a diverse range of long sequence data, encompassing single-document question-answering (QA), multi-document QA, summarization, code completion, and other types. The sequence lengths vary from 32K to 256K.
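 ## Evaluation Results
-To verify long-sequence performance, we use the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (Chinese and English) benchmark for evaluating the long-text understanding of large language models. It consists of 21 tasks in six categories, covering key long-text scenarios: single-document QA, multi-document QA, summarization, few-shot tasks, synthetic tasks, and code completion. LongBench contains 14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length between 5k and 15k, and there are 4750 test examples in total. The evaluation results are as follows:
 |  Capability  |  Dataset |  XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |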
@@ -90,8 +90,9 @@ For all the comparison models mentioned above, we prioritize the disclosure of t
 Environment Setup:
 pip install -r requirements.txt
 The XVERSE-13B-256K model can be loaded for chat with the following code:
@@ -115,7 +116,7 @@ response = model.chat(tokenizer, history)
 print(response)
 ```
-For more details, including the chat demo, model fine-tuning, and quantization, please refer to our [Github](https://github.com/xverse-ai/XVERSE-13B).
 For more details, including chat demo, model fine-tuning and quantization, please refer to our [Github](https://github.com/xverse-ai/XVERSE-13B).

 # XVERSE-13B-256K
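 ## Update Information
+**[2024/01/16]** Released the long-sequence chat model **XVERSE-13B-256K**, which supports a context window length of up to 256K, roughly 250,000 characters of input, and can assist with tasks such as literature summarization and report analysis.
 **[2023/11/06]** Released new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-Chat-2** chat model. Compared with the original versions, the new models are trained more thoroughly (from 1.4T to 3.2T), improve substantially across all capabilities, and add tool-calling ability.
 **[2023/09/26]** Released the 7B-size [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) chat model, which can be deployed and run on a single consumer-grade GPU while remaining high-performance, fully open source, and free for commercial use.
 **[2023/08/22]** Released the instruction-tuned XVERSE-13B-Chat chat model.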
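 ## Model Introduction
+**XVERSE-13B-256K** is the version of the [**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B) model obtained through ABF plus continued pre-training, followed by NTK plus SFT fine-tuning.
 **XVERSE-13B-256K** is a multilingual large language model independently developed by Shenzhen Yuanxiang Technology. The main techniques applied are as follows:
+- **ABF**: ABF stands for Adjusted Base Frequency, which changes the frequency of the RoPE (Rotary Position Embedding) positional encoding from 10000 to 500000. This seemingly small change substantially slows the decay of attention to earlier parts of the sequence, so later positions can better draw on information from the whole sequence.
+- **Continued pre-training**: Starting from XVERSE-13B-2, continued pre-training on 32K long sequences is performed using 20% of the pre-training data. Continuing pre-training with a small amount of long-sequence data, rather than pre-training on long sequences from scratch, greatly reduces the amount of pre-training required.
+- **NTK**: NTK stands for Neural Tangent Kernel, a tool for understanding and analyzing the behavior of deep neural networks. RoPE with NTK can dynamically interpolate RoPE's frequencies: resolution is preserved (high frequencies) while the frequency domain is scaled (low frequencies), which achieves interpolation in position space.
+- **SFT data**: We built our own long-sequence data covering single-document QA, multi-document QA, summarization, code completion, and other types, with sequence lengths ranging from 32K to 256K.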
 ## Model Introduction
+**XVERSE-13B-256K** is the long-sequence version of model [**XVERSE-13B-2**](https://huggingface.co/xverse/XVERSE-13B),
 updated by **Continual-Pre-Training** based on **ABF** and **supervised fine-tuning** based on **NTK**.
 **XVERSE-13B-256K** is a multilingual large language model, independently developed by Shenzhen Yuanxiang Technology. Below are the main practical techniques:
 - **ABF**: Adjusted Base Frequency changes the base frequency of Rotary Position Embedding (RoPE) from 10,000 to 500,000.
+- **Continual-Pre-Training**: Based on XVERSE-13B-2, 32K long sequence continuation pre-training is conducted using 20% of the pre-training data. This approach significantly reduces the training volume for pre-training by utilizing a small amount of long sequence data for continuation pre-training instead of starting from scratch with long sequence pre-training.
 - **NTK**: Neural Tangent Kernel is a tool used for understanding and analyzing the behavior of deep neural networks. RoPE, employing NTK, enables dynamic interpolation of its frequencies. This involves scaling in the frequency domain while maintaining resolution, thereby achieving spatial interpolation in the positional domain.
 - **Data for SFT**: We autonomously construct a diverse range of long sequence data, encompassing single-document question-answering (QA), multi-document QA, summarization, code completion, and other types. The sequence lengths vary from 32K to 256K.
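
The ABF and NTK bullets above can be illustrated with a small numerical sketch. It is not taken from the XVERSE code: the head dimension of 128, the scale factor of 8 (256K / 32K), the helper names `rope_inv_freq` and `ntk_scaled_base`, and the use of the commonly cited NTK-aware base-scaling formula are all assumptions made for illustration.

```python
import math


def rope_inv_freq(head_dim, base):
    """Per-pair RoPE rotation frequencies: theta_i = base^(-2i/d), i = 0 .. d/2 - 1."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def ntk_scaled_base(base, scale, head_dim):
    """NTK-aware scaling (assumed formula): enlarge the base so that the lowest
    frequency is divided by `scale` while the highest frequency stays unchanged."""
    return base * scale ** (head_dim / (head_dim - 2))


HEAD_DIM = 128  # hypothetical head size, not stated in the README
SCALE = 8.0     # hypothetical extension factor (256K / 32K)

# ABF: same frequency formula, larger base (10,000 -> 500,000 per the README).
orig = rope_inv_freq(HEAD_DIM, 10_000.0)
abf = rope_inv_freq(HEAD_DIM, 500_000.0)

# The slowest-rotating pair sets the longest wavelength, measured in token positions.
longest_wavelength_orig = 2 * math.pi / orig[-1]
longest_wavelength_abf = 2 * math.pi / abf[-1]

# NTK-aware rescaling applied on top of the ABF base.
scaled = rope_inv_freq(HEAD_DIM, ntk_scaled_base(500_000.0, SCALE, HEAD_DIM))

print("longest wavelength, base 10,000 :", longest_wavelength_orig)
print("longest wavelength, base 500,000:", longest_wavelength_abf)
print("lowest frequency ratio after NTK scaling:", abf[-1] / scaled[-1])
```

Under these assumptions, raising the base leaves the fastest-rotating pair untouched and lengthens the slowest wavelength, and the NTK-aware rescaling divides the lowest frequency by the scale factor while keeping the highest one at 1.
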
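 ## Evaluation Results
+To verify long-sequence performance, we use the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (Chinese and English) benchmark for evaluating the long-text understanding of large language models. It consists of 21 tasks in six categories, covering key long-text scenarios: single-document QA, multi-document QA, summarization, few-shot tasks, synthetic tasks, and code completion. LongBench contains 14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length between 5k and 15k, and there are 4750 test examples in total. The evaluation results are as follows:
 |  Capability  |  Dataset |  XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |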
 Environment Setup:
+```bash
 pip install -r requirements.txt
+```
 The XVERSE-13B-256K model can be loaded for chat with the following code:
 print(response)
 ```
+For more details, including the chat demo, model fine-tuning, and quantization, please refer to our [Github](https://github.com/xverse-ai/XVERSE-13B).
 For more details, including chat demo, model fine-tuning and quantization, please refer to our [Github](https://github.com/xverse-ai/XVERSE-13B).