Commit 698feb4 by pom
1 parent: 4ca17e7

update readme

Files changed (1): README.md (+30 −49)
README.md CHANGED
@@ -8,14 +8,14 @@ inference: false
 # XVERSE-13B-256K

 ## Update Information
-**[2024/01/15]** Released the long-sequence chat model **XVERSE-13B-256K**. This version supports a context window of up to 256K, around 250,000 characters of input, and can assist with tasks such as literature summarization and report analysis.
 **[2023/11/06]** Released new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-Chat-2** chat model. Compared with the originals, the new models are trained more thoroughly (from 1.4T to 3.2T tokens), improve markedly across the board, and add function-call ability.
 **[2023/09/26]** Released the 7B-scale [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) chat model, which can be deployed and run on a single consumer-grade GPU while remaining high-performance, fully open source, and free for commercial use.
 **[2023/08/22]** Released the instruction-finetuned chat model XVERSE-13B-Chat.
 **[2023/08/07]** Released the 13B-scale XVERSE-13B base model.
@@ -30,7 +30,7 @@ inference: false
 - **ABF**: short for Adjusted Base Frequency. The base frequency of the RoPE (Rotary Position Embedding) positional encoding is raised from 10,000 to 500,000. Small as this change looks, it greatly slows the decay of attention over earlier tokens, letting later tokens draw on information from the whole sequence.
 - **Continual pre-training**: starting from XVERSE-13B, 20% of the pre-training data is used for continual pre-training at a 32K sequence length. Continuing from the base model on a modest amount of long-sequence data, rather than pre-training on long sequences from scratch, cuts the required training volume dramatically.
 - **NTK**: short for Neural Tangent Kernel, a tool for understanding and analyzing the behavior of deep neural networks. NTK-aware RoPE interpolates the RoPE frequencies dynamically: resolution is preserved at the high frequencies while the frequency space is scaled at the low ones, which amounts to interpolation in position space.
-- **SFT data**: self-constructed long-sequence data covering single-document QA, multi-document QA, summarization, code completion, and more, with sequence lengths ranging from 32K to 256K.
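The ABF bullet above can be made concrete with a few lines of arithmetic. This is a minimal sketch (not code from this repository), assuming the standard RoPE inverse-frequency formula and a head dimension of 128, which the README does not state:

```python
import math

def rope_inv_freq(base: float, dim: int = 128) -> list[float]:
    """Per-channel inverse frequencies of Rotary Position Embedding."""
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

inv_10k = rope_inv_freq(10_000.0)    # original RoPE base
inv_500k = rope_inv_freq(500_000.0)  # ABF-adjusted base

# Wavelength (in positions) of the slowest channel: how far apart two
# tokens can be before that channel's rotation wraps around.
wavelength_10k = 2 * math.pi / inv_10k[-1]
wavelength_500k = 2 * math.pi / inv_500k[-1]
```

Under these assumptions the slowest channel wraps after roughly 54K positions at base 10,000, but only after roughly 2.6 million positions at base 500,000, which is why the larger base suits a 256K window.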

 ## Model Introduction

@@ -46,68 +46,48 @@ updated by **Continual-Pre-Training** based on **ABF** and **supervised fine-tun

 ## Model Evaluation

-To assess long-sequence performance, we used the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (English-Chinese) benchmark designed to evaluate the long-text comprehension of large language models. It comprises six major categories and twenty-one distinct tasks, covering key long-text scenarios such as single-document QA, multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion. The dataset consists of 14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length of 5k-15k tokens, for a total of 4,750 test instances. The evaluation results are as follows:

-| Capability Dimension | Dataset | XVERSE-13B-256K | GPT-3.5-Turbo-16K | Llama2-7B-Chat-4K | LongChat-7B-16K | Yi-6B-200K |
 | :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
-| single-document QA | narrativeqa | 24.1 | 23.6 | 19.1 | 21.6 | 14.5 |
-| | qasper | 30.2 | 43.3 | 19.6 | 21.6 | 21.6 |
-| | multifieldqa_en | 43.5 | 52.3 | 35.8 | 44.6 | 36.6 |
-| | multifieldqa_zh | 52.6 | 61.2 | 11.6 | 26.6 | 23.0 |
-| multi-document QA | hotpotqa | 58.3 | 51.6 | 24.3 | 22.4 | 48.3 |
-| | 2wikimqa | 49.6 | 37.7 | 31.4 | 16.8 | 39.2 |
-| | musique | 31.4 | 26.9 | 8.6 | 9.1 | 25.3 |
-| | dureader | 28.9 | 28.7 | 1.9 | 19.1 | 14.2 |
-| summarization | gov_report | | 29.5 | 27.3 | 28.4 | 29.5 |
-| | qmsum | 16.8 | 23.4 | 20.6 | 23.2 | 20.4 |
-| | multinews | 21.1 | 26.7 | 25.8 | 26.4 | 26.2 |
-| | vcsum | 11.3 | 16.0 | 0.2 | 14.0 | 8.2 |
-| few-shot | trec | 72.0 | 68.0 | 60.5 | 61.5 | 71.0 |
-| | samsum | 34.6 | 41.7 | 31.4 | 44.8 | 11.3 |
-| | triviaqa | 89.3 | 91.4 | 59.7 | 73.5 | 86.6 |
-| | lsht | 35.0 | 29.2 | 19.8 | 20.8 | 38.0 |
-| synthetic tasks | passage_count | 6.0 | 4.5 | 2.5 | 4.5 | 2.0 |
-| | passage_retrieval_en | 63.0 | 71.0 | 9.2 | 24.0 | 6.0 |
-| | passage_retrieval_zh | 44.0 | 77.5 | 0.5 | 4.8 | 7.9 |
-| code completion | lcc | 55.2 | 54.7 | 52.3 | 59.2 | 64.6 |
-| | repobench-p | | 53.6 | 42.4 | 54.7 | 61.5 |

 For all the comparison models mentioned above, we prioritize their officially published results. Where official results are unavailable, we use numbers obtained from our own evaluation pipeline.

 ### Loading with Transformers

 Requirements:

 transformers==4.31.0
@@ -119,6 +99,7 @@ xformers

 The XVERSE-13B-256K model can be loaded for chat using the following code:

  ```python
 
 # XVERSE-13B-256K

 ## Update Information
+**[2024/01/16]** Released the long-sequence chat model **XVERSE-13B-256K**. This version supports a context window of up to 256K, around 250,000 characters of input, and can assist with tasks such as literature summarization and report analysis.
 **[2023/11/06]** Released new versions of the **XVERSE-13B-2** base model and the **XVERSE-13B-Chat-2** chat model. Compared with the originals, the new models are trained more thoroughly (from 1.4T to 3.2T tokens), improve markedly across the board, and add function-call ability.
 **[2023/09/26]** Released the 7B-scale [XVERSE-7B](https://github.com/xverse-ai/XVERSE-7B) base model and [XVERSE-7B-Chat](https://github.com/xverse-ai/XVERSE-7B) chat model, which can be deployed and run on a single consumer-grade GPU while remaining high-performance, fully open source, and free for commercial use.
 **[2023/08/22]** Released the instruction-finetuned chat model XVERSE-13B-Chat.
 **[2023/08/07]** Released the 13B-scale XVERSE-13B base model.

 - **ABF**: short for Adjusted Base Frequency. The base frequency of the RoPE (Rotary Position Embedding) positional encoding is raised from 10,000 to 500,000. Small as this change looks, it greatly slows the decay of attention over earlier tokens, letting later tokens draw on information from the whole sequence.
 - **Continual pre-training**: starting from XVERSE-13B, 20% of the pre-training data is used for continual pre-training at a 32K sequence length. Continuing from the base model on a modest amount of long-sequence data, rather than pre-training on long sequences from scratch, cuts the required training volume dramatically.
 - **NTK**: short for Neural Tangent Kernel, a tool for understanding and analyzing the behavior of deep neural networks. NTK-aware RoPE interpolates the RoPE frequencies dynamically: resolution is preserved at the high frequencies while the frequency space is scaled at the low ones, which amounts to interpolation in position space.
+- **SFT data**: self-constructed long-sequence data covering single-document QA, multi-document QA, summarization, code completion, and more, with sequence lengths ranging from 32K to 256K.
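The dynamic NTK interpolation described above can be sketched numerically. The scaling rule below is the common dynamic-NTK formulation and is an assumption, as the README does not give an exact formula; the 32K trained length and 500,000 base come from the bullets, while the head dimension of 128 is assumed:

```python
def ntk_inv_freq(seq_len: int, trained_len: int = 32_768,
                 base: float = 500_000.0, dim: int = 128) -> list[float]:
    """RoPE inverse frequencies with dynamic NTK-aware scaling."""
    scale = max(1.0, seq_len / trained_len)
    # Dynamic NTK rescales the base (not the positions), so the adjustment
    # grows with the input length and vanishes inside the trained window.
    adjusted_base = base * scale ** (dim / (dim - 2))
    return [adjusted_base ** (-2 * i / dim) for i in range(dim // 2)]

short = ntk_inv_freq(16_384)    # within the trained window: unchanged
long = ntk_inv_freq(262_144)    # a 256K input: low frequencies stretched
```

Note that channel 0 stays at frequency 1.0 in both cases (resolution preserved at high frequencies), while the later channels rotate more slowly for the long input (interpolation at low frequencies).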

 ## Model Introduction


 ## Model Evaluation

+To assess long-sequence performance, we used the LongBench dataset. [LongBench](https://github.com/THUDM/LongBench) is the first multi-task, bilingual (English-Chinese) benchmark designed to evaluate the long-text comprehension of large language models. It comprises six major categories and twenty-one distinct tasks, covering key long-text scenarios such as single-document QA, multi-document QA, summarization, few-shot tasks, synthetic tasks, and code completion. The dataset consists of 14 English tasks, 5 Chinese tasks, and 2 code tasks; most tasks have an average length of 5k-15k tokens, for a total of 4,750 test instances. The evaluation results are as follows:

+| Capability Dimension | Dataset | XVERSE-13B-256K | GPT-3.5-Turbo-16K | Yi-6B-200K | LongChat-7B-16K | Llama2-7B-Chat-4K |
 | :--------: | :-------------------: | :----: | :----------: | :--------: | :-----------: | :--------: |
+| multi-document QA | HotpotQA | 58.3 | 51.6 | 48.3 | 22.4 | 24.3 |
+| | DuReader | 28.9 | 28.7 | 14.2 | 19.1 | 1.9 |
+| single-document QA | NarrativeQA | 24.1 | 23.6 | 14.5 | 21.6 | 19.1 |
+| | Qasper | 30.2 | 43.3 | 21.6 | 21.6 | 19.6 |
+| summarization | VCSUM | 11.3 | 16.0 | 8.2 | 14.0 | 0.2 |
+| few-shot | TREC | 72.0 | 68.0 | 71.0 | 61.5 | 60.5 |
+| | LSHT | 35.0 | 29.2 | 38.0 | 20.8 | 19.8 |
+| synthetic tasks | PassageRetrieval-en | 63.0 | 71.0 | 6.0 | 24.0 | 9.2 |
+| | PassageRetrieval-zh | 44.0 | 77.5 | 7.9 | 4.8 | 0.5 |
+| code completion | RepoBench-P | 55.6 | 53.6 | 61.5 | 54.7 | 42.4 |

 For all the comparison models mentioned above, we prioritize their officially published results. Where official results are unavailable, we use numbers obtained from our own evaluation pipeline.
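For a quick read of the table above, the per-model scores can be macro-averaged over the ten listed tasks. This aggregate is computed here for illustration only (an unweighted mean over this task subset, not a statistic the README reports); the numbers are copied from the table:

```python
# Scores per model, in the table's row order: HotpotQA, DuReader,
# NarrativeQA, Qasper, VCSUM, TREC, LSHT, PassageRetrieval-en,
# PassageRetrieval-zh, RepoBench-P.
scores = {
    "XVERSE-13B-256K":   [58.3, 28.9, 24.1, 30.2, 11.3, 72.0, 35.0, 63.0, 44.0, 55.6],
    "GPT-3.5-Turbo-16K": [51.6, 28.7, 23.6, 43.3, 16.0, 68.0, 29.2, 71.0, 77.5, 53.6],
}
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}
```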

 ### Loading with Transformers

 Requirements:

 transformers==4.31.0

 The XVERSE-13B-256K model can be loaded for chat using the following code:

  ```python