yuyijiong/Qwen-14b-chat-yarn-32k

@@ -11,10 +11,14 @@ pipeline_tag: text-generation
 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
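-* Updated on December 16, 2023: Released the [paper (Chinese version)](https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/) and the [paper (English version)](https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)
 * Updated on December 14, 2023: Released the fine-tuned Qwen-14b-chat-yarn-32k. The fine-tuned model handles Chinese and English question answering at lengths of up to 32k (about 40,000 Chinese characters). Compared with the previous 32k model obtained through position interpolation, it almost completely resolves the low recall in multi-document question answering (the "lost in the middle" phenomenon).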
 <br>
 <br>
 # LongBench test results
 ### Evaluation results for passage_retrieval_zh in LongBench
 | Model                          | Score (acc) |
@@ -73,9 +77,16 @@ print(response)
 <br>
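 # Q&A examples
 * The model supports both Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* After instruction fine-tuning, the model first quotes the original text and then answers when given a multi-document Q&A question, which significantly reduces internal hallucinations.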
 <details>
     <summary>Multi-Doc QA</summary>

 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
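+* Updated on December 16, 2023: Released the [paper (Chinese version)](https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/) and the [paper (English version)](https://arxiv.org/abs/2312.11193)
 * Updated on December 14, 2023: Released the fine-tuned Qwen-14b-chat-yarn-32k. The fine-tuned model handles Chinese and English question answering at lengths of up to 32k (about 40,000 Chinese characters). Compared with the previous 32k model obtained through position interpolation, it almost completely resolves the low recall in multi-document question answering (the "lost in the middle" phenomenon).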
 <br>
+# Qwen-14b-chat model with 32k context window
 <br>
 # LongBench test results
 ### Evaluation results for passage_retrieval_zh in LongBench
 | Model                          | Score (acc) |
 <br>
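+# Limitations
+1. The instruction fine-tuning data contains no samples where "none of the reference documents contain relevant information". As a result, when the reference documents contain no relevant information, the model may hallucinate and fabricate an answer instead of replying "no information found, unable to answer". The data may be improved in the future.
+2. The instruction fine-tuning data is still not diverse enough. Although it already covers several long-text scenarios, the model may struggle to adapt to new ones, such as agent capabilities or comparing two long documents. Data diversity may be increased in the future.
+3. Due to limited time and computing resources, the model has been evaluated on too few datasets, and a comprehensive capability evaluation is still lacking.
+4. Due to limited time and computing resources, I trained and tested the model only at a length of 32k. Whether the same method can adapt the model to a longer context window (for example, 100k) remains to be studied.
 # Q&A examples
 * The model supports both Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
+* After instruction fine-tuning, the model first quotes the original text and then answers when given a multi-document Q&A question. This answer format significantly reduces internal hallucinations.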
 <details>
     <summary>Multi-Doc QA</summary>

README_en.md CHANGED

@@ -11,9 +11,12 @@ pipeline_tag: text-generation
 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
-* Updated on December 16, 2023: Released the [paper](https://cloud.tsinghua.edu.cn/d/5894ec4442e54a6aac96/)
 * Updated on December 14, 2023: We have released the Qwen-14b-chat-yarn-32k model, which has been fine-tuned to handle Chinese and English question-answering tasks with a length of up to 32k (approximately 40,000 Chinese characters). Compared with the previous 32k model obtained through position interpolation, it almost completely resolves the low recall in multi-document question-answering tasks (the "lost in the middle" phenomenon). <br>
 <br>
 # Evaluation results in LongBench
 ### Evaluation results for passage_retrieval_zh in LongBench
@@ -72,9 +75,16 @@ During training, use_dynamic_ntk was set to True.
 <br>
 # Model Q&A examples
 * The model supports both Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* After instruction fine-tuning, the model first quotes the original text and then answers when given a multi-document Q&A question, which significantly reduces internal hallucinations.
 <details>
     <summary>Multi-Doc QA</summary>

 ---
 **Read this in other languages: [English](README_en.md), [中文](README.md).**
+* Updated on December 16, 2023: Released the [paper](https://arxiv.org/abs/2312.11193)
 * Updated on December 14, 2023: We have released the Qwen-14b-chat-yarn-32k model, which has been fine-tuned to handle Chinese and English question-answering tasks with a length of up to 32k (approximately 40,000 Chinese characters). Compared with the previous 32k model obtained through position interpolation, it almost completely resolves the low recall in multi-document question-answering tasks (the "lost in the middle" phenomenon). <br>
 <br>
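The "32k (approximately 40,000 Chinese characters)" figure above implies roughly 1.25 Chinese characters per token. As a rough sketch only, that ratio can be used to estimate whether a set of documents is likely to fit in the context window before calling the model. The names `estimate_tokens` and `fits_in_context`, the reading of "32k" as 32,000, and the `reserved_tokens` default are all assumptions made for illustration; an exact count would require the model's own tokenizer.

```python
import math

# Assumptions derived from the README's own figures, not measured values:
# "32k" is read as 32,000 tokens, which the README equates to ~40,000 Chinese characters.
CONTEXT_TOKENS = 32_000
CHARS_PER_TOKEN = 40_000 / CONTEXT_TOKENS  # 1.25 characters per token


def estimate_tokens(text: str) -> int:
    """Roughly estimate the token count of Chinese text from its character count."""
    return math.ceil(len(text) / CHARS_PER_TOKEN)


def fits_in_context(documents: list[str], reserved_tokens: int = 1_000) -> bool:
    """Check whether the documents plausibly fit in the window.

    reserved_tokens is a hypothetical allowance for the question and the answer.
    """
    total = sum(estimate_tokens(doc) for doc in documents)
    return total + reserved_tokens <= CONTEXT_TOKENS


if __name__ == "__main__":
    docs = ["字" * 10_000, "字" * 15_000]
    print(estimate_tokens(docs[0]), fits_in_context(docs))
```

The ratio is much lower for English text, so this estimate is only meaningful for mostly Chinese input.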
+# Qwen-14b-chat model with 32k context window
 # Evaluation results in LongBench
 ### Evaluation results for passage_retrieval_zh in LongBench
 <br>
+# Limitations
+1. The instruction fine-tuning data contains no samples where "there is no relevant information in any of the reference documents." As a result, when the reference documents contain no relevant information, the model may hallucinate and fabricate an answer instead of replying "no information found, unable to answer." The data may be improved in the future.
+2. The instruction fine-tuning data is still not diverse enough. Although it already covers several long-text scenarios, the model may struggle to adapt to new ones, such as agent capabilities or comparing two long documents. Data diversity may be increased in the future.
+3. Due to limited time and computing resources, the model has been evaluated on too few datasets, and a comprehensive capability evaluation is still lacking.
+4. Due to limited time and computing resources, I trained and tested the model only at a length of 32k. Whether the same method can adapt the model to a longer context window (for example, 100k) remains to be studied.
 # Model Q&A examples
 * The model supports both Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
+* After instruction fine-tuning, the model first quotes the original text and then answers when given a multi-document Q&A question. This answer format significantly reduces internal hallucinations.
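Because the model quotes the original text before answering, one possible way to spot a fabricated answer is to check that the quoted passage really occurs in the reference documents. The snippet below is a minimal sketch under assumptions, not part of the released code: the helper name `quote_is_grounded` is hypothetical, and the quoted passage is assumed to have already been separated from the rest of the response.

```python
import re


def _normalize(text: str) -> str:
    """Collapse whitespace runs so line breaks do not affect matching."""
    return re.sub(r"\s+", " ", text).strip()


def quote_is_grounded(quote: str, documents: list[str]) -> bool:
    """Return True if the quoted passage appears verbatim in any document.

    Hypothetical helper: it only detects exact matches (after whitespace
    normalization) and will reject paraphrased quotes.
    """
    needle = _normalize(quote)
    if not needle:
        return False  # an empty quote gives no evidence of grounding
    return any(needle in _normalize(doc) for doc in documents)


if __name__ == "__main__":
    docs = [
        "Paris is the capital of France.",
        "Berlin is the capital\nof Germany.",
    ]
    print(quote_is_grounded("Berlin is the capital of Germany.", docs))
    print(quote_is_grounded("Madrid is the capital of Spain.", docs))
```

A check like this could also flag the case noted under Limitations, where no reference document contains relevant information and the model invents a source passage.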
 <details>
     <summary>Multi-Doc QA</summary>