Tags: Text Generation · Transformers · Safetensors · Chinese · English · qwen · conversational · custom_code
yuyijiong committed ae57e9f (1 parent: 815cfe7): Upload 2 files
Files changed (2):
  1. README.md (+4 -6)
  2. README_en.md (+2 -4)
README.md CHANGED
@@ -18,13 +18,12 @@ pipeline_tag: text-generation
 * 2023.12.14 update: released the fine-tuned Qwen-14b-chat-yarn-32k. The fine-tuned model handles Chinese and English Q&A over 32k-length contexts (about 40,000 Chinese characters) and, compared with the earlier 32k model obtained through position interpolation, almost completely eliminates the low recall in multi-document Q&A tasks (the "lost in the middle" phenomenon).
 <br>
 
-# Qwen-14b-chat model supporting a 32k context (automatically extensible to 50k and beyond)
+# Qwen-14b-chat model supporting a 32k context
 
 ## Main features of the model:
 * Based on Qwen-14b-chat, instruction-tuned on an "original-text paraphrasing" task
 * Uses the YaRN interpolation method, so the model can handle texts of 32k or even longer
-* At inference time, gives highly accurate answers without any specially designed prompt.
-* Qwen's original capabilities are not degraded; it still handles a wide range of tasks.
+* At inference time, gives highly accurate answers without any specific prompt.
 
 <br>
 
@@ -97,15 +96,14 @@ print(response)
 <br>
 
 # Limitations
-1. The instruction-tuning data contains no samples where "none of the reference documents contain the relevant information", so when the reference documents lack the answer the model may hallucinate and fabricate one instead of replying "no information found, cannot answer". The data may be improved later.
+1. Paraphrasing the original text may make the model's answers overly verbose and not concise enough; the data may be improved later.
 2. The instruction-tuning data is still not diverse enough: although it covers several long-text scenarios, the model may struggle with new long-text scenarios such as agent capabilities or comparing two long documents. More diverse data may be added later.
 3. Due to limited time and compute, the evaluation datasets are not extensive, and a comprehensive capability evaluation is still lacking.
-4. Due to limited time and compute, I only trained and tested at a length of 32k; whether the same method can adapt the model to a longer context window (e.g. 100k) remains to be studied.
+4. The model's average score on C-Eval dropped from 69.1 to 60.8; its general capability declined somewhat, but it still surpasses Qwen-7b.
 
 
 # Q&A examples
 * The model supports Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* After instruction tuning, the model first quotes the original text and then answers multi-document Q&A questions; this answer format significantly reduces internal hallucination.
 
 <details>
 <summary>Multi-document QA</summary>
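The diff above distinguishes the earlier 32k model, obtained through position interpolation, from the new YaRN-tuned one. As a rough illustration of the difference (not this model's actual code), linear position interpolation divides every RoPE frequency by the scale factor, while NTK-style scaling (the idea YaRN builds on, and related to the `use_dynamic_ntk` setting mentioned in README_en.md) enlarges the rotary base so that high frequencies barely change:

```python
import math

def rope_freqs(dim, base=10000.0):
    # Standard RoPE inverse frequencies for a head dimension `dim`.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def position_interpolation(freqs, scale):
    # Linear position interpolation: every frequency is divided by `scale`,
    # squeezing `scale`-times more positions into the trained range.
    return [f / scale for f in freqs]

def ntk_scaled_freqs(dim, scale, base=10000.0):
    # NTK-aware scaling: enlarge the base instead, so high frequencies
    # (local detail) change little while low frequencies stretch the most.
    new_base = base * scale ** (dim / (dim - 2))
    return rope_freqs(dim, new_base)

dim, scale = 128, 4  # e.g. extending a model by 4x context length
pi = position_interpolation(rope_freqs(dim), scale)
ntk = ntk_scaled_freqs(dim, scale)
```

Note the design difference this exposes: NTK scaling leaves the highest frequency untouched (`ntk[0] == 1.0`, versus `0.25` after interpolation), while at the lowest frequency it matches plain interpolation exactly.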
README_en.md CHANGED
@@ -94,15 +94,13 @@ During training, use_dynamic_ntk was set to True.
 <br>
 
 # Limitations
-1. The instruction fine-tuning data does not include samples where "there is no relevant information in any of the reference documents", which may cause the model to hallucinate and fabricate answers when the reference documents contain no relevant information. The data may be improved in the future.
+1. "Paraphrasing" the original text may cause the model's answers to be too verbose and not concise enough. This may be improved in the future.
 2. The types of training data for instruction fine-tuning are still not diverse enough, which may make it difficult to adapt to new long-text scenarios, such as agent capabilities or comparing two long documents. Increasing data diversity may be considered in the future.
-3. Due to limited time and computing resources, the evaluation of the model is not extensive enough, and a comprehensive capability evaluation is still lacking.
-4. Due to limited time and computing resources, I only trained and tested the model at a length of 32k. Whether the same approach can adapt the model to a longer context window, such as 100k, remains to be studied.
-
+3. Due to limited time and computing resources, the evaluation of the model is not extensive enough, and a comprehensive capability evaluation is still lacking.
+4. The model's average score on C-Eval dropped from 69.1 to 60.8; its general capabilities declined, but it still outperforms Qwen-7b.
 
 # Model Q&A examples
 * The model supports both Chinese and English, and supports tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* The model will first provide the original text before answering multi-document Q&A questions; this answer format significantly reduces internal hallucinations.
 
 <details>
 <summary>Multi-Doc QA</summary>
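The `print(response)` context in the hunk header above suggests the README's usage section follows the upstream Qwen chat API. A minimal loading sketch, assuming the repo id `yuyijiong/Qwen-14b-chat-yarn-32k` and the standard Qwen `model.chat()` helper (`trust_remote_code=True` is needed for the custom modeling code implied by the `custom_code` tag):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this model; adjust if the hub path differs.
model_id = "yuyijiong/Qwen-14b-chat-yarn-32k"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
).eval()

# Qwen's custom code exposes a chat() helper that returns
# the response string and the updated conversation history.
response, history = model.chat(
    tokenizer, "Summarize the document above.", history=None
)
print(response)
```

This requires enough GPU memory for a 14b model; `device_map="auto"` lets accelerate shard it across available devices.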