Tags: Text Generation · Transformers · Safetensors · Chinese · English · qwen · conversational · custom_code
yuyijiong committed ae57e9f (1 parent: 815cfe7): Upload 2 files
Files changed (2):
  1. README.md (+4 -6)
  2. README_en.md (+2 -4)
README.md CHANGED
@@ -18,13 +18,12 @@ pipeline_tag: text-generation
 * 2023.12.14 update: released the fine-tuned Qwen-14b-chat-yarn-32k. The fine-tuned model handles Chinese and English Q&A over 32k-length contexts (about 40,000 Chinese characters) and, compared with the earlier 32k model obtained through position interpolation, almost completely eliminates the low recall in multi-document Q&A tasks (the "lost in the middle" phenomenon).
 <br>
 
-# Qwen-14b-chat model supporting a 32k context (automatically extensible to 50k and beyond)
+# Qwen-14b-chat model supporting a 32k context
 
 ## Main features of the model:
 * Based on Qwen-14b-chat, instruction-tuned on an "original-text paraphrasing" task
 * Uses the YaRN interpolation method, so the model can handle texts of 32k or even longer
-* At inference time, gives highly accurate answers without any specially designed prompt.
-* Qwen's original capabilities are not degraded; it still handles a wide range of tasks.
+* At inference time, gives highly accurate answers without any specific prompt.
 
 <br>
 
@@ -97,15 +96,14 @@ print(response)
 <br>
 
 # Limitations
-1. The instruction-tuning data contains no samples where "none of the reference documents contain the relevant information", so when the reference documents lack the answer the model may hallucinate and fabricate one instead of replying "no information found, cannot answer". The data may be improved later.
+1. Paraphrasing the original text may make the model's answers overly verbose and not concise enough; the data may be improved later.
 2. The instruction-tuning data is still not diverse enough: although it covers several long-text scenarios, the model may struggle with new long-text scenarios such as agent capabilities or comparing two long documents. More diverse data may be added later.
 3. Due to limited time and compute, the evaluation datasets are not extensive, and a comprehensive capability evaluation is still lacking.
-4. Due to limited time and compute, I only trained and tested at a length of 32k; whether the same method can adapt the model to a longer context window (e.g. 100k) remains to be studied.
+4. The model's average score on C-Eval dropped from 69.1 to 60.8; its general capability declined somewhat, but it still surpasses Qwen-7b.
 
 
 # Q&A examples
 * The model supports Chinese and English, and handles tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* After instruction tuning, the model first quotes the original text and then answers multi-document Q&A questions; this answer format significantly reduces internal hallucination.
 
 <details>
 <summary>Multi-document QA</summary>
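The diff above distinguishes the earlier 32k model, obtained through position interpolation, from the new YaRN-tuned one. As a rough illustration of the difference (not this model's actual code), linear position interpolation divides every RoPE frequency by the scale factor, while NTK-style scaling (the idea YaRN builds on, and related to the `use_dynamic_ntk` setting mentioned in README_en.md) enlarges the rotary base so that high frequencies barely change:

```python
import math

def rope_freqs(dim, base=10000.0):
    # Standard RoPE inverse frequencies for a head dimension `dim`.
    return [base ** (-2 * i / dim) for i in range(dim // 2)]

def position_interpolation(freqs, scale):
    # Linear position interpolation: every frequency is divided by `scale`,
    # squeezing `scale`-times more positions into the trained range.
    return [f / scale for f in freqs]

def ntk_scaled_freqs(dim, scale, base=10000.0):
    # NTK-aware scaling: enlarge the base instead, so high frequencies
    # (local detail) change little while low frequencies stretch the most.
    new_base = base * scale ** (dim / (dim - 2))
    return rope_freqs(dim, new_base)

dim, scale = 128, 4  # e.g. extending a model by 4x context length
pi = position_interpolation(rope_freqs(dim), scale)
ntk = ntk_scaled_freqs(dim, scale)
```

Note the design difference this exposes: NTK scaling leaves the highest frequency untouched (`ntk[0] == 1.0`, versus `0.25` after interpolation), while at the lowest frequency it matches plain interpolation exactly.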
README_en.md CHANGED
@@ -94,15 +94,13 @@ During training, use_dynamic_ntk was set to True.
 <br>
 
 # Limitations
-1. The instruction fine-tuning data does not include samples where "there is no relevant information in any of the reference documents", which may cause the model to hallucinate and fabricate answers when the reference documents contain no relevant information. The data may be improved in the future.
+1. "Paraphrasing" the original text may cause the model's answers to be too verbose and not concise enough. This may be improved in the future.
 2. The types of training data for instruction fine-tuning are still not diverse enough, which may make it difficult to adapt to new long-text scenarios, such as agent capabilities or comparing two long documents. Increasing data diversity may be considered in the future.
-3. Due to limited time and computing resources, the evaluation of the model is not extensive enough, and a comprehensive capability evaluation is still lacking.
-4. Due to limited time and computing resources, I only trained and tested the model at a length of 32k. Whether the same approach can adapt the model to a longer context window, such as 100k, remains to be studied.
-
+3. Due to limited time and computing resources, the evaluation of the model is not extensive enough, and a comprehensive capability evaluation is still lacking.
+4. The model's average score on C-Eval dropped from 69.1 to 60.8; its general capabilities declined, but it still outperforms Qwen-7b.
 
 # Model Q&A examples
 * The model supports both Chinese and English, and supports tasks such as long-text summarization, multi-document Q&A, long-text Q&A, and multi-turn dialogue.
-* The model will first provide the original text before answering multi-document Q&A questions; this answer format significantly reduces internal hallucinations.
 
 <details>
 <summary>Multi-Doc QA</summary>
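The `print(response)` context in the hunk header above suggests the README's usage section follows the upstream Qwen chat API. A minimal loading sketch, assuming the repo id `yuyijiong/Qwen-14b-chat-yarn-32k` and the standard Qwen `model.chat()` helper (`trust_remote_code=True` is needed for the custom modeling code implied by the `custom_code` tag):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo id for this model; adjust if the hub path differs.
model_id = "yuyijiong/Qwen-14b-chat-yarn-32k"

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, device_map="auto"
).eval()

# Qwen's custom code exposes a chat() helper that returns
# the response string and the updated conversation history.
response, history = model.chat(
    tokenizer, "Summarize the document above.", history=None
)
print(response)
```

This requires enough GPU memory for a 14b model; `device_map="auto"` lets accelerate shard it across available devices.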