---
library_name: transformers
language:
- zh
---

# Kyara: Knowledge Yielding Adaptive Retrieval Augmentation for LLM Fine-tuning

<p align="left">
    🤗 <a href="https://huggingface.co/zake7749/gemma-2-2b-it-chinese-kyara-dpo">Hugging Face</a>  | 🚀<a href="https://github.com/zake7749/kyara">Github</a>  |  📑 <a href="#">Paper</a>  |  📖 <a href="#">English</a>  |  📖 <a href="https://github.com/zake7749/kyara">Chinese</a>
</p>
<div style="text-align: center;">
  <img src="https://i.imgur.com/QiWlcYJ.jpeg" alt="kyara"/>
</div>

Kyara is an experimental strategy for fine-tuning language models, designed to enhance a model's knowledge adaptation and language understanding capabilities through knowledge retrieval augmentation.

To validate the effectiveness of this method, we conducted full-parameter fine-tuning on `Gemma-2-2b-it`, resulting in the first version of the Kyara model. Preliminary evaluation results can be seen in the [Benchmark](#benchmark) section.

## Table of Contents

### **General Benchmark**

All evaluations are zero-shot.

| Metric | Kyara-2b-it | Gemma-2-2b-it |
|--------------------------|----------|-------------|
| **[TMMLUPlus](https://huggingface.co/datasets/ikala/tmmluplus)** | **39.22** | 36.73 |
|  - STEM | **40.86** | 37.84 |
| **[CRUX](https://github.com/yuchenlin/ZeroEval)** | **21.25** | 20.75 |
| **Chinese-Reason-Bench** | **3.97** | 3.44 |

The aggregation method for TMMLUPlus's groups is macro average, following the practice in the official implementation.
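
For instance, a group's macro average weights each subject equally, regardless of how many questions each subject contains; a small illustration with made-up numbers:

```python
# Macro average: each subject in a group contributes equally,
# regardless of its question count. The numbers here are made up.
stem_subjects = {"physics": 40.0, "chemistry": 39.5, "biology": 43.1}
stem_score = sum(stem_subjects.values()) / len(stem_subjects)
print(f"STEM (macro average): {stem_score:.2f}")  # 40.87
```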

### **Alignment Benchmark**

| Metric | Kyara | Gemma-2-2b-it |
|--------------------------|----------|-------------|
| **[AlpacaEval-LC](https://github.com/tatsu-lab/alpaca_eval)** | **41.15** | 32.37 |
| **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)** | **43.95** | 32.94 |

## Features

Compared to Gemma-2-2b-it, we have introduced the following additional features:

### System Prompt

Kyara supports system prompts, as in the following example:

#### Input

```
System: 你是一個愛賣弄知識的人,總是喜歡引用典故,把簡單的事情說得非常複雜。
User: 什麼是大型語言模型?
```

#### Output

```
大型語言模型(Large Language Model,簡稱LLM)是一種基於深度學習技術的自然語言處理(NLP)模型。這些模型通過大量文本數據進行訓練,以學習和生成人類語言。具體來說,它們屬於一種被稱為「變換器」(Transformer)的架構。

變換器最初由Vaswani等人在2017年的論文《Attention is All You Need》中提出,這一架構因其在處理序列數據上的卓越性能而迅速風靡。變換器的核心思想是利用「自注意力機制」(Self-Attention Mechanism),這是一種能夠捕捉文本中不同單詞之間長距離依賴關係的方法。

大型語言模型通常具有以下幾個特徵:

1. **訓練數據量巨大**:這些模型通常需要數百GB甚至TB級的訓練數據,以捕捉語言的豐富性和多樣性。

2. **模型規模龐大**:這意味著模型包含大量參數,常見的規模包括幾十億到數萬億個參數。這些參數的數量決定了模型的複雜度和生成文本的質量。

3. **多語言能力**:現代的LLM如GPT-3、BERT等,不僅能夠處理多種語言,還能夠在不同語言之間轉換和理解上下文。

4. **應用廣泛**:這些模型在文本生成、情感分析、問答系統、機器翻譯、摘要生成等多個NLP任務中都有廣泛應用。

例如,OpenAI的GPT-3包含1750億個參數,這使其成為一個極其強大的語言生成模型。BERT則以其雙向編碼器和掩碼語言建模技術,在多種下游任務中取得了卓越的成績。

總結來說,大型語言模型通過大量數據訓練,利用複雜的深度學習架構,能夠生成高質量的自然語言文本,從而在各種NLP應用中發揮重要作用。
```
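
For reference, here is a minimal usage sketch with `transformers`, assuming the checkpoint's chat template accepts the `system` role shown above:

```python
# A minimal sketch, not an official usage guide: querying Kyara with a
# system prompt through the transformers chat-template API. We assume the
# checkpoint's chat template accepts the "system" role used above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zake7749/gemma-2-2b-it-chinese-kyara-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "你是一個愛賣弄知識的人,總是喜歡引用典故,把簡單的事情說得非常複雜。"},
    {"role": "user", "content": "什麼是大型語言模型?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```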

#### Input

```
System: 你是一名小學老師,會用五歲小孩也能懂的口吻回答使用者的問題
User: 什麼是大型語言模型?
```

#### Output

### Retrieval Augmented Generation (Experimental)

Benefiting from Kyara's training method, we incorporated RAG-related content during the SFT phase. You can refer to the following examples to construct task templates:

#### Input

#### Output

```
因此,為了維護呼吸道健康,應儘量通過鼻子進行呼吸,以減少上述問題的發生。
```
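
As a rough illustration, retrieved passages might be assembled into a task template as follows; the field names and layout here are our assumptions for illustration, not Kyara's verbatim training format:

```python
# Hypothetical sketch: assembling retrieved passages into a RAG prompt.
# The layout below is an illustrative assumption, not the exact template
# Kyara was trained on.
def build_rag_prompt(question: str, passages: list[str]) -> str:
    # Number each retrieved passage so the model can refer back to them.
    references = "\n\n".join(
        f"[{i + 1}] {passage}" for i, passage in enumerate(passages)
    )
    return (
        "請根據以下參考資料回答問題。\n\n"
        f"參考資料:\n{references}\n\n"
        f"問題:{question}"
    )

print(build_rag_prompt("為什麼用鼻子呼吸比較好?", ["鼻腔可以過濾與加濕吸入的空氣。"]))
```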

Please refer to [Kyara-RAG](https://github.com/zake7749/kyara-rag) for more details.

## Method

The following sections provide a brief summary of Kyara's implementation strategy.

### Dataset Summary

We have collected a total of 2.6M conversations, approximately 3.54 billion tokens. The following provides an overview of the language distribution and conversation rounds.

* Language:

<img src="https://i.imgur.com/KvVjti4.png" alt="language-distribution" width="500"/>

* Conversation Rounds:

<img src="https://i.imgur.com/dekAnU0.png" alt="conv-round-distribution" width="500"/>

#### Base Dataset: Knowledge Injection with Retrieval Augmentation

We developed a knowledge search system using open Chinese knowledge corpora, integrated with [QDrant](https://qdrant.tech/). To construct Supervised Fine-Tuning (SFT) pairs, we followed this process (a code sketch follows the list):

1. Sample documents from the knowledge base and generate knowledge-intensive questions that users might ask based on these texts.
2. (Optional) Increase instruction complexity using [Evol-Instruct](https://arxiv.org/pdf/2304.12244).
3. Apply query expansion on the generated instructions to retrieve an additional Top K documents and individually assess their relevance:
    * For relevant documents, use an LLM to summarize key information related to the question.
    * For irrelevant documents, ignore them.
4. Let the LLM generate a detailed and comprehensive response according to the original document and the K supplementary references.
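
A condensed sketch of steps 3 and 4, assuming a populated QDrant collection and a sentence-embedding model; the collection name, embedding model, and the `llm()` helper are illustrative stand-ins, not our actual configuration:

```python
# Condensed sketch of steps 3-4. Assumes a populated QDrant collection
# ("zh_knowledge") and a sentence-embedding model; llm() is a hypothetical
# helper standing in for the actual LLM calls.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
encoder = SentenceTransformer("intfloat/multilingual-e5-base")

def build_sft_response(question: str, source_doc: str, top_k: int = 5) -> str:
    # Step 3: retrieve Top K supplementary documents for the instruction.
    hits = client.search(
        collection_name="zh_knowledge",
        query_vector=encoder.encode(question).tolist(),
        limit=top_k,
    )
    references = []
    for hit in hits:
        passage = hit.payload["text"]
        # Judge relevance; summarize relevant passages, ignore the rest.
        if llm(f"這段文字與問題「{question}」相關嗎?只答是或否。\n{passage}") == "是":
            references.append(llm(f"請摘要與問題「{question}」相關的重點:\n{passage}"))
    # Step 4: answer from the original document plus the K summaries.
    context = "\n\n".join([source_doc, *references])
    return llm(f"請根據以下資料,詳盡且全面地回答問題。\n\n{context}\n\n問題:{question}")
```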

In addition, we also ask the LLM to generate a user prompt for high-quality documents, and pair the (generated prompt, original document) as an SFT example.

##### Chinese Math Dataset

* Dataset: [zake7749/kyara-chinese-math-sft-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-math-sft-s0-30K)

While the aforementioned strategy can generate a wide range of knowledge-based texts, it primarily falls within the scope of information-seeking tasks and is not very effective in constructing mathematical and reasoning-related content. To address this, we generated 50,000 math problems based on [PersonaHub](https://huggingface.co/datasets/proj-persona/PersonaHub). We then used `Gemini-1.5-Flash` to filter out data with obvious errors in calculation and reasoning, thereby creating [kyara-chinese-math-sft-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-math-sft-s0-30K).
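
A rough sketch of this generate-then-filter loop, where `generate()` and `judge()` are hypothetical stand-ins for the persona-conditioned generator and the `Gemini-1.5-Flash` filter:

```python
# Hypothetical sketch of persona-driven generation and filtering.
# generate() and judge() stand in for the actual LLM calls.
def build_math_dataset(personas: list[str]) -> list[dict]:
    kept = []
    for persona in personas:
        problem = generate(f"以這個角色的視角出一道數學應用題並給出詳細解答:{persona}")
        # Keep only problems whose calculation and reasoning pass the check.
        verdict = judge(f"檢查以下題目與解答是否有計算或推理錯誤,只答「正確」或「錯誤」:\n{problem}")
        if verdict == "正確":
            kept.append({"persona": persona, "problem": problem})
    return kept
```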

#### High Quality Dataset: Model Refinement

After completing supervised learning on the base dataset, we fine-tune the LLM again on a high-quality subset, primarily to address the following three issues:

1. Some responses in the Base Dataset were generated by smaller models, which sometimes performed poorly at following instructions.
2. We used various LLMs in the previous step to introduce knowledge diversity and language adaptability. However, we discovered subtle differences in response templates and reasoning approaches between different LLMs, leading to occasional instability in the trained Chat Model. Therefore, we introduced a high-quality small dataset, using a single strong LLM to generate QA pairs.
3. The Base Dataset includes some Q&A pairs composed of generated queries and original documents. While these data are rich in knowledge, they are relatively weak at instruction following.

To balance data diversity and quality, we adopted a strategy similar to [InsTag](https://arxiv.org/abs/2308.07074) to classify the data. We then used [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) and an LLM judge to evaluate data quality, finally extracting the best training data from each category to create a Stage 1 Dataset of about 200K examples, which was used to fine-tune the Kyara-SFT model again.
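
A minimal sketch of the per-category selection, assuming each example already carries an InsTag-style `tag` and a scalar `score` standing in for the ArmoRM / LLM-judge rating:

```python
# Minimal sketch of the per-category selection step. Assumes each example
# already carries an InsTag-style "tag" and a scalar "score" standing in
# for the ArmoRM / LLM-judge quality rating.
from collections import defaultdict

def select_stage1(examples: list[dict], per_tag: int) -> list[dict]:
    by_tag: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_tag[ex["tag"]].append(ex)
    selected = []
    for group in by_tag.values():
        # Keep the highest-scoring examples within each category.
        group.sort(key=lambda ex: ex["score"], reverse=True)
        selected.extend(group[:per_tag])
    return selected
```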

### Preference Learning

We introduced preference learning in Kyara, which allows the model's responses to better align with human preferences while enhancing programming skills and mathematical reasoning abilities.

Kyara's preference learning strategy utilizes [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290), integrating two custom-built Chinese datasets alongside two English datasets:

* [argilla/ultrafeedback-binarized-preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences)
* [xinlai/Math-Step-DPO-10K](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K)

Here, we summarize the construction strategy of the Chinese datasets.

#### Chinese DPO

##### [SPIN/SPPO](https://github.com/uclaml/SPIN)

We followed the original design, using Kyara-SFT to generate a set of contrastive data for the High Quality Dataset.
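
A minimal sketch of that pairing, following SPIN's convention that the reference answer is `chosen` and the current model's own generation is `rejected`; `kyara_generate()` is a hypothetical wrapper around the Kyara-SFT model:

```python
# Minimal sketch of SPIN-style pair construction: the high-quality
# reference answer is "chosen" and the current model's own generation is
# "rejected". kyara_generate() is a hypothetical helper.
def build_spin_pairs(dataset: list[dict]) -> list[dict]:
    pairs = []
    for example in dataset:
        pairs.append({
            "prompt": example["prompt"],
            "chosen": example["response"],                  # reference answer
            "rejected": kyara_generate(example["prompt"]),  # self-generated
        })
    return pairs
```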

##### RLAIF

Dataset: [zake7749/kyara-chinese-preference-dpo-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-preference-dpo-s0-30K)

We extracted Chinese prompts from `Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese`, `hfl/stem_zh_instruction`, and `FreedomIntelligence/Evol-Instruct-Chinese-GPT4`, and distributed the same prompt to four different LLMs. The competitors include:

* GPT-4o
* GPT-4-0618
* Qwen2-72B-Instruct
* DeepSeek V2

After generating the responses, we asked the LLMs to judge which one was better, using the following prompt:

```
**[Task]**

---
```

Finally, all four datasets were combined for DPO training.
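
As an illustration, a minimal DPO training sketch with a recent version of [trl](https://github.com/huggingface/trl); the base model, dataset mixture, and hyperparameters below are placeholders rather than Kyara's actual recipe:

```python
# A minimal DPO training sketch with trl (recent versions). The base model,
# dataset mixture, and hyperparameters are placeholders, not Kyara's actual
# recipe; we assume the dataset provides prompt/chosen/rejected columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One of the four preference sets; the others are mixed in the same way.
dataset = load_dataset("zake7749/kyara-chinese-preference-dpo-s0-30K", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="kyara-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```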

## Limitation

Like most models, Kyara still suffers from hallucinations and, possibly due to the nature of its training data, particularly enjoys quoting references when answering questions. For example, if you ask the model to explain a line from "Legend of Mortal":

```
如何解釋「俠者,不受名利所牽,但求俯仰無愧。正邪存乎在我,聲名留問諸君。非我桀傲不馴,乃因唐門薪火,自古如今!」
```

The model would respond that this quote is from "The Legend of the Condor Heroes", even though the line actually comes from "Legend of Mortal":

```
這段話出自《射鵰英雄傳》中的唐門內部經典,由郭靖和黃蓉在唐門大殿上宣讀。這段話主要表達了唐門門人對於俠義精神和內在修為的追求,以及他們對外界的名利和聲名的不屑。以下是詳細解釋:

總結來說,這段話強調了唐門門人對名利和聲名的不屑,他們追求的是內在的修為和道德,不為外界的評價所動搖。這也體現了他們俠義精神和對正義的堅持。
```

It is recommended to exercise caution when using language models.