Update README.md

<img src="https://i.imgur.com/QiWlcYJ.jpeg" alt="kyara"/>
</div>

Kyara (Knowledge Yielding Adaptive Retrieval Augmentation) is an experimental project aimed at improving language models through knowledge retrieval processes. The project seeks to enhance the model's ability to adapt knowledge and improve language comprehension, particularly in underrepresented languages like Traditional Chinese. Given the relatively scarce availability of Traditional Chinese data compared to the vast corpus of English data used for model training, Kyara addresses this gap by expanding the limited corpus for this language.

To validate the effectiveness of Kyara, we performed full-parameter fine-tuning on `Gemma-2-2b-it`, resulting in the first iteration of the Kyara model. Initial evaluation results can be found in the [Benchmark](#benchmark) section.

## Table of Contents

- [Benchmark](#benchmark)
  * [General Benchmark](#general-benchmark)
  * [Alignment Benchmark](#alignment-benchmark)
- [Method](#method)
  * [Dataset Summary](#dataset-summary)
  * [Dataset Construction](#dataset-construction)
    + [Chinese DPO](#chinese-dpo)
      - [SPIN/SPPO](#spinsppo)
      - [RLAIF](#rlaif)
- [Feature](#feature)
  * [Retrieval Augmented Generation (Experimental)](#retrieval-augmented-generation-experimental)
    + [Input](#input)
    + [Output](#output)

## Benchmark

### General Benchmark

All evaluations are conducted in a zero-shot setting.

| Metric | Kyara-2b-it | Gemma-2-2b-it |
|--------------------------|----------|-------------|
| **[TMMLUPlus](https://huggingface.co/datasets/ikala/tmmluplus)** | **41.98** | 36.73 |
|  - STEM | **43.73** | 37.84 |
|  - Humanities | **38.72** | 33.40 |
|  - Other | **40.61** | 36.00 |
|  - Social-Science | **44.88** | 39.69 |
| **[MMLU-Redux](https://github.com/yuchenlin/ZeroEval)** | **55.44** | 51.94 |
| **[GSM8K](https://github.com/yuchenlin/ZeroEval)** | **54.21** | 51.63 |
| **[MATH-L5](https://github.com/yuchenlin/ZeroEval)** | **8.88** | 4.3 |
| **[CRUX](https://github.com/yuchenlin/ZeroEval)** | **22.75** | 21.5 |
| **[ZebraLogic](https://github.com/yuchenlin/ZeroEval)** | **5.2** | 4.2 |
| **Chinese-Reason-Bench** | **4.21** | 3.44 |

The aggregation method for the groups in TMMLUPlus is macro average, following the practice in the official implementation.
66 |
|
67 |
+
### Alignment Benchmark
|
68 |
|
69 |
+
| Metric | Kyara | Gemma-2-2b-it | ChatGPT-3.5-1106 |
|
70 |
+
|--------------------------|----------|---------------|------------------|
|
71 |
+
| **[AlpacaEval-LC](https://github.com/tatsu-lab/alpaca_eval)** | **35.35** | 32.37 | 19.30 |
|
72 |
+
| **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)** | **43.34** | 32.94 | 9.20 |
|
73 |
+
| **[MT-Bench-TW](https://huggingface.co/datasets/MediaTek-Research/TCEval-v2)** | **7.43** | 6.35 | 7.10 |
|
74 |
+
| **[MT-Bench](https://huggingface.co/spaces/lmsys/mt-bench)** | 8.28 | 8.17 | **8.32** |
|
75 |
+
| **[Chatbot-Arena-Hard](https://github.com/lm-sys/arena-hard-auto)** | **22.60** | 19.4 | 18.87 |
|
76 |
|
77 |
#### [AlignBench](https://github.com/THUDM/AlignBench)
|
78 |
|
79 |
+
| Fold | Kyara-2b-it-CHT | Kyara-2b-it-CHS | Gemma-2-2b-it | ChatGPT-3.5-0613 |
|
80 |
+
|-------|-----------------|-----------------|---------------| ---- |
|
81 |
+
| Fundamental Language Ability | 6.72 | 6.54 | 6.42 | **6.92** |
|
82 |
+
| Advanced Chinese Understanding | 5.78 | 5.24 | 5.03 | **5.91** |
|
83 |
+
| Open-ended Questions | **8.16** | 7.79 | 7.52 | 6.47 |
|
84 |
+
| Writing Ability | **7.90** | 7.24 | 7.76 | 7.28 |
|
85 |
+
| Logical Reasoning | **5.26** | 4.27 | 4.20 | 4.79 |
|
86 |
+
| Mathematics | **5.99** | 5.44 | 5.05 | 5.38 |
|
87 |
+
| Task-oriented Role Play | **8.07** | 8.00 | 7.42 | 7.00 |
|
88 |
+
| Professional Knowledge | **6.97** | 6.86 | 5.79 | 6.81 |
|
89 |
+
| Reasoning AVG. | **5.62** | 4.85 | 4.63 | 5.00 |
|
90 |
+
| Chinage Language AVG. | **7.26** | 6.94 | 6.66 | 6.73 |
|
91 |
+
| Overall | **6.44** | 5.90 | 5.64 | 5.91 |
|
92 |
+
|
93 |
+
where the postfixes CHT and CHS represent Traditional Chinese and Simplified Chinese, respectively. To evaluate the performance on Traditional Chinese in AlignBench, we used [OpenCC](https://github.com/BYVoid/OpenCC) with the `s2tw` configuration to convert all questions from Simplified Chinese to Traditional Chinese.
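For reference, this conversion step can be reproduced with OpenCC's Python bindings; a minimal sketch, assuming the `opencc` package is installed:

```python
# pip install opencc
from opencc import OpenCC

# "s2tw" is the configuration mentioned above: Simplified Chinese to
# Traditional Chinese (Taiwan standard).
converter = OpenCC("s2tw")

simplified_question = "什么是大型语言模型?"
traditional_question = converter.convert(simplified_question)
print(traditional_question)  # 什麼是大型語言模型?
```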

## Method

### Dataset Summary

We have collected a total of 3.6M conversations, approximately 4.51 billion tokens. The following provides an overview of the language distribution and conversation rounds.

* Language:

<img src="https://i.imgur.com/fhD5kIy.png" alt="language-distribution" width="500"/>

* Conversation Rounds:

<img src="https://i.imgur.com/CWQ2shj.png" alt="conv-round-distribution" width="500"/>

### Dataset Construction

2. We used various LLMs in the previous step to introduce knowledge diversity and language adaptability. However, we discovered subtle differences in response templates and reasoning approaches between different LLMs, leading to occasional instability in the trained Chat Model. Therefore, we introduced a small, high-quality dataset, using a single strong LLM to generate QA Pairs.
3. The Base Dataset includes some Q&A Pairs composed of generated queries and the original documents. While these data are rich in knowledge, they are relatively weak in terms of instruction following.

To balance data diversity and quality, we adopted a strategy similar to [InsTag](https://arxiv.org/abs/2308.07074) to classify the data. We then used [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) and an LLM Judge to evaluate data quality, finally extracting the best training data from each category to create the Stage 1 Dataset of about 500K, which was used to fine-tune the Kyara-SFT Model again.
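The per-category selection can be pictured as a "bucket by tag, keep the top-scoring samples" pass. The sketch below is illustrative only: `tag_fn` and `score_fn` stand in for the InsTag-style classifier and the ArmoRM / LLM-judge scorer, and the quota is a made-up parameter rather than the project's actual setting.

```python
from collections import defaultdict

def select_stage1_data(samples, tag_fn, score_fn, quota_per_tag=100):
    """Bucket samples by an InsTag-style category, then keep the highest-scoring
    ones in each bucket. `tag_fn` and `score_fn` are placeholder callables,
    not part of the actual Kyara pipeline."""
    buckets = defaultdict(list)
    for sample in samples:
        buckets[tag_fn(sample)].append(sample)

    selected = []
    for tag, group in buckets.items():
        group.sort(key=score_fn, reverse=True)  # best-quality samples first
        selected.extend(group[:quota_per_tag])
    return selected
```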

### Preference Learning

Finally, all four datasets were combined for DPO training.
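As a rough sketch of what this combination step could look like with the TRL library (the dataset paths, SFT checkpoint name, and hyperparameters below are placeholders, not the project's actual configuration; argument names also vary slightly across TRL versions):

```python
from datasets import concatenate_datasets, load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

sft_checkpoint = "path/to/kyara-sft"  # placeholder for the SFT model
model = AutoModelForCausalLM.from_pretrained(sft_checkpoint)
tokenizer = AutoTokenizer.from_pretrained(sft_checkpoint)

# Combine the preference sets; each file is expected to provide
# "prompt" / "chosen" / "rejected" columns.
preference_files = ["dpo_part1.jsonl", "dpo_part2.jsonl",
                    "dpo_part3.jsonl", "dpo_part4.jsonl"]  # placeholder names
parts = [load_dataset("json", data_files=f, split="train") for f in preference_files]
train_dataset = concatenate_datasets(parts)

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="kyara-dpo", beta=0.1),  # beta: DPO temperature
    train_dataset=train_dataset,
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
trainer.train()
```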

## Feature

### Retrieval Augmented Generation (Experimental)

Benefiting from Kyara's training method, we incorporated RAG-related content during the SFT phase. You can refer to the following examples to construct task templates:

#### Input

```
# Reference Document
<reference>
<document>
Document ID: id_27025b13
* Document Title: Flash_memory
* Document Text:
Another limitation of flash memory is its limited number of erase cycles (most commercial SLC flash memory guarantees around 100,000 erase cycles for the "0" zone, but due to manufacturing precision, other blocks are not guaranteed, and some might even have factory defects and be unusable). This limitation is partly offset by firmware or file systems that calculate write counts and perform dynamic remapping to distribute writes across different blocks; this technique is called wear leveling. Another method is known as Bad Block Management (BBM), where blocks are dynamically tested during write operations, and failed blocks are discarded. For most mobile devices, these wear management techniques can extend the life of internal flash memory (sometimes even beyond the device's lifespan). Moreover, partial data loss in these devices may be acceptable. However, for high-reliability data storage applications that require heavy data write cycles, flash memory is not recommended. But this limitation does not apply to read-only applications, such as routers and thin clients, which often only write once or a few times throughout their lifespan.

### Read Disturbance
</document>
<document>
Document ID: id_858b1787
* Document Title: Flash_memory
* Document Text:
* TLC NAND flash memory typically has an endurance of around 1,000 or more cycles (Samsung 840); using multi-layer structures and adopting LDPC correction have extended the endurance.
* QLC NAND flash memory can have an endurance ranging from 500 to 1,000 cycles.
* SLC floating-gate NOR flash memory typically has a write endurance of 100,000 to 1,000,000 cycles (Numonyx M58BW 100k; Spansion S29CD016J 1,000k).
* MLC floating-gate NOR flash memory usually has a write endurance of 100,000 cycles (Numonyx J3 flash).

These values are approximate and depend on the technology and positioning of different manufacturers' products. Finer process technologies can improve read/write performance and capacity, but they may also pose greater challenges in terms of write endurance. Specific algorithms and design examples, such as wear leveling and memory over-provisioning, can be used to adjust storage system endurance to meet specific needs. Wear leveling is essential for ensuring the lifespan of flash memory products, and it is supported in products like USB flash drives and SSDs.

## Flash Memory File Systems
</document>
<document>
Document ID: id_df34eb65
* Document Title: Memory_over-provisioning
* Document Text:
## Basic SSD Operations

Due to the nature of flash memory operations, data cannot be overwritten directly like in hard drives. When data is first written to an SSD, the cells are in an erased state, so the data can be written directly, one page at a time (usually 4 to 8 KB in size). The SSD controller, which manages the flash memory and interfaces with the main control system, uses a logical-to-physical mapping system called Logical Block Addressing (LBA), part of the flash translation layer (FTL). When new data needs to replace old data, the SSD controller writes the new data to a new location and updates the logical mapping to point to the new physical location. The original data becomes invalid and must be erased before it can be rewritten.

Flash memory has a limited number of program/erase (P/E) cycles. Typically, this is expressed as the maximum number of P/E cycles that flash memory can endure over its lifetime. Single-level cell (SLC) flash memory is generally designed for high performance and long life, typically supporting 50,000 to 100,000 cycles. As of 2011, multi-level cell (MLC) flash memory, designed for low-cost applications, has far fewer cycles, usually only 3,000 to 5,000 cycles. Since 2013, triple-level cell (TLC) flash memory has been introduced, with P/E cycles dropping to around 1,000. The lower the write amplification, the better, as it corresponds to fewer P/E cycles, which extends the lifespan of the SSD.
</document>
</reference>

---

# Task Description

Please refer to the content in the <reference> above and answer the user's question.

When answering, pay attention to the following points:

1. Appropriately cite the reference document ID to support your argument. For example:

If we have Article 1 with ID: dq945509 and Article 2 with ID: 6a178c5, you can cite as follows: "This is a factual statement"【dq945509】 and "This is another factual statement"【6a178c5】.

2. If you find that the reference document lacks relevant information to the user's question, output: "The reference documents lack information relevant to the question." Do not fabricate unrelated answers.

# User's Question

How does the write limit of memory cards affect their lifespan? What techniques can extend the lifespan of memory cards in scenarios with frequent writes?
```

#### Output

```
The write limit of memory cards, particularly flash memory, significantly affects their lifespan due to the limited number of program/erase (P/E) cycles they can endure. For instance, single-level cell (SLC) flash memory, which is designed for high performance and long life, typically supports 50,000 to 100,000 cycles【id_df34eb65】. In contrast, multi-level cell (MLC) flash memory, which is more cost-effective, has fewer cycles, usually around 3,000 to 5,000 cycles【id_df34eb65】. As of 2013, triple-level cell (TLC) flash memory has even lower endurance, with around 1,000 cycles【id_df34eb65】.

To mitigate the impact of these limitations, several techniques can extend the lifespan of memory cards in scenarios with frequent writes:

1. **Wear Leveling**: This technique involves distributing writes across different blocks to prevent any single block from being overworked. Wear leveling is crucial for ensuring the lifespan of flash memory products, and it is supported in products like USB flash drives and SSDs【id_858b1787】.

2. **Bad Block Management (BBM)**: In this method, blocks are dynamically tested during write operations, and failed blocks are discarded. This ensures that only healthy blocks are used for data storage, thereby extending the overall lifespan of the memory card【id_27025b13】.

3. **Memory Over-Provisioning**: This involves allocating extra storage space beyond the actual capacity of the flash memory to account for wear and tear. This can help to maintain performance and extend the lifespan of the memory card by allowing for more writes before reaching the physical limits of the memory cells【id_df34eb65】.

These techniques are particularly important for mobile devices and applications that require frequent data writes, such as routers and thin clients, where the partial data loss in these devices may be acceptable【id_27025b13】.
```

Please refer to [Kyara-RAG](https://github.com/zake7749/kyara-rag) for more details.
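For instance, a prompt following the template above could be assembled and sent to the model roughly as follows. This is only a sketch: the checkpoint name is a placeholder, and `build_rag_prompt` and its document fields are illustrative helpers, not part of Kyara-RAG.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

def build_rag_prompt(documents, question):
    """Format reference documents and a user question into the RAG template shown above."""
    refs = "\n".join(
        "<document>\n"
        f"Document ID: {doc['id']}\n"
        f"* Document Title: {doc['title']}\n"
        f"* Document Text:\n{doc['text']}\n"
        "</document>"
        for doc in documents
    )
    return (
        "# Reference Document\n<reference>\n" + refs + "\n</reference>\n\n---\n\n"
        "# Task Description\n\n"
        "Please refer to the content in the <reference> above and answer the user's question.\n\n"
        "# User's Question\n\n" + question
    )

model_id = "zake7749/kyara-2b-it"  # placeholder; substitute the released checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto")

prompt = build_rag_prompt(
    [{"id": "id_27025b13", "title": "Flash_memory", "text": "..."}],
    "How does the write limit of memory cards affect their lifespan?",
)
input_ids = tokenizer.apply_chat_template(
    [{"role": "user", "content": prompt}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)
output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```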

## Limitation

Like most models, Kyara still suffers from hallucinations, and, possibly due to the nature of its training data, it is particularly fond of citing references when answering questions. For example, if you ask the model to explain a line from "Legend of Mortal":

```
該如何理解:「俠者,不受名利所牽,但求俯仰無愧。正邪存乎在我,聲名留問諸君。非我桀傲不馴,乃因唐門薪火,自古如今!」
```

However, the model would respond that this quote is from "The Legend of the Condor Heroes."

```
這句話出自《神雕俠侶》中的《神雕手記》,是楊過對自己處世態度的一種表達。讓我們逐句拆解理解:

1. **「俠者,不受名利所牽,但求俯仰無愧。」**
- **俠者**:指俠義之士,講究義氣和正義的英雄。
- **不受名利所牽**:不受外在名利和利益的影響,保持自我的正直。
- **但求俯仰無愧**:只求自己行動的坦蕩和內心的清淨,不求外界的認可或評價。

2. **「正邪存乎在我,聲名留問諸君。」**
- **正邪存乎在我**:善惡的歸屬在於自己,決策和行動的結果在於自己。
- **聲名留問諸君**:聲名是別人會問津的,但自己對正義的追求和內心的正直,才是最值得自豪和珍視的。

3. **「非我桀傲不馴,乃因唐門薪火,自古如今!」**
- **非我桀傲不馴**:並不是我桀驁不馴(驕傲自負),而是因唐門的榮耀和傳承,我必須遵循這些責任。
- **乃因唐門薪火**:是因為唐門(一種武學門派)的薪火相傳,這是一份深重的責任。
- **自古如今**:自古至今,俠義精神和門派的責任感,一直是人們所崇敬的,這是歷史和傳統的延續。

總結起來,這段話表達了楊過對於自己行為的獨特理解和自豪感。他明白自己的行為和價值觀取決於個人的內心和對正義的追求,而非外界的評價和名利。他也承認了自己的責任,作為唐門下一代,必須繼承和發揚門風,這一點是無可替代的。
```

It is recommended to exercise caution when using language models.