---
library_name: transformers
language:
- zh
---

# Kyara: Knowledge Yielding Adaptive Retrieval Augmentation for LLM Fine-tuning

<p align="left">
    🤗 <a href="https://huggingface.co/zake7749/gemma-2-2b-it-chinese-kyara-dpo">Hugging Face</a>  | 🚀<a href="https://github.com/zake7749/kyara">Github</a>  |  📑 <a href="#">Paper</a>  |  📖 <a href="#">English</a>  |  📖 <a href="https://github.com/zake7749/kyara">Chinese</a>
</p>
<div style="text-align: center;">
  <img src="https://i.imgur.com/QiWlcYJ.jpeg" alt="kyara"/>
</div>

Kyara is an experimental strategy for fine-tuning language models, designed to enhance a model's knowledge adaptation and language understanding capabilities through knowledge retrieval augmentation.

To validate the effectiveness of this method, we conducted full-parameter fine-tuning on `Gemma-2-2b-it`, resulting in the first version of the Kyara model. Preliminary evaluation results can be seen in the [Benchmark](#benchmark) section.

## Table of Contents

### **General Benchmark**

All evaluations are zero-shot.

| Metric | Kyara-2b-it | Gemma-2-2b-it |
|--------------------------|----------|-------------|
| **[TMMLUPlus](https://huggingface.co/datasets/ikala/tmmluplus)** | **39.22** | 36.73 |
|  - STEM | **40.86** | 37.84 |
| **[CRUX](https://github.com/yuchenlin/ZeroEval)** | **21.25** | 20.75 |
| **Chinese-Reason-Bench** | **3.97** | 3.44 |

The aggregation method for TMMLUPlus's groups is macro average, following the practice in the official implementation.
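
For instance, a group's macro average weights each subject equally, regardless of how many questions each subject contains; a small illustration with made-up numbers:

```python
# Macro average: each subject in a group contributes equally,
# regardless of its question count. The numbers here are made up.
stem_subjects = {"physics": 40.0, "chemistry": 39.5, "biology": 43.1}
stem_score = sum(stem_subjects.values()) / len(stem_subjects)
print(f"STEM (macro average): {stem_score:.2f}")  # 40.87
```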

### **Alignment Benchmark**

| Metric | Kyara | Gemma-2-2b-it |
|--------------------------|----------|-------------|
| **[AlpacaEval-LC](https://github.com/tatsu-lab/alpaca_eval)** | **41.15** | 32.37 |
| **[AlpacaEval](https://github.com/tatsu-lab/alpaca_eval)** | **43.95** | 32.94 |

## Features

Compared to Gemma-2-2b-it, we have introduced the following additional features:

### System Prompt

Kyara supports system prompts, as in the following example:

#### Input

```
System: 你是一個愛賣弄知識的人,總是喜歡引用典故,把簡單的事情說得非常複雜。
User: 什麼是大型語言模型?
```

#### Output

```
大型語言模型(Large Language Model,簡稱LLM)是一種基於深度學習技術的自然語言處理(NLP)模型。這些模型通過大量文本數據進行訓練,以學習和生成人類語言。具體來說,它們屬於一種被稱為「變換器」(Transformer)的架構。

變換器最初由Vaswani等人在2017年的論文《Attention is All You Need》中提出,這一架構因其在處理序列數據上的卓越性能而迅速風靡。變換器的核心思想是利用「自注意力機制」(Self-Attention Mechanism),這是一種能夠捕捉文本中不同單詞之間長距離依賴關係的方法。

大型語言模型通常具有以下幾個特徵:

1. **訓練數據量巨大**:這些模型通常需要數百GB甚至TB級的訓練數據,以捕捉語言的豐富性和多樣性。

2. **模型規模龐大**:這意味著模型包含大量參數,常見的規模包括幾十億到數萬億個參數。這些參數的數量決定了模型的複雜度和生成文本的質量。

3. **多語言能力**:現代的LLM如GPT-3、BERT等,不僅能夠處理多種語言,還能夠在不同語言之間轉換和理解上下文。

4. **應用廣泛**:這些模型在文本生成、情感分析、問答系統、機器翻譯、摘要生成等多個NLP任務中都有廣泛應用。

例如,OpenAI的GPT-3包含1750億個參數,這使其成為一個極其強大的語言生成模型。BERT則以其雙向編碼器和掩碼語言建模技術,在多種下游任務中取得了卓越的成績。

總結來說,大型語言模型通過大量數據訓練,利用複雜的深度學習架構,能夠生成高質量的自然語言文本,從而在各種NLP應用中發揮重要作用。
```
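
For reference, here is a minimal usage sketch with `transformers`, assuming the checkpoint's chat template accepts the `system` role shown above:

```python
# A minimal sketch, not an official usage guide: querying Kyara with a
# system prompt through the transformers chat-template API. We assume the
# checkpoint's chat template accepts the "system" role used above.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "zake7749/gemma-2-2b-it-chinese-kyara-dpo"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [
    {"role": "system", "content": "你是一個愛賣弄知識的人,總是喜歡引用典故,把簡單的事情說得非常複雜。"},
    {"role": "user", "content": "什麼是大型語言模型?"},
]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
output = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```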

#### Input

```
System: 你是一名小學老師,會用五歲小孩也能懂的口吻回答使用者的問題
User: 什麼是大型語言模型?
```

#### Output

### Retrieval Augmented Generation (Experimental)

Benefiting from Kyara's training method, we incorporated RAG-related content during the SFT phase. You can refer to the following examples to construct task templates:

#### Input

#### Output

```
因此,為了維護呼吸道健康,應儘量通過鼻子進行呼吸,以減少上述問題的發生。
```
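
As a rough illustration, retrieved passages might be assembled into a task template as follows; the field names and layout here are our assumptions for illustration, not Kyara's verbatim training format:

```python
# Hypothetical sketch: assembling retrieved passages into a RAG prompt.
# The layout below is an illustrative assumption, not the exact template
# Kyara was trained on.
def build_rag_prompt(question: str, passages: list[str]) -> str:
    # Number each retrieved passage so the model can refer back to them.
    references = "\n\n".join(
        f"[{i + 1}] {passage}" for i, passage in enumerate(passages)
    )
    return (
        "請根據以下參考資料回答問題。\n\n"
        f"參考資料:\n{references}\n\n"
        f"問題:{question}"
    )

print(build_rag_prompt("為什麼用鼻子呼吸比較好?", ["鼻腔可以過濾與加濕吸入的空氣。"]))
```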

Please refer to [Kyara-RAG](https://github.com/zake7749/kyara-rag) for more details.

## Method

The following sections provide a brief summary of Kyara's implementation strategy.

### Dataset Summary

We have collected a total of 2.6M conversations, approximately 3.54 billion tokens. The following provides an overview of the language distribution and conversation rounds.

* Language:

<img src="https://i.imgur.com/KvVjti4.png" alt="language-distribution" width="500"/>

* Conversation Rounds:

<img src="https://i.imgur.com/dekAnU0.png" alt="conv-round-distribution" width="500"/>

#### Base Dataset: Knowledge Injection with Retrieval Augmentation

We developed a knowledge search system using open Chinese knowledge corpora, integrated with [QDrant](https://qdrant.tech/). To construct Supervised Fine-Tuning (SFT) pairs, we followed this process (a code sketch follows the list):

1. Sample documents from the knowledge base and generate knowledge-intensive questions that users might ask based on these texts.
2. (Optional) Increase instruction complexity using [Evol-Instruct](https://arxiv.org/pdf/2304.12244).
3. Apply query expansion on the generated instructions to retrieve an additional Top K documents and individually assess their relevance:
    * For relevant documents, use an LLM to summarize key information related to the question.
    * For irrelevant documents, ignore them.
4. Let the LLM generate a detailed and comprehensive response according to the original document and the K supplementary references.
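
A condensed sketch of steps 3 and 4, assuming a populated QDrant collection and a sentence-embedding model; the collection name, embedding model, and the `llm()` helper are illustrative stand-ins, not our actual configuration:

```python
# Condensed sketch of steps 3-4. Assumes a populated QDrant collection
# ("zh_knowledge") and a sentence-embedding model; llm() is a hypothetical
# helper standing in for the actual LLM calls.
from qdrant_client import QdrantClient
from sentence_transformers import SentenceTransformer

client = QdrantClient(url="http://localhost:6333")
encoder = SentenceTransformer("intfloat/multilingual-e5-base")

def build_sft_response(question: str, source_doc: str, top_k: int = 5) -> str:
    # Step 3: retrieve Top K supplementary documents for the instruction.
    hits = client.search(
        collection_name="zh_knowledge",
        query_vector=encoder.encode(question).tolist(),
        limit=top_k,
    )
    references = []
    for hit in hits:
        passage = hit.payload["text"]
        # Judge relevance; summarize relevant passages, ignore the rest.
        if llm(f"這段文字與問題「{question}」相關嗎?只答是或否。\n{passage}") == "是":
            references.append(llm(f"請摘要與問題「{question}」相關的重點:\n{passage}"))
    # Step 4: answer from the original document plus the K summaries.
    context = "\n\n".join([source_doc, *references])
    return llm(f"請根據以下資料,詳盡且全面地回答問題。\n\n{context}\n\n問題:{question}")
```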

In addition, we also ask the LLM to generate a user prompt for high-quality documents, and pair the (generated prompt, original document) as an SFT example.

##### Chinese Math Dataset

* Dataset: [zake7749/kyara-chinese-math-sft-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-math-sft-s0-30K)

While the aforementioned strategy can generate a wide range of knowledge-based texts, it primarily falls within the scope of information-seeking tasks and is not very effective in constructing mathematical and reasoning-related content. To address this, we generated 50,000 math problems based on [PersonaHub](https://huggingface.co/datasets/proj-persona/PersonaHub). We then used `Gemini-1.5-Flash` to filter out data with obvious errors in calculation and reasoning, thereby creating [kyara-chinese-math-sft-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-math-sft-s0-30K).
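
A rough sketch of this generate-then-filter loop, where `generate()` and `judge()` are hypothetical stand-ins for the persona-conditioned generator and the `Gemini-1.5-Flash` filter:

```python
# Hypothetical sketch of persona-driven generation and filtering.
# generate() and judge() stand in for the actual LLM calls.
def build_math_dataset(personas: list[str]) -> list[dict]:
    kept = []
    for persona in personas:
        problem = generate(f"以這個角色的視角出一道數學應用題並給出詳細解答:{persona}")
        # Keep only problems whose calculation and reasoning pass the check.
        verdict = judge(f"檢查以下題目與解答是否有計算或推理錯誤,只答「正確」或「錯誤」:\n{problem}")
        if verdict == "正確":
            kept.append({"persona": persona, "problem": problem})
    return kept
```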

#### High Quality Dataset: Model Refinement

After completing supervised learning on the base dataset, we fine-tune the LLM again on a high-quality subset, primarily to address the following three issues:

1. Some responses in the Base Dataset were generated by smaller models, which sometimes performed poorly at following instructions.
2. We used various LLMs in the previous step to introduce knowledge diversity and language adaptability. However, we discovered subtle differences in response templates and reasoning approaches between different LLMs, leading to occasional instability in the trained Chat Model. Therefore, we introduced a high-quality small dataset, using a single strong LLM to generate QA pairs.
3. The Base Dataset includes some Q&A pairs composed of generated queries and original documents. While these data are rich in knowledge, they are relatively weak at instruction following.

To balance data diversity and quality, we adopted a strategy similar to [InsTag](https://arxiv.org/abs/2308.07074) to classify the data. We then used [ArmoRM](https://huggingface.co/RLHFlow/ArmoRM-Llama3-8B-v0.1) and an LLM judge to evaluate data quality, finally extracting the best training data from each category to create a Stage 1 Dataset of about 200K examples, which was used to fine-tune the Kyara-SFT model again.
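
A minimal sketch of the per-category selection, assuming each example already carries an InsTag-style `tag` and a scalar `score` standing in for the ArmoRM / LLM-judge rating:

```python
# Minimal sketch of the per-category selection step. Assumes each example
# already carries an InsTag-style "tag" and a scalar "score" standing in
# for the ArmoRM / LLM-judge quality rating.
from collections import defaultdict

def select_stage1(examples: list[dict], per_tag: int) -> list[dict]:
    by_tag: dict[str, list[dict]] = defaultdict(list)
    for ex in examples:
        by_tag[ex["tag"]].append(ex)
    selected = []
    for group in by_tag.values():
        # Keep the highest-scoring examples within each category.
        group.sort(key=lambda ex: ex["score"], reverse=True)
        selected.extend(group[:per_tag])
    return selected
```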

### Preference Learning

We introduced preference learning in Kyara, which allows the model's responses to better align with human preferences while enhancing programming skills and mathematical reasoning abilities.

Kyara's preference learning strategy utilizes [Direct Preference Optimization (DPO)](https://arxiv.org/abs/2305.18290), integrating two custom-built Chinese datasets alongside two English datasets:

* [argilla/ultrafeedback-binarized-preferences](https://huggingface.co/datasets/argilla/ultrafeedback-binarized-preferences)
* [xinlai/Math-Step-DPO-10K](https://huggingface.co/datasets/xinlai/Math-Step-DPO-10K)

Here, we summarize the construction strategy of the Chinese datasets.

#### Chinese DPO

##### [SPIN/SPPO](https://github.com/uclaml/SPIN)

We followed the original design, using Kyara-SFT to generate a set of contrastive data for the High Quality Dataset.
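
A minimal sketch of that pairing, following SPIN's convention that the reference answer is `chosen` and the current model's own generation is `rejected`; `kyara_generate()` is a hypothetical wrapper around the Kyara-SFT model:

```python
# Minimal sketch of SPIN-style pair construction: the high-quality
# reference answer is "chosen" and the current model's own generation is
# "rejected". kyara_generate() is a hypothetical helper.
def build_spin_pairs(dataset: list[dict]) -> list[dict]:
    pairs = []
    for example in dataset:
        pairs.append({
            "prompt": example["prompt"],
            "chosen": example["response"],                  # reference answer
            "rejected": kyara_generate(example["prompt"]),  # self-generated
        })
    return pairs
```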

##### RLAIF

Dataset: [zake7749/kyara-chinese-preference-dpo-s0-30K](https://huggingface.co/datasets/zake7749/kyara-chinese-preference-dpo-s0-30K)

We extracted Chinese prompts from `Magpie-Align/Magpie-Qwen2-Pro-200K-Chinese`, `hfl/stem_zh_instruction`, and `FreedomIntelligence/Evol-Instruct-Chinese-GPT4`, and distributed the same prompt to four different LLMs. The competitors include:

* GPT-4o
* GPT-4-0618
* Qwen2-72B-Instruct
* DeepSeek V2

After generating the responses, we asked the LLMs to judge which one was better, using the following prompt:

```
**[Task]**

---
```

Finally, all four datasets were combined for DPO training.
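
As an illustration, a minimal DPO training sketch with a recent version of [trl](https://github.com/huggingface/trl); the base model, dataset mixture, and hyperparameters below are placeholders rather than Kyara's actual recipe:

```python
# A minimal DPO training sketch with trl (recent versions). The base model,
# dataset mixture, and hyperparameters are placeholders, not Kyara's actual
# recipe; we assume the dataset provides prompt/chosen/rejected columns.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_name = "google/gemma-2-2b-it"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# One of the four preference sets; the others are mixed in the same way.
dataset = load_dataset("zake7749/kyara-chinese-preference-dpo-s0-30K", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="kyara-dpo", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,
)
trainer.train()
```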

## Limitation

Like most models, Kyara still suffers from hallucinations and, possibly due to the nature of its training data, particularly enjoys quoting references when answering questions. For example, if you ask the model to explain a line from "Legend of Mortal":

```
如何解釋「俠者,不受名利所牽,但求俯仰無愧。正邪存乎在我,聲名留問諸君。非我桀傲不馴,乃因唐門薪火,自古如今!」
```

The model would respond that this quote is from "The Legend of the Condor Heroes", even though the line actually comes from "Legend of Mortal":

```
這段話出自《射鵰英雄傳》中的唐門內部經典,由郭靖和黃蓉在唐門大殿上宣讀。這段話主要表達了唐門門人對於俠義精神和內在修為的追求,以及他們對外界的名利和聲名的不屑。以下是詳細解釋:

總結來說,這段話強調了唐門門人對名利和聲名的不屑,他們追求的是內在的修為和道德,不為外界的評價所動搖。這也體現了他們俠義精神和對正義的堅持。
```

It is recommended to exercise caution when using language models.