---
license: cc-by-nc-sa-4.0
datasets:
- Cartinoe5930/KoRAE_filtered_12k
language:
- ko
library_name: transformers
---

## KoRAE

<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/63e087b6a98d931aa90c1b9c/XQ-pNzRDRccd7UFgYDOrx.png" width="300" height="300"></p>

We introduce **KoRAE**, a model finetuned on a filtered, high-quality Korean dataset.

**KoRAE** combines high-quality data, selected with a dedicated data filtering method, with a Korean Llama-2 model whose vocabulary was extended with Korean tokens.
We applied the data filtering method introduced in [AlpaGasus](https://arxiv.org/abs/2307.08701) to select high-quality data from a mixture of several Korean datasets (OpenOrca-KO, KOpen-Platypus, KoCoT_2000, databricks-dolly-15k-ko).
We finetuned [Korean Llama-2](https://huggingface.co/beomi/llama-2-koen-13b), introduced by [@beomi](https://huggingface.co/beomi), on the filtered dataset.
Flash-Attention 2 and LoRA were used for efficient finetuning.
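
The filtering idea described above can be sketched roughly as follows. This is a minimal illustration, not the actual KoRAE pipeline: it assumes each example already carries a quality `score` assigned by a rater model (as in AlpaGasus), and the field names, sample data, and the threshold value `4.5` are hypothetical placeholders.

```python
from typing import Dict, List


def filter_by_score(examples: List[Dict], threshold: float) -> List[Dict]:
    """Keep only the examples whose quality score meets the threshold."""
    return [ex for ex in examples if ex["score"] >= threshold]


# Hypothetical, pre-scored examples (the scoring step itself is not shown).
examples = [
    {"instruction": "Explain LoRA briefly.", "output": "LoRA adds low-rank adapters.", "score": 4.8},
    {"instruction": "Say hi.", "output": "asdf", "score": 1.5},
    {"instruction": "Translate 'hello' to Korean.", "output": "annyeonghaseyo", "score": 4.5},
]

# The threshold is an assumed value for illustration only.
kept = filter_by_score(examples, threshold=4.5)
print(len(kept), "of", len(examples), "examples kept")
```

In practice the threshold trades dataset size against quality, so it would need to be tuned for the datasets being mixed.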

The findings from KoRAE are as follows:

1. Finetuning for some epochs showed that high-quality filtered data has a positive effect on model performance. However, when finetuning for only a few epochs, the quantity of data matters more than its quality. This seems to be due to the limited performance of the Korean base model, so research on improving Korean base models must continue.
2. The model trained with DPO showed the best performance among the KoRAE variants, which suggests that DPO is effective for Korean LLMs.
3. The model finetuned on the filtered, high-quality KoRAE dataset performed better than the one finetuned without filtering. Therefore, to build better LLMs, we should finetune on high-quality data.

## Model Details

- **Developed by:** [Cartinoe5930](https://huggingface.co/Cartinoe5930)
- **Base model:** [beomi/llama-2-koen-13b](https://huggingface.co/beomi/llama-2-koen-13b)
- **Repository:** [gauss5930/KoRAE](https://github.com/gauss5930/KoRAE)

For more details, please check the GitHub Repository!

## Training Details

- **Hardware:** We used A100 80G for finetuning.
- **Training factors:** The [Transformers Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) and [Hugging Face PEFT](https://huggingface.co/docs/peft/index) were used for finetuning.
- **Training details:** Supervised finetuning for 3 epochs on the [filtered KoRAE](https://huggingface.co/datasets/Cartinoe5930/KoRAE_filtered_12k) dataset.

For more details, please check the GitHub Repository!

## Training Dataset

KoRAE was finetuned on the filtered, high-quality KoRAE dataset.
This dataset combines publicly available Korean datasets, with a filtering method applied to the combined result.
For more information, please refer to the [dataset card](https://huggingface.co/datasets/Cartinoe5930/KoRAE_filtered_12k) of KoRAE.

## Open Ko-LLM Leaderboard

|Model|Average|Ko-ARC|Ko-HellaSwag|Ko-MMLU|Ko-TruthfulQA|Ko-CommonGen V2|
|---|---|---|---|---|---|---|
|KoRAE-13b|48.64|46.33|57.25|42.8|41.08|55.73|
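
As a quick sanity check, the Average column appears to be the arithmetic mean of the five task scores. The snippet below recomputes it from the values in the table; treating the average as an unweighted mean is an assumption, though it reproduces the reported figure.

```python
# Task scores for KoRAE-13b, copied from the table above.
scores = {
    "Ko-ARC": 46.33,
    "Ko-HellaSwag": 57.25,
    "Ko-MMLU": 42.8,
    "Ko-TruthfulQA": 41.08,
    "Ko-CommonGen V2": 55.73,
}

# Unweighted mean, rounded to two decimals like the leaderboard column.
average = round(sum(scores.values()) / len(scores), 2)
print(average)  # 48.64
```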

## Prompt Template

```
### System:
{system_prompt}

### User:
{instruction + input}

### Assistant:
{output}
```
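
The template above can be filled in with a small helper like the one below. This is a sketch only: `build_prompt` is a hypothetical function name, and joining the instruction and input with a newline is an assumption, since the template only states `{instruction + input}`. The assistant turn is left empty so the model can generate it.

```python
def build_prompt(system_prompt: str, instruction: str, input_text: str = "") -> str:
    """Fill the prompt template, leaving the assistant turn open for generation."""
    # Assumption: instruction and input are joined by a newline when input is given.
    user_turn = f"{instruction}\n{input_text}" if input_text else instruction
    return (
        f"### System:\n{system_prompt}\n\n"
        f"### User:\n{user_turn}\n\n"
        "### Assistant:\n"
    )


prompt = build_prompt(
    "You are a helpful AI assistant.",
    "Summarize the following sentence.",
    "KoRAE is a Korean language model.",
)
print(prompt)
```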

## Usage example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline
import torch

pipe = pipeline("text-generation", model="Cartinoe5930/KoRAE-13b", torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {
        "role": "system",
        # "You are a helpful AI assistant. The user gives a task with some instructions.
        # Write a response that appropriately completes the request."
        "content": "당신은 유용한 인공지능 비서입니다. 사용자가 몇 가지 지시가 포함된 작업을 제공합니다. 요청을 적절히 완료하는 응답을 작성하세요.",
    },
    # "Explain 5 ways to relieve stress."
    {"role": "user", "content": "스트레스를 해소하는 5가지 방법에 대해서 설명해줘."}
]

prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## Citation

- [KO-Platypus](https://github.com/Marker-Inc-Korea/KO-Platypus)
- [Korean-OpenOrca](https://github.com/Marker-Inc-Korea/Korean-OpenOrca)

```
@inproceedings{lee2023kullm,
  title={KULLM: Learning to Construct Korean Instruction-following Large Language Models},
  author={Lee, SeungJun and Lee, Taemin and Lee, Jeongwoo and Jang, Yoona and Lim, Heuiseok},
  booktitle={Annual Conference on Human and Language Technology},
  pages={196--202},
  year={2023},
  organization={Human and Language Technology}
}
```

```
@misc{chen2023alpagasus,
      title={AlpaGasus: Training A Better Alpaca with Fewer Data}, 
      author={Lichang Chen and Shiyang Li and Jun Yan and Hai Wang and Kalpa Gunaratna and Vikas Yadav and Zheng Tang and Vijay Srinivasan and Tianyi Zhou and Heng Huang and Hongxia Jin},
      year={2023},
      eprint={2307.08701},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```

```
@misc {l._junbum_2023,
    author       = { {L. Junbum, Taekyoon Choi} },
    title        = { llama-2-koen-13b },
    year         = 2023,
    url          = { https://huggingface.co/beomi/llama-2-koen-13b },
    doi          = { 10.57967/hf/1280 },
    publisher    = { Hugging Face }
}
```