---
license: cc-by-nc-sa-4.0
datasets:
- Cartinoe5930/KoRAE_filtered_12k
language:
- ko
library_name: transformers
---

## KoRAE

<p align="center"><img src="https://cdn-uploads.huggingface.co/production/uploads/63e087b6a98d931aa90c1b9c/XQ-pNzRDRccd7UFgYDOrx.png" width="300" height="300"></p>

We introduce **KoRAE**, a model fine-tuned on a filtered, high-quality Korean dataset.

**KoRAE** is the result of combining high-quality data, selected with a dedicated data filtering method, with a Korean Llama-2 whose vocabulary was extended with Korean tokens.
We applied the data filtering method introduced in [AlpaGasus](https://arxiv.org/abs/2307.08701) to a mixture of several Korean datasets (OpenOrca-KO, KOpen-Platypus, KoCoT_2000, databricks-dolly-15k-ko) to select high-quality examples.
We then fine-tuned [Korean Llama-2](https://huggingface.co/beomi/llama-2-koen-13b), released by [@beomi](https://huggingface.co/beomi), on the filtered dataset.
Flash-Attention 2 and LoRA were used for efficient fine-tuning.
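
To make the filtering step concrete, the sketch below shows the general shape of AlpaGasus-style filtering: a judge model assigns each example a quality rating, and only examples at or above a threshold are kept. The `rating` field and the threshold value are illustrative assumptions, not the exact KoRAE configuration; see the GitHub repository for the actual pipeline.

```python
# Sketch of AlpaGasus-style score filtering (illustrative, not the exact KoRAE pipeline).
# Assumes each example already carries a judge-assigned "rating" field (hypothetical name).
from datasets import Dataset

def filter_by_rating(dataset: Dataset, threshold: float = 8.0) -> Dataset:
    """Keep only examples whose judge-assigned rating meets the threshold."""
    return dataset.filter(lambda example: example["rating"] >= threshold)

# Toy usage with an in-memory dataset standing in for the combined Korean mixture.
toy = Dataset.from_dict({"instruction": ["a", "b"], "output": ["x", "y"], "rating": [9.0, 4.5]})
print(filter_by_rating(toy))  # keeps only the example rated 9.0
```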

KoRAE will be submitted to the [Open Ko-LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard)!
Stay tuned for updates!

## Model Details

- **Developed by:** [Cartinoe5930](https://huggingface.co/Cartinoe5930)
- **Base model:** [beomi/llama-2-koen-13b](https://huggingface.co/beomi/llama-2-koen-13b)
- **Repository:** [gauss5930/KoRAE](https://github.com/gauss5930/KoRAE)

For more details, please check the GitHub repository!

## Training Details

- **Hardware:** We used an A100 80GB GPU for fine-tuning.
- **Training factors:** The [Transformers Trainer](https://huggingface.co/docs/transformers/main_classes/trainer) and [Hugging Face PEFT](https://huggingface.co/docs/peft/index) were used for fine-tuning; a sketch of this setup follows below.

For more details, please check the GitHub repository!
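
As a rough illustration of how the Transformers Trainer and PEFT fit together for LoRA fine-tuning, here is a minimal sketch. The base model and dataset names come from this card, but the LoRA settings, the `text` column, the sequence length, and the training arguments are placeholder assumptions; the actual training script and hyperparameters are in the GitHub repository.

```python
# Minimal LoRA fine-tuning sketch with Transformers Trainer + Hugging Face PEFT.
# Hyperparameters and the "text" column are placeholders, not the KoRAE training config.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base_model = "beomi/llama-2-koen-13b"
tokenizer = AutoTokenizer.from_pretrained(base_model)
# On recent transformers versions, Flash-Attention 2 can be enabled with
# attn_implementation="flash_attention_2" if the flash-attn package is installed.
model = AutoModelForCausalLM.from_pretrained(base_model, torch_dtype=torch.bfloat16)

# Wrap the base model with LoRA adapters (rank/alpha/target modules are illustrative).
lora_config = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)

# Tokenize the filtered KoRAE dataset (assuming a "text" column; check the dataset card).
dataset = load_dataset("Cartinoe5930/KoRAE_filtered_12k", split="train")
tokenized = dataset.map(
    lambda example: tokenizer(example["text"], truncation=True, max_length=2048),
    remove_columns=dataset.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="korae-lora", per_device_train_batch_size=1, num_train_epochs=1, bf16=True),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```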

## Training Dataset

KoRAE was fine-tuned on the KoRAE dataset, a filtered, high-quality Korean dataset.
The dataset is a combination of publicly available Korean datasets, to which the filtering method described above was applied.
For more information, please refer to the [dataset card](https://huggingface.co/datasets/Cartinoe5930/KoRAE_filtered_12k) of KoRAE.
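
If you would like to inspect the filtered data yourself, it can be loaded directly with the `datasets` library; the exact fields are best checked in the dataset card.

```python
# Load the filtered KoRAE dataset and peek at the first example.
from datasets import load_dataset

dataset = load_dataset("Cartinoe5930/KoRAE_filtered_12k", split="train")
print(dataset)     # number of rows and column names
print(dataset[0])  # first example; see the dataset card for field descriptions
```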

## Open Ko-LLM Leaderboard

Evaluation results will be added here once KoRAE is available on the [Open Ko-LLM Leaderboard](https://huggingface.co/spaces/upstage/open-ko-llm-leaderboard).

## Prompt Template

```
### System:
{system_prompt}

### User:
{instruction + input}

### Assistant:
{output}
```
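
For reference, here is a small helper that renders this template by hand, which may be useful if you are not going through the tokenizer's chat template. The default system prompt is taken from the usage example below; everything else is plain string formatting, so treat it as a convenience sketch rather than an official API.

```python
# Render the KoRAE prompt template manually (plain string formatting).
# English gloss of the default system prompt:
# "You are a helpful AI assistant. The user provides a task with some instructions.
#  Write a response that completes the request appropriately."
DEFAULT_SYSTEM_PROMPT = (
    "당신은 유용한 인공지능 비서입니다. 사용자가 몇 가지 지시가 포함된 작업을 제공합니다. "
    "요청을 적절히 완료하는 응답을 작성하세요."
)

def build_prompt(instruction: str, input_text: str = "", system_prompt: str = DEFAULT_SYSTEM_PROMPT) -> str:
    """Fill in the ### System / ### User / ### Assistant template shown above."""
    user_content = f"{instruction}\n{input_text}".strip()
    return f"### System:\n{system_prompt}\n\n### User:\n{user_content}\n\n### Assistant:\n"

# "Explain five ways to relieve stress."
print(build_prompt("스트레스를 해소하는 5가지 방법에 대해서 설명해줘."))
```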

## Usage Example

```python
# Use a pipeline as a high-level helper
from transformers import pipeline
import torch

pipe = pipeline("text-generation", model="Cartinoe5930/KoRAE-13b", torch_dtype=torch.bfloat16, device_map="auto")
messages = [
    {
        "role": "system",
        # "You are a helpful AI assistant. The user provides a task with some instructions.
        #  Write a response that completes the request appropriately."
        "content": "당신은 유용한 인공지능 비서입니다. 사용자가 몇 가지 지시가 포함된 작업을 제공합니다. 요청을 적절히 완료하는 응답을 작성하세요.",
    },
    # "Explain five ways to relieve stress."
    {"role": "user", "content": "스트레스를 해소하는 5가지 방법에 대해서 설명해줘."},
]

# Build the prompt with the model's chat template, then generate.
prompt = pipe.tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
outputs = pipe(prompt, max_new_tokens=256, do_sample=True, temperature=0.7, top_k=50, top_p=0.95)
print(outputs[0]["generated_text"])
```

## Citation

- [KO-Platypus](https://github.com/Marker-Inc-Korea/KO-Platypus)
- [Korean-OpenOrca](https://github.com/Marker-Inc-Korea/Korean-OpenOrca)

```
@inproceedings{lee2023kullm,
  title={KULLM: Learning to Construct Korean Instruction-following Large Language Models},
  author={Lee, SeungJun and Lee, Taemin and Lee, Jeongwoo and Jang, Yoona and Lim, Heuiseok},
  booktitle={Annual Conference on Human and Language Technology},
  pages={196--202},
  year={2023},
  organization={Human and Language Technology}
}
```

```
@misc{chen2023alpagasus,
  title={AlpaGasus: Training A Better Alpaca with Fewer Data},
  author={Lichang Chen and Shiyang Li and Jun Yan and Hai Wang and Kalpa Gunaratna and Vikas Yadav and Zheng Tang and Vijay Srinivasan and Tianyi Zhou and Heng Huang and Hongxia Jin},
  year={2023},
  eprint={2307.08701},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
```

```
@misc{l._junbum_2023,
  author = {{L. Junbum, Taekyoon Choi}},
  title = {llama-2-koen-13b},
  year = 2023,
  url = {https://huggingface.co/beomi/llama-2-koen-13b},
  doi = {10.57967/hf/1280},
  publisher = {Hugging Face}
}
```