---
license: apache-2.0
pipeline_tag: text-generation
language:
- ko
---

# Kor-Gemma-2B

> Update @ 2024.05.10: First release of gemma-ko

This model card corresponds to the 2B-it version of the **Gemma-Ko** model.

**Resources and Technical Documentation**:

* [Original Gemma-2b-it](https://huggingface.co/google/gemma-2b-it)

**Citation**

```bibtex
@misc{gemma-summary-v01,
  author    = {{frcp, nebchi, pepperonipizza}},
  title     = {gemma-summary-v01},
  year      = {2024},
  url       = {https://huggingface.co/cpm-ai/gemma-ko-v01},
  publisher = {Hugging Face}
}
```

**Model Developers**: frcp, nebchi, pepperonipizza

## Model Information

The model was trained on a dataset of 363,000 Korean text samples.

### Description
It was trained on a larger volume of Korean tokens than comparable LLMs, enabling it to generate high-quality Korean text.
It also achieves improved performance with less training data than other LLMs.

#### Running the model on a single / multi GPU

```python
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, pipeline

tokenizer = AutoTokenizer.from_pretrained("cpm-ai/gemma-ko-v01")
model = AutoModelForCausalLM.from_pretrained("cpm-ai/gemma-ko-v01", device_map="auto")

# Stream tokens to stdout as they are generated, and wrap the model
# in a text-generation pipeline used below as `pipe_finetuned`.
streamer = TextStreamer(tokenizer, skip_prompt=True)
pipe_finetuned = pipeline("text-generation", model=model, tokenizer=tokenizer)

prompt = """μš”μ•½ ν•  λ¬Έμž₯ :
[μ•ˆλ…•ν•˜μ„Έμš” 생방솑 ν† λ‘ μΉ΄νŽ˜μž…λ‹ˆλ‹€.
였늘 μ„±νƒ„μ ˆ μ „μ•Ό μƒλ°©μ†‘μœΌλ‘œ 진행해 λ“œλ¦¬κ³  μžˆλŠ”λ°μš”.
νŠΉμ§‘μœΌλ‘œ 저희가 λΆ„μœ„κΈ°λ„ 많이 λ°”κΏ”λ΄€κ³  또 μ˜€λŠ˜μ€ μ‚¬λž‘μ˜ κ³„μ ˆμ΄λ‹ˆλ§ŒνΌ λ‚˜λˆ”μ— λŒ€ν•΄μ„œ 이야기 ν•΄λ³ΌκΉŒ ν•©λ‹ˆλ‹€.
ν‰μ†Œ μƒν™œ μ†μ˜ λ‚˜λˆ”μ„ 늘 μ‹€μ²œν•˜κ³  κ³„μ‹œλŠ” λ„€ λΆ„ λͺ¨μ‹œκ³  이야기 λ‚˜λˆ λ³΄λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€ 그럼 λ„€ λΆ„ μ†Œκ°œν•΄ λ“œλ¦¬κ² μŠ΅λ‹ˆλ‹€ λ“€μ–΄μ˜€μ‹œμ£ .
λ„€ 였늘 생방솑 ν† λ‘  μΉ΄νŽ˜μ—μ„œλŠ” λ‚˜λˆ”μ˜ μ˜λ―Έμ— λŒ€ν•΄μ„œ 이야기 λ‚˜λˆ λ³ΌκΉŒ ν•˜λŠ”λ°μš”.
μ–΄ μ „μ•Όμ œ 이제 내일이면 크리슀마슀고 μ§€κΈˆ 아홉 μ‹œ μ‹­μ‚Ό λΆ„ μ§€λ‚˜κ³  μžˆκ±°λ“ μš” μ‚°νƒ€ν΄λ‘œμŠ€ 할아버지가 μƒλ‹Ήνžˆ 바빠진 그런 μ‹œκ°„μž…λ‹ˆλ‹€.
이럴 λ•Œ κ°€μ‘±κ³Ό λ˜λŠ” μΉœμ§€λ“€κ³Ό ν•¨κ»˜ 보내셔야 될 이 κ·€ν•œ μ‹œκ°„ λ‚΄ μ£Όμ…”μ„œ μ˜€μ‹  λ„€ λΆ„ λ¨Όμ € μ†Œκ°œν•΄ λ“œλ¦¬λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€.
λ¨Όμ € &party-name1&의 μœ„μ›μž…λ‹ˆλ‹€.
μ•ˆλ…•ν•˜μ„Έμš”.
그리고 μˆ­μ‹€λŒ€ν•™κ΅ μ‚¬νšŒμ‚¬μ—…ν•™κ³Όμ˜ κ΅μˆ˜μž…λ‹ˆλ‹€.
μ•ˆλ…•ν•˜μ„Έμš”.
그리고 μ•„λ¦„λ‹€μš΄ μž¬λ‹¨μ˜ μƒμž„μ΄μ‚¬ μž…λ‹ˆλ‹€.
μ•ˆλ…•ν•˜μ„Έμš”.
그리고 μ‚¬λž‘μ˜ μž₯κΈ°κΈ°μ¦μš΄λ™λ³ΈλΆ€μ— κ΅­μž₯λ‹˜μ΄μ‹­λ‹ˆλ‹€.
μ•ˆλ…•ν•˜μ„Έμš”.
μ΄λ ‡κ²Œ λ‚˜μ™€ μ£Όμ…”μ„œ λ‹€μ‹œ ν•œλ²ˆ κ°μ‚¬λ“œλ¦¬κ΅¬μš”.
그리고 였늘 νŠΉλ³„νžˆ 저희 ν† λ‘  μΉ΄νŽ˜μ—λŠ” μš©μ‚°κ΅¬ μžμ›λ΄‰μ‚¬ μ„Όν„°μ—μ„œ λ΄‰μ‚¬ν™œλ™μ„ 늘 ν•˜μ‹œλŠ” 뢄듀이 λ‚˜μ™€μ£Όμ…¨μŠ΅λ‹ˆλ‹€.
였늘 λ‚˜μ™€μ£Όμ‹  λΆ„λ“€ λ‹€μ‹œ ν•œλ²ˆ ν™˜μ˜ν•˜κ³  μ§„μ‹¬μœΌλ‘œ κ°μ‚¬λ“œλ¦½λ‹ˆλ‹€.
늘 이런 μ–˜κΈ°λ₯Ό ν•˜μ£  μš°λ¦¬μ‚¬νšŒμ—λŠ” 아직도 곡동체 μ˜μ‹μ΄ λΆ€μ‘±ν•˜λ‹€ λ‚˜λˆ”μ˜ μ˜μ‹μ΄ λΆ€μ‘±ν•˜λ‹€ κΈ°λΆ€ λ¬Έν™”κ°€ 정착돼 μžˆμ§€ μ•Šλ‹€ 그런 μ–˜κΈ°λ“€μ„ 많이 ν•˜λŠ”λ°μš”.
μ–΄λ–»κ²Œ ν•˜λ©΄ κ·ΈλŸ¬ν•œ λ”°λœ»ν•œ μš°λ¦¬λ“€μ˜ λ§ˆμŒμ„ 더 ν‚€μš°κ³  더 λ‚˜λˆŒ 수 있고 또 그런 것을 μ–΄λ– ν•œ μ œλ„μ  μž₯치둜 잘 보완해 λ‚˜κ°ˆ 수 μžˆμ„κΉŒ
그런 λ¬Έμ œλ“€μ„ ν•˜λ‚˜ν•˜λ‚˜ 이야기 λ‚˜λˆ λ³΄λ„λ‘ ν•˜κ² μŠ΅λ‹ˆλ‹€ λ‚˜λˆ”μ΄ λ„λŒ€μ²΄ μ™œ ν•„μš”ν•œμ§€ 그리고 원둠적인 μ–˜κΈ° κ² μ£ .
그것뢀터 ν•œλ²ˆ μ–˜κΈ°λ₯Ό ν•œλ²ˆ ν•΄ 볼까 ν•©λ‹ˆλ‹€ λ¨Όμ € λ³€ν˜Έμ‚¬λ‹˜κ»˜μ„œ μ–˜κΈ°ν•΄ μ£Όμ‹œκ² μŠ΅λ‹ˆκΉŒ.
자기 ν–‰λ³΅ν•˜κΈ° μœ„ν•΄μ„œμ£ .
{laughing} μ—­μ„€μ μœΌλ‘œ λ“€λ¦½λ‹ˆλ‹€.
사싀 기뢀라든지 λ‚˜λˆ”μ΄λΌλŠ” 게 자기 μ£Όλ¨Έλ‹ˆμ—μ„œ 돈이 λ‚˜κ°€λ‹ˆκΉŒ μžκΈ°ν•œν…Œ 손해가 될 것 같은데 μ‹€μ œλ‘œ λ‚˜λˆ λ³Έ μ‚¬λžŒλ§Œ μ••λ‹ˆλ‹€.
{laughing} 이게 μ–Όλ§ˆλ‚˜ μžκΈ°κ°€ 슀슀둜 ν–‰λ³΅ν•΄μ§€λŠ”μ§€ κ·Έλž˜μ„œ μš”μƒˆ 뭐 λ‚˜λˆ”κΈ°λΆ€μ€‘λ…μ΄λΌλŠ” 말도 μžˆκ΅¬μš”.
또 저희듀이 μ΄λ ‡κ²Œ μ„œμ–‘μ— 뭐 μ΄λ ‡κ²Œ λͺ¨κΈˆμ— κ΄€ν•œ 책을 읽어보면
κΈ°λΆ€ ν•΄ λ³Έ μ‚¬λžŒν•œν…Œ κ°€μ„œ 또 달라고 해라 이게 λͺ¨κΈˆν•˜λŠ” μ‚¬λžŒμ΄ 첫 번째 μ›μΉ™μœΌλ‘œ μ–˜κΈ°ν•΄μš”.
κ·Έ μ–˜κΈ°λŠ” 무슨 μ–˜κΈ°λƒλ©΄ ν•΄λ³Έ μ‚¬λžŒμ΄ μ¦κ±°μš°λ‹ˆκΉŒ λ˜ν•œ κ°€λŠ₯성이 λ§Žλ‹€λŠ” κ±°μ§€μš” μ•„λ§ˆ 이건 해보셔야 이거 μ œκ°€ 아무리 λ§μ”€λ“œλ €λ„ μ†Œμš©μ—†κ΅¬μš”.
μ‹€μ œ λ‚˜λˆ λ³΄μ…”μ•Ό κ·Έ 기쁨 즐거움을 μ•„μ‹œκ²Œ λ©λ‹ˆλ‹€.
κ²°κ΅­μ—λŠ” μžκΈ°ν•œν…Œ λŒμ•„μ˜¨λ‹€ 라고 ν•˜λŠ” 것이 μ„œμ–‘ μ‚¬λžŒλ“€μ—κ²Œ 많이 νŒ½λ°°ν–ˆλŠ”λ° μž₯기기증 같은 κ²½μš°μ—λ„ λ‚΄κ°€ 기증을 ν•˜λ©΄
음 그것이 결ꡭ은 λ‚˜ν•œν…Œ λŒμ•„μ˜¨λ‹€λŠ” κ·Έ μ΄μœ κ°€ 뭐냐 ν•˜λ©΄ λ‚΄κ°€ μ–Έμ œλ“ μ§€ ν™˜μžκ°€ λ˜μ—ˆμ„ λ•Œ
μ‚¬νšŒ μ „λ°˜μ μœΌλ‘œ κ·Έλ ‡κ²Œ κΈ°μ¦ν•˜λŠ” κ·Έ λ¬Έν™”κ°€ ν™•μ‚°λ˜λ©΄ λ‚΄κ°€ ν™˜μžκ°€ 됐을 λ•Œ 그것이 κ²°κ΅­ λ‚˜ν•œν…Œ ν˜œνƒμ΄ λŒμ•„μ˜¨λ‹€ 라고 ν•΄μ„œ
슀페인 같은 κ²½μš°μ—λŠ” 백만 λͺ…λ‹Ή 삼십사 λͺ…μœΌλ‘œ μ „ μ„Έκ³„μ μœΌλ‘œ κ°€μž₯ 많이 기증을 ν•˜κ³  μžˆλŠ”λ°
그런 μ˜μ‹μ΄ 결ꡭ은 λ‚΄κ²Œ λŒμ•„μ˜€λŠ” κ²ƒμ΄λ‹€λΌκ³  ν•˜λŠ” μ˜μ‹μ΄ νŒ½λ°°ν•˜κΈ° λ•Œλ¬Έμ— κ·Έλ ‡κ²Œ λœλ‹€κ³ ν•΄μš”.
백만 λͺ…λ‹Ή 삼십사 λͺ…μ΄λΌλŠ” 것은 μ‹€μ œ κΈ°μ¦ν•˜λŠ”
예 μˆ«μžκ°€
수치겠죠 그게 이루어지렀면 기증 μ„œμ•½μ€ ꡉμž₯히 더 λ§Žμ€ μ‚¬λžŒλ“€μ΄ ν•˜κ² λ„€μš”.
]"""
formatted_prompt = f"Instruction: {prompt}\n output:"

outputs = pipe_finetuned(
    formatted_prompt,
    do_sample=True,
    temperature=0.1,
    top_k=50,
    top_p=0.95,
    repetition_penalty=1.2,
    max_new_tokens=512,  # generation cap; adjust to the expected summary length
    add_special_tokens=True,
    streamer=streamer,
)

print(outputs[0]["generated_text"][len(formatted_prompt):])
```

### Results
```text
제λͺ©: λ‚˜λˆ”μ˜ μ˜λ―Έμ™€ ν•„μš”μ„±μ— λŒ€ν•œ ν† λ‘ 

1. λ‚˜λˆ”μ˜ μ˜λ―Έμ™€ μ€‘μš”μ„±
- λ‚˜λˆ”μ€ νŠΉμ • λ‚ μ§œμ—, νŠΉμ • μ‚¬λžŒλ“€κ³Ό ν•¨κ»˜ ν•˜λŠ” μ‹œκ°„μ„ μ˜λ―Έν•œλ‹€.
- νŠΉλ³„νžˆ, ν¬λ¦¬μŠ€λ§ˆμŠ€μ™€ μ‚°νƒ€ν΄λ‘œμŠ€λ₯Ό ν¬ν•¨ν•œ 일뢀 λ‚ μ§œμ—λŠ” κ°€μ‘±κ³Ό μΉœμ§€λ“€κ³Ό ν•¨κ»˜ λ‚˜λˆ”μ„ ν•  수 μžˆλ‹€.
- λ‚˜λˆ”μ€ κ°€μ‘±κ³Ό μΉœμ§€λ“€κ³Ό ν•¨κ»˜ λ³΄λ‚΄λŠ” μ‹œκ°„μ΄λΌλŠ” μ μ—μ„œ μ€‘μš”ν•˜λ‹€.

2. λ‚˜λˆ”μ˜ ν•„μš”μ„±
- 곡동체 μ˜μ‹μ΄ λΆ€μ‘±ν•˜λ©°, κΈ°λΆ€ λ¬Έν™”κ°€ μ •μ°©λ˜μ§€ μ•Šμ•˜λ‹€.
- λ‚˜λˆ”μ˜ μ˜μ‹μ΄ λΆ€μ‘±ν•˜μ—¬, κΈ°λΆ€ λ¬Έν™”κ°€ 잘 λ³΄μ™„λ˜μ§€ μ•Šμ•˜λ‹€.

3. λ‚˜λˆ”μ˜ 원둠적 μ–˜κΈ°
- λ‚˜λˆ”μ€ κΈ°λΆ€λ‘œλΆ€ν„° μ–»λŠ” 것이 μ•„λ‹ˆλΌ, κΈ°λΆ€λ₯Ό 톡해 μ–»λŠ” κ²ƒμ΄λΌλŠ” μ˜μ‹μ΄ ν•„μš”ν•˜λ‹€.
- λ‚˜λˆ”μ˜ μ˜μ‹μ„ ν‚€μš°κ³  λ‚˜λˆŒ 수 μžˆλ„λ‘ μ œλ„μ  μž₯μΉ˜κ°€ ν•„μš”ν•˜λ‹€.

4. λ‚˜λˆ”μ˜ μ˜μ‹μ— λŒ€ν•œ λ…Όμ˜
- λ³€ν˜Έμ‚¬λŠ” λ‚˜λˆ”μ΄ 자기 행볡을 μœ„ν•œ κ²ƒμ΄λΌλŠ” μ˜κ²¬μ„ μ œμ‹œν–ˆλ‹€.
- λ‚˜λˆ”μ΄ κΈ°λΆ€λ‘œλΆ€ν„° μ–»λŠ” κ²ƒμ΄λΌλŠ” μ˜κ²¬λ„ μ œμ‹œλ˜μ—ˆλ‹€.
- λ‚˜λˆ”μ΄ κ²°κ΅­ ν™˜μžκ°€ λ˜λŠ” κ²ƒμ΄λΌλŠ” μ˜κ²¬λ„ μ œμ‹œλ˜μ—ˆλ‹€.
```

### Inputs and outputs

* **Input:** Text string, such as a question, a prompt, or a document to be summarized.
* **Output:** Generated Korean text in response to the input, such as an answer to a question or a summary of meeting minutes.

### Software

Training was done using QLoRA.
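
The card does not publish the exact fine-tuning recipe, so the following is a minimal configuration sketch of a typical QLoRA setup with `transformers`, `peft`, and `bitsandbytes`. The rank, alpha, dropout, and target modules shown are illustrative assumptions, not the values used to train this model.

```python
# Illustrative QLoRA configuration (hyperparameters are assumptions,
# not the values actually used to train gemma-ko-v01).
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization of the frozen base weights -- the "Q" in QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b-it",  # base model this card fine-tunes from
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Trainable low-rank adapters on the attention projections (illustrative choice).
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the adapter weights are trainable
```

The wrapped model can then be passed to a standard `Trainer` or `SFTTrainer` loop; only the small adapter matrices receive gradients, which is what makes fine-tuning a 2B model feasible on a single GPU.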