model documentation

#1
by nazneen - opened
Files changed (1)
  1. README.md +189 -4
README.md CHANGED
@@ -1,16 +1,201 @@
  ---
  language: ko
  ---

- # Bert base model for Korean

- * 70GB Korean text dataset and 42000 lower-cased subwords are used

  * Check the model performance and other language models for Korean in [github](https://github.com/kiyoungkim1/LM-kor)

  ```python
- # only for pytorch in transformers
  from transformers import BertTokenizerFast, EncoderDecoderModel

  tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
  model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")
- ```

  ---
  language: ko
+ tags:
+ - text-2-text-generation
  ---

+ # Model Card for Bert base model for Korean
+
+ # Model Details
+
+ ## Model Description
+
+ More information needed.
+
+ - **Developed by:** Kiyoung Kim
+ - **Shared by [Optional]:** Kiyoung Kim
+ - **Model type:** Text2Text Generation
+ - **Language(s) (NLP):** Korean
+ - **License:** More information needed
+ - **Parent Model:** bert-base-multilingual-uncased
+ - **Resources for more information:**
+   - [GitHub Repo](https://github.com/kiyoungkim1/LM-kor)
+
+ # Uses
+
+ ## Direct Use
+ This model can be used for the task of text2text generation.
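
As a rough, non-authoritative illustration of such use, the checkpoint can be driven through the standard `transformers` encoder-decoder API. The Korean input sentence and the generation settings below are assumptions made for this sketch, and the shared BERT encoder-decoder generally needs task-specific fine-tuning before its outputs are useful.

```python
# Minimal text2text generation sketch (PyTorch); illustrative settings only.
from transformers import BertTokenizerFast, EncoderDecoderModel

tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")

# Example Korean input (made up for this sketch).
inputs = tokenizer("์•ˆ๋…•ํ•˜์„ธ์š”. ์˜ค๋Š˜ ๋‚ ์”จ๊ฐ€ ์ฐธ ์ข‹๋„ค์š”.", return_tensors="pt")

output_ids = model.generate(
    inputs.input_ids,
    attention_mask=inputs.attention_mask,
    decoder_start_token_id=tokenizer.cls_token_id,  # assumption: decoding starts from [CLS]
    eos_token_id=tokenizer.sep_token_id,
    pad_token_id=tokenizer.pad_token_id,
    max_length=64,
    num_beams=4,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```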
+
+ ## Downstream Use [Optional]
+
+ More information needed.
+
+ ## Out-of-Scope Use
+
+ The model should not be used to intentionally create hostile or alienating environments for people.
+
+ # Bias, Risks, and Limitations
+
+ Significant research has explored bias and fairness issues with language models (see, e.g., [Sheng et al. (2021)](https://aclanthology.org/2021.acl-long.330.pdf) and [Bender et al. (2021)](https://dl.acm.org/doi/pdf/10.1145/3442188.3445922)). Predictions generated by the model may include disturbing and harmful stereotypes across protected classes; identity characteristics; and sensitive, social, and occupational groups.
+
+ ## Recommendations
+
+ Users (both direct and downstream) should be made aware of the risks, biases and limitations of the model. More information needed for further recommendations.
+
+ # Training Details
+
+ ## Training Data
+ * Trained on 70GB of Korean text with a vocabulary of 42,000 lower-cased subwords
+
+ The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor):
+
+ > The data used for training is as follows:
+ > 1.) 100 million reviews from major Korean e-commerce sites plus 20 million blog-style web pages (75GB)
+ > 2.) ๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜ (Modu Corpus) (18GB)
+ > 3.) Wikipedia and Namuwiki (6GB)
+ > After excluding unnecessary or very short sentences and duplicates, 70GB (about 12.7 billion tokens) of the 100GB of text was finally used for training.
+ > The data is grouped into categories such as cosmetics (8GB), food (6GB), electronics (13GB), and pets (2GB), and was also used to train domain-specific language models.
+
+ ## Training Procedure
+
+ ### Preprocessing
+
+ The model authors also note in the [GitHub Repo](https://github.com/kiyoungkim1/LM-kor):
+ > Whole-word masking was applied to the BERT model.
+
+ > Characters other than Hangul, English letters, numbers and some special characters were removed because they were judged to hinder training (for example Chinese characters and emoji).
+ > 40,000 subwords were created with the WordPiece model of [Huggingface tokenizers](https://github.com/huggingface/tokenizers).
+ > In addition, 2,000 unused tokens were included in training; the unused tokens are meant to hold domain-specific terms.
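
A rough sketch of how such a vocabulary could be built with [Huggingface tokenizers](https://github.com/huggingface/tokenizers) is shown below; the corpus file, casing options, and the way the 2,000 `[unusedN]` slots are reserved are assumptions, not the authors' actual script.

```python
# Hedged sketch: train a lower-cased WordPiece vocabulary with `tokenizers`
# and reserve [unusedN] placeholder slots for domain-specific terms.
from tokenizers import BertWordPieceTokenizer

unused_tokens = [f"[unused{i}]" for i in range(2000)]  # assumed naming for the reserved slots

tokenizer = BertWordPieceTokenizer(lowercase=True)
tokenizer.train(
    files=["corpus.txt"],  # hypothetical one-sentence-per-line corpus file
    vocab_size=42000,      # assumption: ~40k learned subwords + 2k reserved slots
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"] + unused_tokens,
)
tokenizer.save_model(".")  # writes vocab.txt
```

For the whole-word-masking objective mentioned above, `transformers` provides a ready-made collator; again, this is only an illustration of the technique, not the original pretraining code.

```python
# Hedged sketch: whole-word-masking batches for masked language modeling.
from transformers import BertTokenizerFast, DataCollatorForWholeWordMask

tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
collator = DataCollatorForWholeWordMask(tokenizer=tokenizer, mlm_probability=0.15)

batch = collator([tokenizer("๋„๋ฉ”์ธ ํŠนํ™” ์–ธ์–ด๋ชจ๋ธ์„ ํ•™์Šตํ•ฉ๋‹ˆ๋‹ค.")])  # masked input_ids and labels
```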
+
+ ### Speeds, Sizes, Times
+
+ More information needed
+
+ # Evaluation
+
+ ## Testing Data, Factors & Metrics
+
+ ### Testing Data
+
+ More information needed
+
+ ### Factors
+
+ More information needed
+
+ ### Metrics
+
+ More information needed
+
+ ## Results
+
  * Check the model performance and other language models for Korean in [github](https://github.com/kiyoungkim1/LM-kor)

+ |                    | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **Korean-Hate-Speech (Dev)**<br/>(F1) |
+ | :----------------- | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :-----------------------------------: |
+ | kcbert-base        | 89.87 | 85.00 | 67.40 | 75.57 | 75.94 | 93.93 | **68.78** |
+ | **OURS**           |       |       |       |       |       |       |           |
+ | **bert-kor-base**  | 90.87 | 87.27 | 82.80 | 82.32 | 84.31 | 95.25 | 68.45 |
+
+ # Model Examination
+
+ More information needed
+
+ # Environmental Impact
+
+ Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ - **Hardware Type:** More information needed
+ - **Hours used:** More information needed
+ - **Cloud Provider:** More information needed
+ - **Compute Region:** More information needed
+ - **Carbon Emitted:** More information needed
+
+ # Technical Specifications [optional]
+
+ ## Model Architecture and Objective
+
+ More information needed
+
+ ## Compute Infrastructure
+
+ More information needed
+
+ ### Hardware
+
+ More information needed
+
+ ### Software
+
+ More information needed.
+
+ # Citation
+
+ **BibTeX:**
+
+ ```bibtex
+ @misc{kim2020lmkor,
+   author = {Kiyoung Kim},
+   title = {Pretrained Language Models For Korean},
+   year = {2020},
+   publisher = {GitHub},
+   howpublished = {\url{https://github.com/kiyoungkim1/LMkor}}
+ }
+ ```
+
+ # Glossary [optional]
+
+ More information needed
+
+ # More Information [optional]
+ * Cloud TPUs were provided by the [TensorFlow Research Cloud (TFRC)](https://www.tensorflow.org/tfrc/) program.
+
+ * Also, [๋ชจ๋‘์˜ ๋ง๋ญ‰์น˜](https://corpus.korean.go.kr/) (Modu Corpus) was used as pretraining data.
+
+ # Model Card Authors [optional]
+
+ Kiyoung Kim in collaboration with Ezi Ozoani and the Hugging Face team
+
+ # Model Card Contact
+
+ More information needed
+
+ # How to Get Started with the Model
+
+ Use the code below to get started with the model.
+
+ <details>
+ <summary> Click to expand </summary>
+
  ```python
- # only for pytorch in transformers
  from transformers import BertTokenizerFast, EncoderDecoderModel

  tokenizer = BertTokenizerFast.from_pretrained("kykim/bertshared-kor-base")
  model = EncoderDecoderModel.from_pretrained("kykim/bertshared-kor-base")
+ ```
+ </details>