chyun committed on
Commit
2566ad9
•
1 Parent(s): 963120c

Update README.md

Files changed (1)
  1. README.md +1 -365
README.md CHANGED
@@ -3,368 +3,4 @@ language: ko
  license: apache-2.0
  tags:
  - korean
- ---
-
- # KcBERT: Korean comments BERT
-
- **Updates on 2021.04.07**
-
- - KcELECTRA has been released! 🤗
- - Trained on a larger dataset with a larger general vocabulary, KcELECTRA shows **higher performance than KcBERT on every task**.
- - Try it yourself from the GitHub link below!
- - https://github.com/Beomi/KcELECTRA
-
- **Updates on 2021.03.14**
-
- - Added the citation entry for the KcBERT paper (BibTeX).
- - Added the KcBERT-finetune performance scores to the main text.
-
- **Updates on 2020.12.04**
-
- Following the update of Huggingface Transformers to v4.0.0, parts of the tutorial code have changed.
-
- Updated KcBERT-Large NSMC Finetuning Colab: <a href="https://colab.research.google.com/drive/1dFC0FL-521m7CL_PSd8RLKq67jgTJVhL?usp=sharing">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- **Updates on 2020.09.11**
-
- A tutorial for training KcBERT on a TPU in Google Colab is now available! Click the button below.
-
- Pretrain KcBERT on a TPU in Colab: <a href="https://colab.research.google.com/drive/1lYBYtaXqt9S733OXdXvrvC09ysKFN30W">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- The tutorial trains on a subset (144MB) of the full 12GB of text; only the amount of text is reduced.
-
- It uses the [Korpora](https://github.com/ko-nlp/Korpora) package, which makes Korean datasets and corpora easier to use.
-
- **Updates on 2020.09.08**
-
- The training data has been uploaded as a GitHub Release.
-
- Because of the 2GB-per-file limit, it is split into a multi-part compressed archive.
-
- Download it from the link below. (No sign-up required; split archive.)
-
- If you would rather download a single file or explore the data on Kaggle, use the Kaggle dataset below.
-
- - GitHub release: https://github.com/Beomi/KcBERT/releases/tag/TrainData_v1
-
- **Updates on 2020.08.22**
-
- Pretrain dataset released
-
- - Kaggle: https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments (downloadable as a single file)
-
- The dataset cleaned for training (processed with the `clean` function below) has been released on Kaggle!
-
- Download it and try training on a variety of tasks :)
-
- ---
-
- Most publicly released Korean BERT models are trained on well-curated data such as Korean Wikipedia, news articles, and books. In contrast, comment-style datasets such as NSMC are uncurated: they are colloquial, full of neologisms, and frequently contain typos and other expressions that do not appear in formal writing.
-
- KcBERT is a pretrained BERT model built for datasets with these characteristics: comments and replies were collected from Naver News, and both the tokenizer and the BERT model were trained from scratch.
-
- KcBERT can be loaded and used directly through Huggingface's Transformers library. (No separate file download is required.)
-
- ## KcBERT Performance
-
- - Finetuning code: https://github.com/Beomi/KcBERT-finetune
-
- | | Size<br/>(file size) | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuAD (Dev)**<br/>(EM/F1) |
- | :-------------------- | :---: | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: |
- | KcBERT-Base | 417M | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
- | KcBERT-Large | 1.2G | **90.68** | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
- | KoBERT | 351M | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
- | XLM-Roberta-Base | 1.03G | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
- | HanBERT | 614M | 90.16 | **87.31** | 82.40 | **80.89** | 83.33 | 94.19 | 78.74 / 92.02 |
- | KoELECTRA-Base | 423M | **90.21** | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
- | KoELECTRA-Base-v2 | 423M | 89.70 | 87.02 | **83.90** | 80.61 | **84.30** | **94.72** | **84.34 / 92.58** |
- | DistilKoBERT | 108M | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |
-
-
- \*The size listed for HanBERT is the combined size of the BERT model and the tokenizer DB.
-
- \***These results were obtained with the settings in the config left unchanged; additional hyperparameter tuning may yield better performance.**
-
- ## How to use
-
- ### Requirements
-
- - `pytorch <= 1.8.0`
- - `transformers ~= 3.0.1`
- - `transformers ~= 4.0.0` is also compatible.
- - `emoji ~= 0.6.0`
- - `soynlp ~= 0.0.493`
-
- ```python
- # Note: AutoModelWithLMHead is deprecated in newer Transformers releases
- # in favor of task-specific classes such as AutoModelForMaskedLM.
- from transformers import AutoTokenizer, AutoModelWithLMHead
-
- # Base Model (108M)
-
- tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-base")
-
- model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-base")
-
- # Large Model (334M)
-
- tokenizer = AutoTokenizer.from_pretrained("beomi/kcbert-large")
-
- model = AutoModelWithLMHead.from_pretrained("beomi/kcbert-large")
- ```
-
- ### Pretrain & Finetune Colab links
-
- #### Pretrain Data
-
- - [Dataset download (Kaggle, single file, login required)](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)
- - [Dataset download (GitHub, multiple compressed files, no login required)](https://github.com/Beomi/KcBERT/releases/tag/TrainData_v1)
-
- #### Pretrain Code
-
- Pretrain KcBERT on a TPU in Colab: <a href="https://colab.research.google.com/drive/1lYBYtaXqt9S733OXdXvrvC09ysKFN30W">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- #### Finetune Samples
-
- **KcBERT-Base** NSMC Finetuning with PyTorch-Lightning (Colab) <a href="https://colab.research.google.com/drive/1fn4sVJ82BrrInjq6y5655CYPP-1UKCLb?usp=sharing">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- **KcBERT-Large** NSMC Finetuning with PyTorch-Lightning (Colab) <a href="https://colab.research.google.com/drive/1dFC0FL-521m7CL_PSd8RLKq67jgTJVhL?usp=sharing">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- > The two notebooks above differ only in the pretrained model (base, large) and the batch size; the rest of the code is identical.
-
- ## Train Data & Preprocessing
-
- ### Raw Data
-
- The training data consists of all **comments and replies** collected from **heavily commented news** articles written between 2019.01.01 and 2020.06.15.
-
- With only the text extracted, the data is **about 15.4GB and contains more than 110 million sentences**.
-
- ### Preprocessing
-
- The preprocessing applied for PLM training is as follows.
-
- 1. Korean, English, special characters, and even emoji (🥳)!
-
-    Using a regular expression, Korean, English, special characters, and emoji are all kept as training targets.
-
-    The Hangul range is specified as `ㄱ-ㅎ가-힣`, which excludes the Hanja (Chinese characters) that fall inside `ㄱ-힣`.
-
- 2. Collapsing repeated character sequences within comments
-
-    Repeated characters such as `ㅋㅋㅋㅋㅋ` are collapsed into a form such as `ㅋㅋ`.
-
- 3. Cased Model
-
-    KcBERT is a cased model that preserves letter case for English text.
-
- 4. Removing texts of 10 characters or fewer
-
-    Texts shorter than 10 characters often consist of a single word, so they were excluded.
-
- 5. Deduplication
-
-    To remove comments that were posted repeatedly, duplicate comments were merged into one.
-
- The resulting final training data is **12.5GB, 89 million sentences**.
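
Steps 4 and 5 can be sketched in plain Python. This is a minimal illustration, not the author's pipeline: the function name `filter_and_dedupe` and the `min_chars` parameter are hypothetical. The step title says "10 characters or fewer" while its description says "shorter than 10 characters", so the exact cutoff is an assumption and is left as a parameter.

```python
from typing import Iterable, List


def filter_and_dedupe(comments: Iterable[str], min_chars: int = 10) -> List[str]:
    """Drop short comments and keep one copy of each duplicate.

    min_chars is a hypothetical parameter: comments with fewer than
    min_chars characters (after stripping whitespace) are removed.
    """
    seen = set()
    kept = []
    for comment in comments:
        text = comment.strip()
        if len(text) < min_chars:
            continue  # step 4: too short, likely a single word
        if text in seen:
            continue  # step 5: exact duplicate of an earlier comment
        seen.add(text)
        kept.append(text)
    return kept


sample = [
    "ㅋㅋ",
    "오늘은 날씨가 정말 좋네요",
    "오늘은 날씨가 정말 좋네요",
    "This comment is long enough.",
]
print(filter_and_dedupe(sample))
```

Only exact string matches are treated as duplicates here; whether the original pipeline also merged near-duplicates is not stated in the text.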
-
- After installing the packages with the pip command below, cleaning text with the `clean` function below improves performance on downstream tasks (fewer `[UNK]` tokens).
-
- ```bash
- pip install soynlp emoji
- ```
-
- Apply the `clean` function below to your text data.
-
- ```python
- import re
- import emoji
- from soynlp.normalizer import repeat_normalize
-
- # Collect every emoji character listed by the `emoji` package.
- emojis = list({y for x in emoji.UNICODE_EMOJI.values() for y in x.keys()})
- emojis = ''.join(emojis)
- # Match runs of characters outside the allowed set
- # (listed punctuation, ASCII, Hangul jamo and syllables, emoji).
- pattern = re.compile(f'[^ .,?!/@$%~％·∼()\x00-\x7Fㄱ-ㅣ가-힣{emojis}]+')
- url_pattern = re.compile(
-     r'https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)')
-
- def clean(x):
-     x = pattern.sub(' ', x)                  # replace disallowed characters with a space
-     x = url_pattern.sub('', x)               # drop URLs
-     x = x.strip()
-     x = repeat_normalize(x, num_repeats=2)   # collapse repeats, e.g. ㅋㅋㅋㅋㅋ -> ㅋㅋ
-     return x
- ```
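
The range remark in preprocessing step 1 can be checked with the standard `re` module alone. The sketch below is only an illustration, and its variable names are made up here. A single span from `ㄱ` to `힣` also covers the CJK ideograph block, whereas the two separate spans used in `pattern` above do not.

```python
import re

# One contiguous range from ㄱ (U+3131) to 힣 (U+D7A3); CJK ideographs
# such as 漢 (U+6F22) fall inside it.
wide_range = re.compile('[ㄱ-힣]')
# Two separate spans, as in `pattern` above: compatibility jamo and
# precomposed Hangul syllables only.
hangul_only = re.compile('[ㄱ-ㅣ가-힣]')

for ch in ['한', 'ㅋ', '漢']:
    print(ch, bool(wide_range.fullmatch(ch)), bool(hangul_only.fullmatch(ch)))
```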
-
- ### Cleaned Data (Released on Kaggle)
-
- The 12GB txt file produced by cleaning the raw data with the `clean` function above can be downloaded from the Kaggle dataset below :)
-
- https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments
-
-
- ## Tokenizer Train
-
- The tokenizer was trained with Huggingface's [Tokenizers](https://github.com/huggingface/tokenizers) library.
-
- Specifically, `BertWordPieceTokenizer` was used, with a vocab size of `30000`.
-
- The tokenizer was trained on a `1/10` sample of the data; to sample more evenly, the sample was stratified by date.
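
Stratifying the 1/10 sample by date can be sketched as follows. This is an illustration built on assumptions, not the original sampling code: `stratified_sample`, the `(date, text)` record layout, and the fixed seed are all hypothetical.

```python
import random
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def stratified_sample(
    records: Iterable[Tuple[str, str]], fraction: float = 0.1, seed: int = 0
) -> List[str]:
    """Sample `fraction` of the texts from every date separately."""
    by_date: Dict[str, List[str]] = defaultdict(list)
    for date, text in records:
        by_date[date].append(text)

    rng = random.Random(seed)
    sampled: List[str] = []
    for date in sorted(by_date):
        texts = by_date[date]
        # Keep at least one text per date so no day disappears entirely.
        k = max(1, round(len(texts) * fraction))
        sampled.extend(rng.sample(texts, k))
    return sampled


# Toy data: 3 dates with 20 comments each.
records = [(f"2020-01-{day:02d}", f"comment {day}-{i}")
           for day in (1, 2, 3) for i in range(20)]
sample = stratified_sample(records)
print(len(sample))
```

Sampling within each date keeps busy and quiet days represented in proportion, which a single global random sample would only achieve on average. The sampled texts could then be written to a file and passed to `BertWordPieceTokenizer.train` with `vocab_size=30000`.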
-
- ## BERT Model Pretrain
-
- - KcBERT Base config
-
- ```json
- {
-     "max_position_embeddings": 300,
-     "hidden_dropout_prob": 0.1,
-     "hidden_act": "gelu",
-     "initializer_range": 0.02,
-     "num_hidden_layers": 12,
-     "type_vocab_size": 2,
-     "vocab_size": 30000,
-     "hidden_size": 768,
-     "attention_probs_dropout_prob": 0.1,
-     "directionality": "bidi",
-     "num_attention_heads": 12,
-     "intermediate_size": 3072,
-     "architectures": [
-         "BertForMaskedLM"
-     ],
-     "model_type": "bert"
- }
- ```
-
- - KcBERT Large config
-
- ```json
- {
-     "type_vocab_size": 2,
-     "initializer_range": 0.02,
-     "max_position_embeddings": 300,
-     "vocab_size": 30000,
-     "hidden_size": 1024,
-     "hidden_dropout_prob": 0.1,
-     "model_type": "bert",
-     "directionality": "bidi",
-     "pad_token_id": 0,
-     "layer_norm_eps": 1e-12,
-     "hidden_act": "gelu",
-     "num_hidden_layers": 24,
-     "num_attention_heads": 16,
-     "attention_probs_dropout_prob": 0.1,
-     "intermediate_size": 4096,
-     "architectures": [
-         "BertForMaskedLM"
-     ]
- }
- ```
-
- The BERT model configs use the default Base and Large settings as-is (MLM 15%, etc.).
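
The model sizes quoted in the usage snippet (108M for Base, 334M for Large) can be cross-checked against the two configs with a little arithmetic. The sketch below assumes the standard BERT layout: token, position, and segment embeddings; per-layer self-attention and feed-forward blocks with biases and LayerNorm; and a pooler. It ignores the masked-LM head. `count_bert_params` is a hypothetical helper, not part of any library.

```python
def count_bert_params(vocab_size, max_position_embeddings, type_vocab_size,
                      hidden_size, intermediate_size, num_hidden_layers):
    """Count encoder parameters for a standard BERT layout (assumed)."""
    h, i = hidden_size, intermediate_size
    layer_norm = 2 * h  # weight and bias
    embeddings = (vocab_size + max_position_embeddings + type_vocab_size) * h + layer_norm
    attention = 4 * (h * h + h) + layer_norm  # Q, K, V, and output projections
    feed_forward = (h * i + i) + (i * h + h) + layer_norm
    pooler = h * h + h
    return embeddings + num_hidden_layers * (attention + feed_forward) + pooler


# Values taken from the Base and Large configs above.
base = count_bert_params(30000, 300, 2, 768, 3072, 12)
large = count_bert_params(30000, 300, 2, 1024, 4096, 24)
print(f"base:  {base / 1e6:.1f}M")
print(f"large: {large / 1e6:.1f}M")
```

Under these assumptions the counts come out at roughly 108.9M and 334.4M, in line with the sizes noted in the usage snippet.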
-
- Training ran on a TPU `v3-8` for 3 days (Base) and N days (Large, training still in progress), respectively; the models currently published on Huggingface are checkpoints trained for 1M (one million) steps.
-
- The training loss drops fastest during the first 200k steps and decreases only gradually after 400k steps.
-
- - Base Model Loss
-
- ![KcBERT-Base Pretraining Loss](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200719183852243.38b124.png)
-
- - Large Model Loss
-
- ![KcBERT-Large Pretraining Loss](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200806160746694.d56fa1.png)
-
- Training was run on a GCP TPU v3-8 and took about 2.5 days for the Base model. The Large model was trained for about 5 days, and the checkpoint with the lowest loss was selected.
-
- ## Example
-
- ### HuggingFace MASK LM
-
- You can try the model on the [HuggingFace kcbert-base model](https://huggingface.co/beomi/kcbert-base?text=오늘은+날씨가+[MASK]) page, as shown below.
-
- ![오늘은 날씨가 "좋네요" (The weather today is "nice"), KcBERT-Base](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200719205919389.5670d6.png)
-
- Of course, you can also test it on the [kcbert-large model](https://huggingface.co/beomi/kcbert-large?text=오늘은+날씨가+[MASK]) page.
-
- ![image-20200806160624340](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200806160624340.58f9be.png)
-
-
-
- ### NSMC Binary Classification
-
- Performance was briefly tested by fine-tuning on the [Naver Sentiment Movie Corpus (NSMC)](https://github.com/e9t/nsmc) dataset.
-
- The code for fine-tuning the Base model can be run directly from this notebook: <a href="https://colab.research.google.com/drive/1fn4sVJ82BrrInjq6y5655CYPP-1UKCLb?usp=sharing">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- The code for fine-tuning the Large model can be run directly from this notebook: <a href="https://colab.research.google.com/drive/1dFC0FL-521m7CL_PSd8RLKq67jgTJVhL?usp=sharing">
- <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
- </a>
-
- - On a single P100 GPU, one epoch takes 2-3 hours; on a TPU, one epoch takes less than 1 hour.
- - On four RTX Titan GPUs, one epoch takes about 30 minutes.
- - The example code was developed with [pytorch-lightning](https://github.com/PyTorchLightning/pytorch-lightning).
-
- #### Results
-
- - KcBERT-Base model result: Val acc `.8905`
-
- ![KcBERT Base finetune on NSMC](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200719201102895.ddbdfc.png)
-
- - KcBERT-Large model result: Val acc `.9089`
-
- ![image-20200806190242834](https://raw.githubusercontent.com/Beomi/KcBERT/master/img/image-20200806190242834.56d6ee.png)
-
- > Tests on a wider range of downstream tasks are planned and will be published.
-
- ## Citation
-
- When citing KcBERT, please use the format below.
-
- ```
- @inproceedings{lee2020kcbert,
-   title={KcBERT: Korean Comments BERT},
-   author={Lee, Junbum},
-   booktitle={Proceedings of the 32nd Annual Conference on Human and Cognitive Language Technology},
-   pages={437--440},
-   year={2020}
- }
- ```
-
- - Proceedings download link: http://hclt.kr/dwn/?v=bG5iOmNvbmZlcmVuY2U7aWR4OjMy (*or http://hclt.kr/symp/?lnb=conference )
-
- ## Acknowledgement
-
- The GCP/TPU environment used to train the KcBERT models was supported by the [TFRC](https://www.tensorflow.org/tfrc?hl=ko) program.
-
- Thanks to [Monologg](https://github.com/monologg/) for the extensive advice during model training :)
-
- ## Reference
-
- ### Github Repos
-
- - [BERT by Google](https://github.com/google-research/bert)
- - [KoBERT by SKT](https://github.com/SKTBrain/KoBERT)
- - [KoELECTRA by Monologg](https://github.com/monologg/KoELECTRA/)
-
- - [Transformers by Huggingface](https://github.com/huggingface/transformers)
- - [Tokenizers by Huggingface](https://github.com/huggingface/tokenizers)
-
- ### Papers
-
- - [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/abs/1810.04805)
-
- ### Blogs
-
- - [Monologg's KoELECTRA training notes](https://monologg.kr/categories/NLP/ELECTRA/)
- - [Training BERT from scratch on a TPU in Colab - Tensorflow/Google ver.](https://beomi.github.io/2020/02/26/Train-BERT-from-scratch-on-colab-TPU-Tensorflow-ver/)
-
 
+ ---