# RoBERTa-base Korean

## Model Description

This RoBERTa model was pretrained at the **syllable** level on a variety of Korean text datasets.
It uses a custom-built Korean syllable-level vocab; the toy sketch below illustrates the idea.
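As a rough illustration of what syllable-level tokenization means (a toy sketch, not the repository's actual `SyllableTokenizer`):

```python
# Toy sketch: at the syllable level, every Hangul syllable becomes its
# own token, which keeps the vocabulary very small (1,428 entries here).
text = "한국어 모델"
tokens = [ch for ch in text if not ch.isspace()]
print(tokens)  # ['한', '국', '어', '모', '델']
```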
## Architecture

- **Model type**: RoBERTa
- **Architecture**: RobertaForMaskedLM
- **Model size**: hidden size 128, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2048
- **vocab_size**: 1428
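For reference, a model with these dimensions can be sketched with the standard `transformers` classes; this is only a reconstruction from the numbers above, and the published weights should instead be loaded with `from_pretrained` as shown in the usage section:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Reconstruct the architecture described above (sketch only; load the
# released checkpoint with from_pretrained for actual use)
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=128,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```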

## Training Data

The following datasets were used:
- **Modu Corpus (모두의말뭉치)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia

The combined data amounts to roughly 11GB.

## Training Details

- **BATCH_SIZE**: 112 per GPU
- **ACCUMULATE**: 36 (gradient accumulation)
- **MAX_STEPS**: 12,500
- **Train steps × batch size**: **100M** (12,500 steps × 112 per-GPU batch × 36 accumulation × 2 GPUs ≈ 100.8M sequences)
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6
- **LR schedule**: linear decay (sketched below)
- **Hardware**: 2x RTX 8000 GPUs
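A minimal sketch of this optimizer and schedule, assuming plain PyTorch plus `transformers`' `get_linear_schedule_with_warmup` (the actual training script is not part of this README, so treat the wiring as an assumption):

```python
import torch
from transformers import get_linear_schedule_with_warmup

MAX_STEPS = 12_500
WARMUP_STEPS = 2_400

# AdamW with the hyperparameters listed above; `model` is the
# RobertaForMaskedLM instance from the architecture sketch.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=1e-3, betas=(0.9, 0.98), eps=1e-6
)
# Linear warmup for 2,400 steps, then linear decay to zero at step 12,500
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=WARMUP_STEPS, num_training_steps=MAX_STEPS
)
```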

## How to Use

Because the tokenizer is syllable-level rather than WordPiece, you must use `SyllableTokenizer` instead of `AutoTokenizer`. Import it from the `syllabletokenizer.py` file provided in this repository.

```python
from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer

# Load the model and the syllable-level tokenizer
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer_kwargs = {}  # fill in any extra tokenizer options your setup needs
tokenizer = SyllableTokenizer(vocab_file="vocab.json", **tokenizer_kwargs)

# Convert text to tokens and run a forward pass
inputs = tokenizer("여기에 한국어 텍스트 입력", return_tensors="pt")
outputs = model(**inputs)
```
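
If `SyllableTokenizer` follows the standard Hugging Face tokenizer interface (the `mask_token`, `mask_token_id`, and `convert_ids_to_tokens` attributes used below are assumptions, not confirmed by this repo), the masked-LM head can be queried roughly like this:

```python
import torch

# Sketch: predict the most likely syllable at a masked position
text = f"한국어 {tokenizer.mask_token}델"  # masking the syllable '모'
masked = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**masked).logits

# Locate the mask token and take the argmax over the syllable vocabulary
mask_pos = (masked["input_ids"][0] == tokenizer.mask_token_id).nonzero(as_tuple=True)[0]
top_ids = logits[0, mask_pos].argmax(dim=-1)
print(tokenizer.convert_ids_to_tokens(top_ids.tolist()))
```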