---
license: apache-2.0
datasets:
- klue/klue
language:
- ko
metrics:
- f1
- accuracy
- pearsonr
---
# RoBERTa-base Korean

## ๋ชจ๋ธ ์„ค๋ช…
์ด RoBERTa ๋ชจ๋ธ์€ ๋‹ค์–‘ํ•œ ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ๋ฐ์ดํ„ฐ์…‹์—์„œ **์Œ์ ˆ** ๋‹จ์œ„๋กœ ์‚ฌ์ „ ํ•™์Šต๋˜์—ˆ์Šต๋‹ˆ๋‹ค.
์ž์ฒด ๊ตฌ์ถ•ํ•œ ํ•œ๊ตญ์–ด ์Œ์ ˆ ๋‹จ์œ„ vocab์„ ์‚ฌ์šฉํ•˜์˜€์Šต๋‹ˆ๋‹ค.

## Architecture
- **Model type**: RoBERTa
- **Architecture**: RobertaForMaskedLM
- **Model size**: hidden size 512, 8 hidden layers, 8 attention heads
- **max_position_embeddings**: 514
- **intermediate_size**: 2,048
- **vocab_size**: 1,428
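
For reference, the configuration above can be rebuilt with a `transformers` `RobertaConfig`. This is a sketch for orientation only; the released checkpoint already bundles its config, so prefer `from_pretrained` in practice:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

# Rebuild the architecture described above from scratch
# (sketch only; the checkpoint ships with this configuration).
config = RobertaConfig(
    vocab_size=1428,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,
    max_position_embeddings=514,
)
model = RobertaForMaskedLM(config)
print(f"{model.num_parameters():,} parameters")
```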

## Training Data
The following datasets were used:
- **Modu Corpus (๋ชจ๋‘์˜๋ง๋ญ‰์น˜)**: chat, message boards, everyday conversation, news, broadcast scripts, books, etc.
- **AIHUB**: SNS, YouTube comments, book sentences
- **Other**: Namuwiki, Korean Wikipedia

The combined data comes to **about 11GB** (**4B tokens**).

## Training Details
- **BATCH_SIZE**: 196 per GPU
- **ACCUMULATE**: 20 gradient-accumulation steps
- **Total_BATCH_SIZE**: 8,232
- **MAX_STEPS**: 12,500
- **TRAIN_STEPS * BATCH_SIZE**: **~100M** training samples
- **WARMUP_STEPS**: 2,400
- **Optimizer**: AdamW, LR 1e-3, betas (0.9, 0.98), eps 1e-6 (see the sketch after this list)
- **LR schedule**: linear decay
- **Hardware**: 2x A6000ada GPUs


![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/S-3zdDXVMZnyEVrZdQ7J3.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/3VwE53iLqKtc-gMQXOV_L.png)


## Performance Evaluation
- **Performance was evaluated on the KLUE benchmark test sets.**
- Because the model is much smaller than klue-roberta-base, its absolute scores are lower, but the hidden-size-512 model performs well relative to its size.

![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/I8e60cf9w-IQCHDgKiooq.png)
![image/png](https://cdn-uploads.huggingface.co/production/uploads/64a0fd6fd3149e05bc5260dd/hkc5ko9Vo-pkKmtouN7xc.png)


## ์‚ฌ์šฉ ๋ฐฉ๋ฒ•
### tokenizer์˜ ๊ฒฝ์šฐ wordpiece๊ฐ€ ์•„๋‹Œ syllable ๋‹จ์œ„์ด๊ธฐ์— AutoTokenizer๊ฐ€ ์•„๋‹ˆ๋ผ SyllableTokenizer๋ฅผ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค. 
### (๋ ˆํฌ์—์„œ ์ œ๊ณตํ•˜๊ณ  ์žˆ๋Š” syllabletokenizer.py๋ฅผ ๊ฐ€์ ธ์™€์„œ ์‚ฌ์šฉํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.)

```python
from transformers import AutoModelForMaskedLM
from syllabletokenizer import SyllableTokenizer

# Load the model and the syllable-level tokenizer
model = AutoModelForMaskedLM.from_pretrained("Trofish/korean_syllable_roberta")
tokenizer_kwargs = {}  # any extra tokenizer options (placeholder)
tokenizer = SyllableTokenizer(vocab_file='vocab.json', **tokenizer_kwargs)

# Tokenize text and run a forward pass
inputs = tokenizer("์—ฌ๊ธฐ์— ํ•œ๊ตญ์–ด ํ…์ŠคํŠธ ์ž…๋ ฅ", return_tensors="pt")
outputs = model(**inputs)
```
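
To inspect masked-token predictions, you can take the argmax of the MLM logits at the masked position. This assumes the tokenizer defines `mask_token` / `mask_token_id` and `decode`, as standard RoBERTa tokenizers do:

```python
import torch

# Mask one syllable and predict it from the MLM head
# (assumes a defined mask token, as in standard RoBERTa tokenizers).
masked = f"ํ•œ๊ตญ{tokenizer.mask_token} ํ…์ŠคํŠธ"
inputs = tokenizer(masked, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
mask_pos = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)
print(tokenizer.decode(logits[mask_pos].argmax(-1)))
```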

## Citation
**KLUE**
```
@misc{park2021klue,
      title={KLUE: Korean Language Understanding Evaluation}, 
      author={Sungjoon Park and Jihyung Moon and Sungdong Kim and Won Ik Cho and Jiyoon Han and Jangwon Park and Chisung Song and Junseong Kim and Yongsook Song and Taehwan Oh and Joohong Lee and Juhyun Oh and Sungwon Lyu and Younghoon Jeong and Inkwon Lee and Sangwoo Seo and Dongjun Lee and Hyunwoo Kim and Myeonghwa Lee and Seongbo Jang and Seungwon Do and Sunkyoung Kim and Kyungtae Lim and Jongwon Lee and Kyumin Park and Jamin Shin and Seonghyun Kim and Lucy Park and Alice Oh and Jungwoo Ha and Kyunghyun Cho},
      year={2021},
      eprint={2105.09680},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```