---
license: apache-2.0
language: ko
tags:
  - korean
  - lassl
mask_token: "<mask>"
widget:
  - text: 대한민국의 수도는 <mask> 입니다.
---

# LASSL roberta-ko-small
## How to use

```python
from transformers import AutoModel, AutoTokenizer

# Load the pretrained encoder and its tokenizer from the Hugging Face Hub.
model = AutoModel.from_pretrained("lassl/roberta-ko-small")
tokenizer = AutoTokenizer.from_pretrained("lassl/roberta-ko-small")
```
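Because the card declares `<mask>` as the mask token, the checkpoint can also be queried through the fill-mask pipeline. The snippet below is a minimal sketch that reuses the widget sentence from the front matter; exact predictions will depend on the checkpoint.

```python
from transformers import pipeline

# Query the masked-language-modeling head via the fill-mask pipeline.
fill_mask = pipeline("fill-mask", model="lassl/roberta-ko-small")

# "The capital of South Korea is <mask>." (the widget example from the front matter)
print(fill_mask("대한민국의 수도는 <mask> 입니다."))
```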

## Evaluation
`roberta-ko-small` was pretrained on Korean text with the [LASSL](https://github.com/lassl/lassl) framework. The performance below was evaluated on 2021/12/15.

| nsmc | klue_nli | klue_sts | korquadv1 | klue_mrc | avg |
| ---- | -------- | -------- | --------- | ---- | -------- |
| 87.8846 | 66.3086 | 83.8353 | 83.1780 | 42.4585 | 72.7330 |
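
The scores above come from fine-tuning on the respective downstream tasks. As a rough illustration of how such a run can be set up, here is a sketch for NSMC sentiment classification; the dataset id, column names, and hyperparameters are assumptions, not the configuration behind the reported numbers.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("lassl/roberta-ko-small")
model = AutoModelForSequenceClassification.from_pretrained(
    "lassl/roberta-ko-small", num_labels=2
)

# "nsmc" is the assumed Hub id for the NSMC movie-review corpus
# (columns: "document", "label").
dataset = load_dataset("nsmc")
encoded = dataset.map(lambda ex: tokenizer(ex["document"], truncation=True), batched=True)

# Illustrative hyperparameters only.
args = TrainingArguments(
    output_dir="nsmc-finetune",
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=encoded["train"],
    eval_dataset=encoded["test"],
    tokenizer=tokenizer,
)
trainer.train()
```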

## Corpora
This model was trained on 6,860,062 examples (3,512,351,744 tokens) extracted from the corpora listed below. For details of the training setup, see `config.json`.

```bash
corpora/
├── [707M]  kowiki_latest.txt
├── [ 26M]  modu_dialogue_v1.2.txt
├── [1.3G]  modu_news_v1.1.txt
├── [9.7G]  modu_news_v2.0.txt
├── [ 15M]  modu_np_v1.1.txt
├── [1008M]  modu_spoken_v1.2.txt
├── [6.5G]  modu_written_v1.0.txt
└── [413M]  petition.txt
```
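
The paragraph above points to `config.json` for training-related details. One way to inspect that file without cloning the repository is through `AutoConfig`; this is a quick sketch, not part of the original card.

```python
from transformers import AutoConfig

# Fetch and print the published config.json for the checkpoint.
config = AutoConfig.from_pretrained("lassl/roberta-ko-small")
print(config)
```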