|
---
license: apache-2.0
language: ko
tags:
- korean
- lassl
mask_token: "<mask>"
widget:
- text: 대한민국의 수도는 <mask> 입니다.
---
|
|
|
# LASSL roberta-ko-small |
|
## How to use |
|
|
|
```python
from transformers import AutoModel, AutoTokenizer

model = AutoModel.from_pretrained("lassl/roberta-ko-small")
tokenizer = AutoTokenizer.from_pretrained("lassl/roberta-ko-small")
```
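Since the model was pretrained with masked language modeling, it can also be loaded through the `fill-mask` pipeline. A minimal sketch using the widget example from this card (the predicted tokens depend on the checkpoint, so none are shown here):

```python
from transformers import pipeline

# Build a fill-mask pipeline for the model (downloads weights on first run).
unmasker = pipeline("fill-mask", model="lassl/roberta-ko-small")

# The widget example from this card: "The capital of South Korea is <mask>."
predictions = unmasker("대한민국의 수도는 <mask> 입니다.")

# Each prediction is a dict with "token_str", "score", and the filled "sequence".
for p in predictions:
    print(p["token_str"], round(p["score"], 4))
```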
|
|
|
## Evaluation |
|
`roberta-ko-small` is a Korean-language model pretrained with the [LASSL](https://github.com/lassl/lassl) framework. The performance below was evaluated on 2021/12/15.
|
|
|
| nsmc | klue_nli | klue_sts | korquadv1 | klue_mrc | avg |
| ---- | -------- | -------- | --------- | -------- | --- |
| 87.8846 | 66.3086 | 83.8353 | 83.1780 | 42.4585 | 72.7330 |
|
|
|
## Corpora |
|
This model was trained on 6,860,062 examples (3,512,351,744 tokens) extracted from the corpora listed below. For details of the training configuration, see `config.json`.
|
|
|
```bash
corpora/
├── [707M] kowiki_latest.txt
├── [ 26M] modu_dialogue_v1.2.txt
├── [1.3G] modu_news_v1.1.txt
├── [9.7G] modu_news_v2.0.txt
├── [ 15M] modu_np_v1.1.txt
├── [1008M] modu_spoken_v1.2.txt
├── [6.5G] modu_written_v1.0.txt
└── [413M] petition.txt
```
|
|
|
|
|
|
|
|