---
license: apache-2.0
language:
- ko
tags:
- arxiv:1909.11942
---

# Korean ALBERT

# Dataset
- [AI-HUB](https://www.aihub.or.kr/)
- [National Institute of Korean Language - Modu Corpus](https://kli.korean.go.kr/corpus/main/requestMain.do?lang=ko)
- [Korean News Comments](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)

# Evaluation results
- The code for fine-tuning can be found at [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune); a minimal fine-tuning sketch is included after the results table.

| Model | Size | Average Score | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuaD (Dev)**<br/>(EM/F1) |
|:---------------------- |:----------:|:-------------:|:------------------:|:----------------------:|:------------------:|:--------------------:|:-------------------------:|:---------------------------:|:-----------------------------:|
| KcELECTRA-base | 475M | 84.84 | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | **95.78** | 70.60 / 90.11 |
| KcELECTRA-base-v2022 | 475M | 85.20 | **91.97** | **87.35** | 76.50 | **82.12** | **83.67** | 95.12 | 69.00 / 90.40 |
| KcBERT-Base | 417M | 79.65 | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 |
| KcBERT-Large | 1.2G | 81.33 | 90.68 | 85.53 | 70.15 | 76.99 | 77.49 | 94.06 | 62.16 / 86.64 |
| KoBERT | 351M | 82.21 | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 |
| XLM-Roberta-Base | 1.03G | 84.01 | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 |
| HanBERT | 614M | 86.24 | 90.16 | 87.31 | 82.40 | 80.89 | 83.33 | 94.19 | 78.74 / 92.02 |
| KoELECTRA-Base | 423M | 84.66 | 90.21 | 86.87 | 81.90 | 80.85 | 83.21 | 94.20 | 61.10 / 89.59 |
| KoELECTRA-Base-v2 | 423M | **86.96** | 89.70 | 87.02 | **83.90** | 80.61 | 84.30 | 94.72 | **84.34 / 92.58** |
| DistilKoBERT | 108M | 76.76 | 88.41 | 84.13 | 62.55 | 70.55 | 73.21 | 92.48 | 54.12 / 77.80 |
| **ko-albert-base-v1** | **51M** | 80.46 | 86.83 | 82.26 | 69.95 | 74.17 | 74.48 | 94.06 | 76.08 / 86.82 |
| **ko-albert-large-v1** | **75M** | 82.39 | 86.91 | 83.12 | 76.10 | 76.01 | 77.46 | 94.33 | 77.64 / 87.99 |

*The size of HanBERT is the sum of the BERT model and the tokenizer DB.

*These results were obtained using the default configuration settings. Better performance may be achieved with additional hyperparameter tuning.
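
The benchmark numbers above come from the [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune) scripts with their default settings. As a rough, hypothetical illustration of what one fine-tuning step for a binary classification task in the style of NSMC looks like (this is a minimal sketch, not the KcBERT-Finetune pipeline; the sentences, labels, and hyperparameters below are invented for illustration):

```python
# Minimal fine-tuning sketch for binary sentiment classification (NSMC-style).
# NOT the exact KcBERT-Finetune pipeline; the toy batch below is made up.
import torch
from torch.optim import AdamW
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("lots-o/ko-albert-base-v1")
model = AutoModelForSequenceClassification.from_pretrained(
    "lots-o/ko-albert-base-v1", num_labels=2  # classification head is newly initialized
)

# Toy batch: one positive (1) and one negative (0) movie review.
texts = ["정말 재미있게 본 영화입니다.", "시간이 아까운 영화였어요."]
labels = torch.tensor([1, 0])
inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")

optimizer = AdamW(model.parameters(), lr=2e-5)
model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss is computed internally
outputs.loss.backward()
optimizer.step()
print(float(outputs.loss))
```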

# How to use

```python
from transformers import AutoTokenizer, AutoModel

# Base Model (51M)
tokenizer = AutoTokenizer.from_pretrained("lots-o/ko-albert-base-v1")
model = AutoModel.from_pretrained("lots-o/ko-albert-base-v1")

# Large Model (75M)
tokenizer = AutoTokenizer.from_pretrained("lots-o/ko-albert-large-v1")
model = AutoModel.from_pretrained("lots-o/ko-albert-large-v1")
```
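
Once a tokenizer and model are loaded as above, the encoder can be used for feature extraction. The sketch below is only illustrative; the example sentence and the mean-pooling choice are assumptions, not part of the original model card.

```python
import torch

# Encode an example Korean sentence with the tokenizer/model loaded above.
inputs = tokenizer("한국어 ALBERT 모델 예시 문장입니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state: [batch_size, sequence_length, hidden_size] token embeddings
print(outputs.last_hidden_state.shape)

# One common (but not the only) way to get a sentence vector: mean pooling over tokens.
sentence_embedding = outputs.last_hidden_state.mean(dim=1)
print(sentence_embedding.shape)
```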

# Acknowledgement
- The GCP/TPU environment used for training the ALBERT models was supported by the [TRC](https://sites.research.google/trc/about/) program.

# Reference
- [google-albert](https://github.com/google-research/albert)
- [albert-zh](https://github.com/brightmart/albert_zh)
- [KcBERT](https://github.com/Beomi/KcBERT)
- [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune)