+---
+license: apache-2.0
+language:
+- ko
+---
+# Korean ALBERT
+# Dataset
+- [AI-HUB](https://www.aihub.or.kr/)
+- [National Institute of Korean Language (국립국어원) - Modu Corpus (모두의 말뭉치)](https://kli.korean.go.kr/corpus/main/requestMain.do?lang=ko)
+- [Korean News Comments](https://www.kaggle.com/junbumlee/kcbert-pretraining-corpus-korean-news-comments)
+# Evaluation results
+- The fine-tuning code is available at [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune).
+|                        |    Size    | Average Score | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(Spearman) | **Question Pair**<br/>(acc) | **KorQuAD (Dev)**<br/>(EM/F1) |
+|:---------------------- |:----------:|:-------------:|:------------------:|:----------------------:|:------------------:|:--------------------:|:-------------------------:|:---------------------------:|:-----------------------------:|
+| KcELECTRA-base         |    475M    |     84.84     |       91.71        |         86.90          |       74.80        |        81.65         |           82.65           |          **95.78**          |         70.60 / 90.11         |
+| KcELECTRA-base-v2022   |    475M    |     85.20     |     **91.97**      |       **87.35**        |       76.50        |      **82.12**       |         **83.67**         |            95.12            |         69.00 / 90.40         |
+| KcBERT-Base            |    417M    |     79.65     |       89.62        |         84.34          |       66.95        |        74.85         |           75.57           |            93.93            |         60.25 / 84.39         |
+| KcBERT-Large           |    1.2G    |     81.33     |       90.68        |         85.53          |       70.15        |        76.99         |           77.49           |            94.06            |         62.16 / 86.64         |
+| KoBERT                 |    351M    |     82.21     |       89.63        |         86.11          |       80.65        |        79.00         |           79.64           |            93.93            |         52.81 / 80.27         |
+| XLM-Roberta-Base       |   1.03G    |     84.01     |       89.49        |         86.26          |       82.95        |        79.92         |           79.09           |            93.53            |         64.70 / 88.94         |
+| HanBERT                |    614M    |     86.24     |       90.16        |         87.31          |       82.40        |        80.89         |           83.33           |            94.19            |         78.74 / 92.02         |
+| KoELECTRA-Base         |    423M    |     84.66     |       90.21        |         86.87          |       81.90        |        80.85         |           83.21           |            94.20            |         61.10 / 89.59         |
+| KoELECTRA-Base-v2      |    423M    |   **86.96**   |       89.70        |         87.02          |     **83.90**      |        80.61         |           84.30           |            94.72            |       **84.34 / 92.58**       |
+| DistilKoBERT           |    108M    |     76.76     |       88.41        |         84.13          |       62.55        |        70.55         |           73.21           |            92.48            |         54.12 / 77.80         |
+| **ko-albert-base-v1**  |  **51M**   |     80.46     |       86.83        |         82.26          |       69.95        |        74.17         |           74.48           |            94.06            |         76.08 / 86.82         |
+| **ko-albert-large-v1** |  **75M**   |     82.39     |       86.91        |         83.12          |       76.10        |        76.01         |           77.46           |            94.33            |         77.64 / 87.99         |
+*The size of HanBERT is the sum of the BERT model and the tokenizer DB.
+*These results were obtained using the default configuration settings. Better performance may be achieved with additional hyperparameter tuning.
+# How to use
+```python
+from transformers import AutoTokenizer, AutoModel
+
+# Load one of the two checkpoints; running both pairs keeps only the large model.
+# Base model (51M)
+tokenizer = AutoTokenizer.from_pretrained("lots-o/ko-albert-base-v1")
+model = AutoModel.from_pretrained("lots-o/ko-albert-base-v1")
+
+# Large model (75M)
+tokenizer = AutoTokenizer.from_pretrained("lots-o/ko-albert-large-v1")
+model = AutoModel.from_pretrained("lots-o/ko-albert-large-v1")
+```
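The snippet above only loads the tokenizer and model. As a rough sketch of one way to turn the outputs into sentence vectors, the example below mean-pools the last hidden states over non-padding tokens. Mean pooling is an assumption made here for illustration, not a method prescribed by this model card, and the helper names `mean_pool` and `embed` are hypothetical.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]], mask: Sequence[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, skipping padding."""
    kept = [vec for vec, keep in zip(token_vectors, mask) if keep]
    if not kept:
        raise ValueError("mask selects no tokens")
    # zip(*kept) iterates over hidden dimensions, one column at a time.
    return [sum(column) / len(kept) for column in zip(*kept)]


def embed(sentences: List[str], model_name: str = "lots-o/ko-albert-large-v1") -> List[List[float]]:
    """Return one mean-pooled vector per sentence (hypothetical helper)."""
    # Heavy imports are deferred so mean_pool stays usable without them.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout so repeated calls give the same output

    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden_size)

    return [
        mean_pool(vectors, mask)
        for vectors, mask in zip(hidden.tolist(), batch["attention_mask"].tolist())
    ]
```

Calling `embed(["안녕하세요", "반갑습니다"])` should return two vectors of the model's hidden size. For downstream tasks such as those in the table above, fine-tuning is likely to work better than using pooled vectors from the pretrained model directly.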
+# Acknowledgement
+- The GCP/TPU environment used to train the ALBERT model was supported by the [TPU Research Cloud (TRC)](https://sites.research.google/trc/about/) program.
+# Reference
+## Paper
+- [ALBERT: A Lite BERT for Self-supervised Learning of Language Representations](https://arxiv.org/abs/1909.11942)
+## GitHub Repos
+- [google-albert](https://github.com/google-research/albert)
+- [albert-zh](https://github.com/brightmart/albert_zh)
+- [KcBERT](https://github.com/Beomi/KcBERT)
+- [KcBERT-Finetune](https://github.com/Beomi/KcBERT-finetune)