---
language: ko
license: apache-2.0
datasets:
- wordpiece
- everyones-corpus
tags:
- korean
---

# KoELECTRA v3 (Base Discriminator)

## Table of Contents
1. [Model Details](#model-details)
2. [How to Get Started with the Model](#how-to-get-started-with-the-model)
3. [Uses](#uses)
4. [Limitations and Bias](#limitations-and-bias)
5. [Training](#training)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

## Model Details

* **Model Description:** KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (`koelectra-base-v3-discriminator`). [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) uses Replaced Token Detection; that is, the discriminator learns to determine, for each token produced by the generator, whether it is a "real" token or a "fake" (replaced) token. This method learns from all input tokens and yields results competitive with other pretrained language models such as BERT.
* **Developed by:** Jangwon Park
* **Model type:**
* **Language(s):** Korean
* **License:** Apache 2.0
* **Related Models:**
* **Resources for more information:** For more detail, please see the [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).

## How to Get Started with the Model

### Load model and tokenizer

```python
>>> from transformers import ElectraModel, ElectraTokenizer

>>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
```

### Tokenizer example

```python
>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
```

### Example using ElectraForPreTraining

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

sentence = "나는 방금 밥을 먹었다."       # "I just ate."
fake_sentence = "나는 내일 밥을 먹었다."  # "I ate tomorrow." ("방금" replaced by "내일")

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

discriminator_outputs = discriminator(fake_inputs)
# Convert logits to 0/1 predictions (1 = token flagged as replaced/"fake")
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Drop the [CLS] and [SEP] positions so the predictions align with fake_tokens
print(list(zip(fake_tokens, predictions.squeeze().tolist()[1:-1])))
```

## Uses

#### Direct Use

#### Misuse, Malicious Use, and Out-of-Scope Use

## Limitations and Bias

#### Limitations

#### Bias

## Training

KoELECTRA is trained on 34 GB of Korean text. It uses a [WordPiece](https://github.com/monologg/KoELECTRA/blob/master/docs/wordpiece_vocab_EN.md) vocabulary, and the trained model is uploaded to S3.

### Training Data

* **Layers:** 12
* **Embedding Size:** 768
* **Hidden Size:** 768
* **Number of heads:** 12

Vocabulary: a WordPiece vocabulary was used.

|    | Vocab Length | Do Lower Case |
|:--:|:------------:|:-------------:|
| v3 | 35000        | False         |

For v3, an additional 20 GB corpus from Everyone's Corpus (newspaper, written, spoken, messenger, and web text) was used.
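As a sanity check, the architecture and vocabulary settings listed above can be read back from the Hugging Face config and tokenizer. This is a minimal sketch (not part of the original card), assuming the `monologg/koelectra-base-v3-discriminator` checkpoint on the Hub is the one described here:

```python
from transformers import ElectraConfig, ElectraTokenizer

config = ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# Architecture settings listed in this card: 12 layers, 768 embedding/hidden size, 12 heads
print(config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)

# Vocabulary size listed in this card: 35000 WordPiece tokens (no lowercasing)
print(tokenizer.vocab_size)
```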
### Training Procedure

#### Pretraining

* **Batch Size:** 256
* **Training Steps:** 1.5M
* **Learning Rate:** 2e-4
* **Max Sequence Length:** 512
* **Training Time:** 14 days

## Evaluation

#### Results

The model developer discusses fine-tuning results for v3 in comparison to other base models (e.g. XLM-RoBERTa-Base) [in the original repository](https://github.com/monologg/KoELECTRA/blob/master/finetune/README_EN.md).

These results were obtained by running the provided configs as-is; additional hyperparameter tuning may yield better performance.

* **Size:** 421M
* **NSMC (acc):** 90.63
* **Naver NER (F1):** 88.11
* **PAWS (acc):** 84.45
* **KorNLI (acc):** 82.24
* **KorSTS (Spearman):** 85.53
* **Question Pair (acc):** 95.25
* **KorQuAD (Dev) (EM/F1):** 84.83/93.45
* **Korean-Hate-Speech (Dev) (F1):** 67.61

A minimal sketch of loading this checkpoint for such downstream fine-tuning is included at the end of this card.

## Environmental Impact

### KoELECTRA v3 (Base Discriminator) Estimated Emissions

You can estimate carbon emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

* **Hardware Type:** TPU v3-8
* **Hours Used:** 336 hours (14 days)
* **Cloud Provider:** GCP (Google Cloud Platform)
* **Compute Region:** europe-west4-a
* **Carbon Emitted (power consumption x time x carbon intensity of the local power grid):** 54.2 kg of CO2eq

## Citation

```bibtex
@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}
```
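Below is the loading sketch referenced in the Evaluation section. It is not the author's fine-tuning setup: the classification head is freshly initialized and must still be trained on a downstream dataset, and `num_labels=2` is only an assumption matching a binary task such as NSMC sentiment classification.

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Load the discriminator backbone with a new (untrained) classification head.
# num_labels=2 is an assumption for a binary task such as NSMC sentiment classification.
model = ElectraForSequenceClassification.from_pretrained(
    "monologg/koelectra-base-v3-discriminator", num_labels=2
)
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("이 영화 정말 재미있어요.", return_tensors="pt")  # "This movie is really fun."
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
print(logits)
```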