---
language: ko
license: apache-2.0
datasets:
- wordpiece
- everyones-corpus
tags:
- korean
---

# KoELECTRA v3 (Base Discriminator)

## Table of Contents
1. [Model Details](#model-details)
2. [How to Get Started with the Model](#how-to-get-started-with-the-model)
3. [Uses](#uses)
4. [Limitations and Bias](#limitations-and-bias)
5. [Training](#training)
6. [Evaluation](#evaluation)
7. [Environmental Impact](#environmental-impact)
8. [Citation](#citation)

## Model Details

* **Model Description:** KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (`koelectra-base-v3-discriminator`). [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) uses Replaced Token Detection; that is, the discriminator learns to determine, for each token produced by the generator, whether it is a "real" token or a "fake" (replaced) token. This method learns from all input tokens and yields results competitive with other pretrained language models such as BERT.
* **Developed by:** Jangwon Park
* **Model type:**
* **Language(s):** Korean
* **License:** Apache 2.0
* **Related Models:**
* **Resources for more information:** For more detail, please see the [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).

## How to Get Started with the Model

### Load model and tokenizer

```python
>>> from transformers import ElectraModel, ElectraTokenizer

>>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
```

### Tokenizer example

```python
>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
```

### Example using ElectraForPreTraining

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

sentence = "나는 방금 밥을 먹었다."       # "I just ate."
fake_sentence = "나는 내일 밥을 먹었다."  # "I ate tomorrow." ("방금" replaced by "내일")

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

discriminator_outputs = discriminator(fake_inputs)
# Convert logits to 0/1 predictions (1 = token flagged as replaced/"fake")
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Drop the [CLS] and [SEP] positions so the predictions align with fake_tokens
print(list(zip(fake_tokens, predictions.squeeze().tolist()[1:-1])))
```

## Uses

#### Direct Use

#### Misuse, Malicious Use, and Out-of-Scope Use

## Limitations and Bias

#### Limitations

#### Bias

## Training

KoELECTRA is trained on 34 GB of Korean text. It uses a [WordPiece](https://github.com/monologg/KoELECTRA/blob/master/docs/wordpiece_vocab_EN.md) vocabulary, and the trained model is uploaded to S3.

### Training Data

* **Layers:** 12
* **Embedding Size:** 768
* **Hidden Size:** 768
* **Number of heads:** 12

Vocabulary: a WordPiece vocabulary was used.

|    | Vocab Length | Do Lower Case |
|:--:|:------------:|:-------------:|
| v3 | 35000        | False         |

For v3, an additional 20 GB corpus from Everyone's Corpus (newspaper, written, spoken, messenger, and web text) was used.
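As a sanity check, the architecture and vocabulary settings listed above can be read back from the Hugging Face config and tokenizer. This is a minimal sketch (not part of the original card), assuming the `monologg/koelectra-base-v3-discriminator` checkpoint on the Hub is the one described here:

```python
from transformers import ElectraConfig, ElectraTokenizer

config = ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# Architecture settings listed in this card: 12 layers, 768 embedding/hidden size, 12 heads
print(config.num_hidden_layers, config.embedding_size, config.hidden_size, config.num_attention_heads)

# Vocabulary size listed in this card: 35000 WordPiece tokens (no lowercasing)
print(tokenizer.vocab_size)
```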
### Training Procedure

#### Pretraining

* **Batch Size:** 256
* **Training Steps:** 1.5M
* **Learning Rate:** 2e-4
* **Max Sequence Length:** 512
* **Training Time:** 14 days

## Evaluation

#### Results

The model developer discusses fine-tuning results for v3 in comparison to other base models (e.g. XLM-RoBERTa-Base) [in the original repository](https://github.com/monologg/KoELECTRA/blob/master/finetune/README_EN.md).

These results were obtained by running the provided configs as-is; additional hyperparameter tuning may yield better performance.

* **Size:** 421M
* **NSMC (acc):** 90.63
* **Naver NER (F1):** 88.11
* **PAWS (acc):** 84.45
* **KorNLI (acc):** 82.24
* **KorSTS (Spearman):** 85.53
* **Question Pair (acc):** 95.25
* **KorQuAD (Dev) (EM/F1):** 84.83/93.45
* **Korean-Hate-Speech (Dev) (F1):** 67.61

A minimal sketch of loading this checkpoint for such downstream fine-tuning is included at the end of this card.

## Environmental Impact

### KoELECTRA v3 (Base Discriminator) Estimated Emissions

You can estimate carbon emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

* **Hardware Type:** TPU v3-8
* **Hours Used:** 336 hours (14 days)
* **Cloud Provider:** GCP (Google Cloud Platform)
* **Compute Region:** europe-west4-a
* **Carbon Emitted (power consumption x time x carbon intensity of the local power grid):** 54.2 kg of CO2eq

## Citation

```bibtex
@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}
```
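Below is the loading sketch referenced in the Evaluation section. It is not the author's fine-tuning setup: the classification head is freshly initialized and must still be trained on a downstream dataset, and `num_labels=2` is only an assumption matching a binary task such as NSMC sentiment classification.

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Load the discriminator backbone with a new (untrained) classification head.
# num_labels=2 is an assumption for a binary task such as NSMC sentiment classification.
model = ElectraForSequenceClassification.from_pretrained(
    "monologg/koelectra-base-v3-discriminator", num_labels=2
)
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("이 영화 정말 재미있어요.", return_tensors="pt")  # "This movie is really fun."
with torch.no_grad():
    logits = model(**inputs).logits  # shape (1, 2); meaningful only after fine-tuning
print(logits)
```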