---
language: ko
license: apache-2.0
datasets:
- wordpiece
- everyones-corpus
tags:
- korean
---
KoELECTRA v3 (Base Discriminator)
Table of Contents
- Model Details
- How To Get Started With the Model
- Uses
- Limitations
- Training
- Evaluation Results
- Environmental Impact
- Citation Information
Model Details
- Model Description:
KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (koelectra-base-v3-discriminator). ELECTRA is trained with Replaced Token Detection: a small generator replaces some input tokens, and the discriminator learns to decide, for each token, whether it is a "real" token from the original text or a "fake" token produced by the generator. This approach trains on all input tokens, which yields competitive results compared to other pretrained language models such as BERT; see the short sketch after this list.
- Developed by: Jangwon Park
- Model type:
- Language(s): Korean
- License: Apache 2.0
- Related Models:
- Resources for more information: For more details, please see the original repository.
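The training objective can be made concrete with a toy example. This is only an illustrative sketch of how Replaced Token Detection targets are formed, not the actual pretraining code; the token ids are taken from the tokenizer example below, and the swapped position is made up.
import torch

# Ids of the original sequence and of a corrupted copy in which the generator
# has replaced one token. The discriminator is trained to predict, per position,
# whether the token was replaced (1, "fake") or kept (0, "real").
original_ids  = torch.tensor([2, 11229, 29173, 13352, 25541, 3])
corrupted_ids = torch.tensor([2, 11229, 29173, 17788, 25541, 3])  # one token swapped (hypothetical)

labels = (original_ids != corrupted_ids).long()
print(labels)  # tensor([0, 0, 0, 1, 0, 0])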
How to Get Started with the Model
Load model and tokenizer
>>> from transformers import ElectraModel, ElectraTokenizer
>>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
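Once loaded, the encoder can be used directly to obtain contextual embeddings. This is a minimal sketch for a quick sanity check, not part of the original card:
import torch
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One 768-dimensional vector per token: (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)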
Tokenizer example
>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
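Ids can also be mapped back to tokens or to a detokenized string. A small sketch assuming the tokenizer loaded above; the exact decoded string may differ slightly depending on the tokenizer's clean-up of WordPiece spacing:
ids = [2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]

# Back to WordPiece tokens (including the special tokens).
print(tokenizer.convert_ids_to_tokens(ids))

# Back to a plain string, dropping [CLS] and [SEP].
print(tokenizer.decode(ids, skip_special_tokens=True))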
Example using ElectraForPreTraining
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# Original sentence and a corrupted copy ("방금"/"just now" replaced with "내일"/"tomorrow").
sentence = "나는 방금 밥을 먹었다."
fake_sentence = "나는 내일 밥을 먹었다."

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# The discriminator outputs one logit per token: > 0 means "replaced" (fake), <= 0 means "original".
discriminator_outputs = discriminator(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Drop the [CLS]/[SEP] positions added by encode() so predictions align with fake_tokens.
print(list(zip(fake_tokens, predictions.squeeze().tolist()[1:-1])))
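For downstream tasks such as the classification benchmarks reported under Evaluation, the discriminator checkpoint is typically fine-tuned with a task head. The following is a hypothetical minimal sketch for a binary sentiment task; the num_labels value, the toy sentences, and the single backward pass are assumptions, not the setup used for the reported results:
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "monologg/koelectra-base-v3-discriminator",
    num_labels=2,  # assumed binary task (e.g., positive/negative sentiment)
)

# Toy batch with made-up labels: 1 = positive, 0 = negative.
texts = ["영화가 정말 재미있었다.", "시간이 아까운 영화였다."]
labels = torch.tensor([1, 0])

inputs = tokenizer(texts, padding=True, truncation=True, max_length=512, return_tensors="pt")
outputs = model(**inputs, labels=labels)

outputs.loss.backward()  # an optimizer step would follow in real fine-tuning
print(outputs.loss.item(), outputs.logits.shape)  # logits: (2, 2)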
Uses
Direct Use
Misuse, Malicious Use, and Out-of-Scope Use
Limitations and Bias
Limitations
Bias
Training
KoELECTRA is trained on 34GB of Korean text. It uses a WordPiece vocabulary, and the pretrained model is uploaded to S3.
Training Data
- Layers: 12
- Embedding Size: 768
- Hidden Size: 768
- Number of heads: 12
Vocabulary: a WordPiece vocabulary was used.
| Version | Vocab Length | Do Lower Case |
|---|---|---|
| v3 | 35000 | False |
For v3, an additional 20GB corpus from the Everyone's Corpus was used (newspaper, written, spoken, messenger, and web text). The vocabulary settings above can be checked directly from the released tokenizer, as shown below.
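A small sketch for verifying the vocabulary settings (the printed values are expectations based on this card):
from transformers import ElectraTokenizer

tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
print(tokenizer.vocab_size)     # expected: 35000
print(tokenizer.do_lower_case)  # expected: False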
Training Procedure
Pretraining
- Batch Size: 256
- Training Steps: 1.5M
- LR: 2e-4
- Max Sequence Length: 512
- Training Time: 14 days
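For reference, these values correspond to a standard base-size ELECTRA configuration. Below is a sketch of the matching transformers config, with values taken from this card and all other fields left at library defaults; the released config can also be loaded directly with ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator").
from transformers import ElectraConfig

config = ElectraConfig(
    vocab_size=35000,             # v3 WordPiece vocabulary
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,  # matches the 512-token max sequence length
)
print(config)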
Evaluation
Results
The model developer discusses the fine-tuning results for v3 in comparison to other base models (e.g., XLM-RoBERTa-Base) in their GitHub repository.
These results were obtained with the configuration as-is; additional hyperparameter tuning may yield better performance.
- Size: 421M
- NSMC (acc): 90.63
- Naver NER (F1): 88.11
- PAWS (acc): 84.45
- KorNLI (acc): 82.24
- KorSTS (spearman): 85.53
- Question Pair (acc): 95.25
- KorQuAD (Dev) (EM/F1): 84.83/93.45
- Korean-Hate-Speech (Dev) (F1): 67.61
Environmental Impact
KoELECTRA v3 (Base Discriminator) Estimated Emissions
Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).
- Hardware Type: TPU v3-8
- Hours used: 336 hours (14 days)
- Cloud Provider: GCP (Google Cloud Platform)
- Compute Region: europe-west4-a
- Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 54.2 kg of CO2eq
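As a rough cross-check of the formula above (power consumption x time x carbon intensity of the power grid), the reported numbers imply an average emission rate of roughly 0.16 kg CO2eq per training hour. The sketch below only redoes this arithmetic on the figures given in this card and does not attempt to estimate the TPU's actual power draw:
# Back-of-the-envelope check of the reported estimate (illustrative only).
hours = 336                   # 14 days on a TPU v3-8
reported_kg_co2eq = 54.2

kg_per_hour = reported_kg_co2eq / hours
print(round(kg_per_hour, 3))  # ~0.161 kg CO2eq per hour of training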
Citation
@misc{park2020koelectra,
author = {Park, Jangwon},
title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
year = {2020},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}