language: ko
license: apache-2.0
datasets:
  - wordpiece
  - everyones-corpus
tags:
  - korean

KoELECTRA v3 (Base Discriminator)

Table of Contents

  1. Model Details
  2. How To Get Started With the Model
  3. Uses
  4. Limitations
  5. Training
  6. Evaluation Results
  7. Environmental Impact
  8. Citation Information

Model Details

  • Model Description: KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (koelectra-base-v3-discriminator). ELECTRA is trained with Replaced Token Detection: tokens corrupted by a small generator are passed to the discriminator, which learns to decide whether each token is a "real" token from the input or a "fake" token produced by the generator. Because this objective covers all input tokens, it trains efficiently and yields results competitive with other pretrained language models such as BERT.
  • Developed by: Jangwon Park
  • Model type: Transformer-based language model (ELECTRA discriminator)
  • Language(s): Korean
  • License: Apache 2.0
  • Related Models:
  • Resources for more information: For more details, please see the original repository.

How to Get Started with the Model

Load model and tokenizer

>>> from transformers import ElectraModel, ElectraTokenizer

>>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
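
Once the model and tokenizer are loaded, the base discriminator can be used as a standard Transformer encoder. The snippet below is a minimal sketch (not part of the original card) showing a forward pass that extracts the contextual hidden states:

import torch
from transformers import ElectraModel, ElectraTokenizer

model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# Encode a Korean sentence and run it through the encoder
inputs = tokenizer("한국어 ELECTRA를 공유합니다.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# last_hidden_state has shape (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)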

Tokenizer example

>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]

Example using ElectraForPreTraining

import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

# An original sentence ("I just ate.") and a corrupted version in which
# 방금 ("just now") is replaced with 내일 ("tomorrow")
sentence = "나는 방금 밥을 먹었다."
fake_sentence = "나는 내일 밥을 먹었다."

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

discriminator_outputs = discriminator(fake_inputs)
# Turn the per-token logits into 0 ("original") / 1 ("replaced") predictions
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Drop the [CLS] and [SEP] predictions so they align with fake_tokens
print(list(zip(fake_tokens, predictions[0].tolist()[1:-1])))

Uses

Direct Use

Misuse, Malicious Use, and Out-of-Scope Use

Limitations and Bias

Limitations

Bias

Training

KoELECTRA is trained on 34GB of Korean text. It uses a WordPiece vocabulary, and the pretrained model is uploaded to S3.

Training Data

  • Layers: 12
  • Embedding Size: 768
  • Hidden Size: 768
  • Number of heads: 12

Vocabulary: a WordPiece vocabulary was used.

  • Version: v3
  • Vocab Length: 35,000
  • Do Lower Case: False

For v3, an additional 20GB of text from the Everyone's Corpus (모두의 말뭉치: newspaper, written, spoken, messenger, and web text) was used.
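
These architecture and vocabulary settings can be checked against the checkpoint published on the Hub. The snippet below is a minimal sketch (not part of the original card) that simply prints the relevant config and tokenizer attributes:

from transformers import ElectraConfig, ElectraTokenizer

config = ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

print(config.num_hidden_layers)    # expected: 12
print(config.embedding_size)       # expected: 768
print(config.hidden_size)          # expected: 768
print(config.num_attention_heads)  # expected: 12
print(tokenizer.vocab_size)        # expected: 35000
print(tokenizer.do_lower_case)     # expected: False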

Training Procedure

Pretraining

  • Batch Size: 256
  • Training Steps: 1.5M
  • LR: 2e-4
  • Max Sequence Length: 512
  • Training Time: 14 days
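
For reference, these settings map roughly onto the hyperparameter payload used by the upstream google-research/electra pretraining script. The dictionary below is an illustrative sketch only; the key names and the exact configuration used for KoELECTRA are assumptions, not taken from the author's training setup:

# Hypothetical hparams in the style of google-research/electra run_pretraining.py;
# not the author's exact configuration.
hparams = {
    "model_size": "base",
    "train_batch_size": 256,
    "learning_rate": 2e-4,
    "num_train_steps": 1500000,
    "max_seq_length": 512,
    "vocab_size": 35000,
    "embedding_size": 768,
}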

Evaluation

Results

The model developer discusses the fine-tuning results for v3 in comparison to other base models (e.g. XLM-RoBERTa-Base) in the original GitHub repository.

These results were obtained by running with the provided configs as-is; additional hyperparameter tuning may yield better performance.

  • Size: 421M
  • NSMC (acc): 90.63
  • Naver NER (F1): 88.11
  • PAWS (acc): 84.45
  • KorNLI (acc): 82.24
  • KorSTS (spearman): 85.53
  • Question Pair (acc): 95.25
  • KorQuAD (Dev) (EM/F1): 84.83/93.45
  • Korean-Hate-Speech (Dev) (F1): 67.61
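
The card does not include a fine-tuning script. As an illustrative sketch (not the author's setup), a classification head for a sentence-level task such as NSMC can be attached with ElectraForSequenceClassification:

from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Illustrative only: NSMC is a binary sentiment task, hence num_labels=2.
# The classification head is randomly initialized and still needs fine-tuning.
model = ElectraForSequenceClassification.from_pretrained(
    "monologg/koelectra-base-v3-discriminator", num_labels=2
)
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

inputs = tokenizer("이 영화 정말 재미있다.", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2]) -- untrained head, logits not yet meaningful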

KoELECTRA v3 (Base Discriminator) Estimated Emissions

You can estimate carbon emissions using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: TPU v3-8
  • Hours used: 336 hours (14 days)
  • Cloud Provider: GCP (Google Cloud Platform)
  • Compute Region: europe-west4-a
  • Carbon Emitted (power consumption × time × carbon intensity of the local power grid): 54.2 kg of CO2eq

Citation

@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}