language: ko
license: apache-2.0
datasets:
  - wordpiece
  - everyones-corpus
tags:
  - korean

KoELECTRA v3 (Base Discriminator)

Table of Contents

  1. Model Details
  2. How to Get Started with the Model
  3. Uses
  4. Limitations
  5. Training
  6. Evaluation Results
  7. Environmental Impact
  8. Citation Information

Model Details

  • Model Description: KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA language model for Korean (koelectra-base-v3-discriminator). ELECTRA is trained with Replaced Token Detection: the discriminator looks at each token in a sequence that the generator has partially corrupted and decides whether it is a "real" (original) token or a "fake" (replaced) token. This method lets the model learn from all input tokens, and it gives competitive results compared to other pretrained language models such as BERT.
  • Developed by: Jangwon Park
  • Model type:
  • Language(s): Korean
  • License: Apache 2.0
  • Related Models:
  • Resources for more information: For more detail, please see the original repository.
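
To make the Replaced Token Detection objective from the Model Description concrete, here is a minimal sketch of how per-token labels could be derived. The helper name `rtd_labels` and the whitespace-split token lists are assumptions made for illustration; in real pretraining a generator model samples the replacements and the sequences are WordPiece tokens.

```python
# Toy illustration of Replaced Token Detection (RTD) labels.
# `rtd_labels` is a hypothetical helper, not part of any library.

def rtd_labels(original_tokens, corrupted_tokens):
    """Return 1 where the token differs from the original, 0 where it is unchanged."""
    if len(original_tokens) != len(corrupted_tokens):
        raise ValueError("sequences must have the same length")
    return [int(o != c) for o, c in zip(original_tokens, corrupted_tokens)]

# Whitespace-split sentences used only for illustration (not real tokenizer output).
original = "나는 방금 밥을 먹었다 .".split()
corrupted = "나는 내일 밥을 먹었다 .".split()

labels = rtd_labels(original, corrupted)
print(list(zip(corrupted, labels)))
```

Because every position receives a label, the discriminator gets a training signal from the whole sequence rather than only from a masked subset.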

How to Get Started with the Model

Load model and tokenizer

>>> from transformers import ElectraModel, ElectraTokenizer

>>> model = ElectraModel.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

Tokenizer example

>>> from transformers import ElectraTokenizer
>>> tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")
>>> tokenizer.tokenize("[CLS] 한국어 ELECTRA를 공유합니다. [SEP]")
['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]']
>>> tokenizer.convert_tokens_to_ids(['[CLS]', '한국어', 'EL', '##EC', '##TRA', '##를', '공유', '##합니다', '.', '[SEP]'])
[2, 11229, 29173, 13352, 25541, 4110, 7824, 17788, 18, 3]
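
The `##` prefix in the output above marks a piece that continues the previous piece within the same word. The following is a minimal sketch of the greedy longest-match-first lookup that BERT-style WordPiece tokenizers are generally described as using. The function `wordpiece` and `toy_vocab` are hypothetical and contain only the pieces from the example; the real tokenizer also normalizes text and splits punctuation before this step.

```python
# Sketch of greedy longest-match-first WordPiece splitting for a single word.
# `wordpiece` and `toy_vocab` are illustrative assumptions, not the library code.

def wordpiece(word, vocab, unk="[UNK]"):
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces

toy_vocab = {"한국어", "EL", "##EC", "##TRA", "##를", "공유", "##합니다"}
electra_pieces = wordpiece("ELECTRA를", toy_vocab)
share_pieces = wordpiece("공유합니다", toy_vocab)
print(electra_pieces, share_pieces)
```

With this toy vocabulary the sketch reproduces the splits shown in the tokenizer example for those two words.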

Example using ElectraForPreTraining

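import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

# Load the discriminator (with its replaced-token-detection head) and the matching tokenizer.
discriminator = ElectraForPreTraining.from_pretrained("monologg/koelectra-base-v3-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("monologg/koelectra-base-v3-discriminator")

sentence = "나는 방금 밥을 먹었다."       # original sentence: "I just ate."
fake_sentence = "나는 내일 밥을 먹었다."  # "방금" (just now) replaced with "내일" (tomorrow)

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# Inference only, so gradient tracking is not needed.
with torch.no_grad():
    discriminator_outputs = discriminator(fake_inputs)

# Positive logit -> 1.0 (predicted replaced); otherwise 0.0 (predicted original).
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# predictions has shape (1, seq_len): take the first row, then drop [CLS] and [SEP].
print(list(zip(fake_tokens, predictions[0].tolist()[1:-1])))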
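
The expression `torch.round((torch.sign(logits) + 1) / 2)` in the example above turns each logit into 1 when it is positive and 0 otherwise, which is the same as thresholding the sigmoid probability at 0.5. The sketch below shows that equivalence in plain Python. The logit values are made up for illustration and are not real model output.

```python
import math

def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)

def predict_from_sign(logit):
    # Mirrors round((sign(logit) + 1) / 2) from the example above.
    return round((sign(logit) + 1) / 2)

def predict_from_sigmoid(logit):
    # Probability that the token was replaced, thresholded at 0.5.
    prob_replaced = 1.0 / (1.0 + math.exp(-logit))
    return int(prob_replaced > 0.5)

example_logits = [-3.2, 4.1, -0.7, -2.5]  # hypothetical values, one per token
sign_preds = [predict_from_sign(x) for x in example_logits]
sigmoid_preds = [predict_from_sigmoid(x) for x in example_logits]
print(sign_preds, sigmoid_preds)
```

A prediction of 1 means the discriminator considers the token replaced, and 0 means it considers the token original.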

Uses

Direct Use

Misuse, Malicious Use, and Out-of-Scope Use

Limitations and Bias

Limitations

Bias

Training

KoELECTRA is trained on 34 GB of Korean text. It uses a WordPiece vocabulary, and the model is uploaded to S3.

Training Data

  • Layers: 12
  • Embedding Size: 768
  • Hidden Size: 768
  • Number of heads: 12

Vocabulary: A WordPiece vocabulary was used.

  • Version: v3
  • Vocab Length: 35000
  • Do Lower Case: False

For v3, an additional 20 GB of text from Everyone's Corpus was used (newspaper, written, spoken, messenger, and web text).

Training Procedure

Pretraining

  • Batch Size: 256
  • Training Steps: 1.5M
  • LR: 2e-4
  • Max Sequence Length: 512
  • Training Time: 14 days
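
As a rough sanity check on the pretraining settings above, the listed values can be combined into a few derived quantities. This is back-of-the-envelope arithmetic only: the token count is an upper bound that assumes every sequence is filled to the maximum length, and the throughput assumes the 14 days were spent entirely on training at a constant rate. Neither assumption is stated in the model card.

```python
# Values copied from the pretraining list above.
batch_size = 256
training_steps = 1_500_000
max_seq_len = 512
training_days = 14

# Total sequences processed over the whole run.
sequences_seen = batch_size * training_steps

# Upper bound on token positions (assumes every sequence has 512 tokens).
max_token_positions = sequences_seen * max_seq_len

# Average optimizer steps per second (assumes a constant rate over 14 days).
steps_per_second = training_steps / (training_days * 24 * 3600)

print(sequences_seen)         # 384000000
print(max_token_positions)    # 196608000000
print(round(steps_per_second, 2))
```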

Evaluation

Results

The model developer reports fine-tuning results for v3, compared with other base models such as XLM-Roberta-Base, in the original repository.

These results were obtained with the default config as-is; additional hyperparameter tuning may yield better performance.

  • Size: 421M
  • NSMC (acc): 90.63
  • Naver NER (F1): 88.11
  • PAWS (acc): 84.45
  • KorNLI (acc): 82.24
  • KorSTS (spearman): 85.53
  • Question Pair (acc): 95.25
  • KorQuAD (Dev) (EM/F1): 84.83/93.45
  • Korean-Hate-Speech (Dev) (F1): 67.61

KoELECTRA v3 (Base Discriminator) Estimated Emissions

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019).

  • Hardware Type: TPU v3-8
  • Hours used: 336 hours (14 days)
  • Cloud Provider: GCP (Google Cloud Platform)
  • Compute Region: europe-west4-a
  • Carbon Emitted (Power consumption x Time x Carbon produced based on location of power grid): 54.2 kg of CO2eq
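
The formula in the last bullet (power consumption x time x carbon intensity of the grid) can be sketched as a small function. The model card does not state the power draw or the grid carbon intensity that were used, so the function takes them as parameters and no values are assumed for them here. The only quantity derived below is the average emission rate implied by the reported total and hours.

```python
def estimate_emissions_kg(power_kw, hours, carbon_intensity_kg_per_kwh):
    """Estimate emissions in kg CO2eq as power (kW) x time (h) x intensity (kg CO2eq/kWh).

    Hypothetical helper for illustration; the inputs are not given in the model card.
    """
    return power_kw * hours * carbon_intensity_kg_per_kwh

# Figures reported above.
hours_used = 14 * 24   # 336 hours
reported_kg = 54.2     # kg CO2eq

# Average emission rate implied by the reported numbers.
implied_kg_per_hour = reported_kg / hours_used
print(hours_used, round(implied_kg_per_hour, 3))
```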

Citation

@misc{park2020koelectra,
  author = {Park, Jangwon},
  title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/monologg/KoELECTRA}}
}