
# Dialog-KoELECTRA

GitHub: https://github.com/skplanet/Dialog-KoELECTRA

## Introduction

Dialog-KoELECTRA is a language model specialized for dialogue. It was trained on 22 GB of colloquial- and written-style Korean text data. Dialog-KoELECTRA is based on the ELECTRA model. ELECTRA is a method for self-supervised language representation learning that can pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.


## Released Models

We are initially releasing a small pre-trained model. The model was trained on Korean text. We hope to release other models, such as base/large models, in the future. A minimal loading example follows the table below.

| Model | Layers | Hidden Size | Params | Max Seq Len | Learning Rate | Batch Size | Train Steps |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Dialog-KoELECTRA-Small | 12 | 256 | 14M | 128 | 1e-4 | 512 | 700K |
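The released checkpoint can be loaded with Hugging Face `transformers`. The sketch below is illustrative only: the model id and example sentence are assumptions, so check the hub page for the exact identifier.

```python
from transformers import ElectraModel, ElectraTokenizer

# Assumed model id; verify the exact name on the Hugging Face hub.
model_name = "skplanet/dialog-koelectra-small-discriminator"

tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraModel.from_pretrained(model_name)

# Encode a short Korean sentence and run it through the encoder.
inputs = tokenizer("안녕하세요, 반갑습니다.", return_tensors="pt")
outputs = model(**inputs)

# Hidden size is 256 for the small model (see the table above).
print(outputs.last_hidden_state.shape)
```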

## Model Performance

Dialog-KoELECTRA shows strong performance in conversational downstream tasks.

| Model | NSMC (acc) | Question Pair (acc) | Korean-Hate-Speech (F1) | Naver NER (F1) | KorNLI (acc) | KorSTS (spearman) |
| --- | --- | --- | --- | --- | --- | --- |
| DistilKoBERT | 88.60 | 92.48 | 60.72 | 84.65 | 72.00 | 72.59 |
| Dialog-KoELECTRA-Small | 90.01 | 94.99 | 68.26 | 85.51 | 78.54 | 78.96 |
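To show how results like those above are typically obtained, the sketch below fine-tunes the small model on a single NSMC-style sentiment example. It is a minimal sketch, not the exact training recipe behind the table: the model id, example sentence, label, and learning rate are assumptions.

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Assumed model id; verify the exact name on the Hugging Face hub.
model_name = "skplanet/dialog-koelectra-small-discriminator"

tokenizer = ElectraTokenizer.from_pretrained(model_name)
model = ElectraForSequenceClassification.from_pretrained(model_name, num_labels=2)

# A single NSMC-style (binary sentiment) example; real fine-tuning iterates over the full dataset.
batch = tokenizer(
    ["이 영화 정말 재밌어요"],
    padding=True,
    truncation=True,
    max_length=128,
    return_tensors="pt",
)
labels = torch.tensor([1])  # 1 = positive

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)  # assumed fine-tuning learning rate
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```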

## Train Data

| type | corpus name | size |
| --- | --- | --- |
| dialog | Aihub Korean dialog corpus, NIKL Spoken corpus, Korean chatbot data, KcBERT | 7GB |
| written | NIKL Newspaper corpus, namuwikitext | 15GB |

## Vocabulary

We applied morpheme analysis using huggingface_konlpy when building the vocabulary. In our experiments, this vocabulary performed better than one built without morpheme analysis. A rough sketch of the vocabulary-building pipeline follows the table below.

| vocabulary size | unused token size | limit alphabet | min frequency |
| --- | --- | --- | --- |
| 40,000 | 500 | 6,000 | 3 |
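Below is a minimal sketch of such a pipeline using the settings from the table. It assumes the corpus lives in a plain-text file (the file names are hypothetical), and it uses KoNLPy's Mecab directly for morpheme segmentation to keep the example self-contained, rather than the huggingface_konlpy wrapper used for the released vocabulary.

```python
from konlpy.tag import Mecab
from tokenizers import BertWordPieceTokenizer

# Hypothetical file names; the actual corpus paths are not part of this card.
raw_path, morph_path = "corpus.txt", "corpus_morph.txt"

# Pre-segment each line into morphemes before training the subword vocabulary.
mecab = Mecab()
with open(raw_path, encoding="utf-8") as fin, open(morph_path, "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(" ".join(mecab.morphs(line.strip())) + "\n")

# Train a WordPiece vocabulary with the settings from the table above.
tokenizer = BertWordPieceTokenizer(lowercase=False)
tokenizer.train(
    files=[morph_path],
    vocab_size=40000,
    min_frequency=3,
    limit_alphabet=6000,
    special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    + [f"[unused{i}]" for i in range(500)],  # 500 unused tokens reserved for later use
)
tokenizer.save_model(".")
```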