GitHub: https://github.com/skplanet/Dialog-KoELECTRA


Dialog-KoELECTRA is a language model specialized for dialogue. It was trained on 22GB of colloquial and written-style Korean text. Dialog-KoELECTRA is based on the ELECTRA model. ELECTRA is a method for self-supervised language representation learning that can pre-train transformer networks using relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.
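
The real-versus-fake objective described above can be illustrated with a toy sketch. It is not the Dialog-KoELECTRA training code: the `corrupt` function is a hypothetical stand-in that swaps in random vocabulary items where ELECTRA would use a small generator network, and the 15% replacement rate is an assumed value. The sketch only shows how per-token labels for the discriminator could be derived.

```python
import random


def corrupt(tokens, vocab, replace_prob=0.15, seed=0):
    """Hypothetical stand-in for the generator: randomly replace some tokens."""
    rng = random.Random(seed)
    corrupted = []
    for tok in tokens:
        if rng.random() < replace_prob:
            # A real generator would sample a plausible token from a learned model.
            corrupted.append(rng.choice(vocab))
        else:
            corrupted.append(tok)
    return corrupted


def discriminator_labels(original, corrupted):
    """Label each position 1 ("fake") if the token changed, else 0 ("real")."""
    return [int(o != c) for o, c in zip(original, corrupted)]


original = ["오늘", "날씨", "어때"]
corrupted = ["오늘", "점심", "어때"]
print(discriminator_labels(original, corrupted))
```

A replacement that happens to reproduce the original token is labeled "real" here, because the labels come from comparing the two sequences rather than from the replacement decision itself.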

Released Models

We are initially releasing a small version of the pre-trained model. The model was trained on Korean text. We hope to release other models, such as base and large versions, in the future.

Model                   Layers  Hidden Size  Params  Max Seq Len  Learning Rate  Batch Size  Train Steps
Dialog-KoELECTRA-Small  12      256          14M     128          1e-4           512         700K
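
As a rough sense of scale, the maximum sequence length, batch size, and step count listed above can be multiplied out. This is a back-of-the-envelope sketch that assumes every sequence in every batch is filled to the maximum length, so the result is an upper bound on token positions processed, not a reported figure.

```python
# Values taken from the Dialog-KoELECTRA-Small row above.
MAX_SEQ_LEN = 128
BATCH_SIZE = 512
TRAIN_STEPS = 700_000

# Assumption: every sequence is packed to MAX_SEQ_LEN (no padding).
tokens_per_step = MAX_SEQ_LEN * BATCH_SIZE
total_token_positions = tokens_per_step * TRAIN_STEPS

print(f"token positions per step: {tokens_per_step:,}")
print(f"token positions overall:  {total_token_positions:,}")
```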

Model Performance

Dialog-KoELECTRA shows strong performance on conversational downstream tasks: Dialog-KoELECTRA-Small scores higher than DistilKoBERT on every reported task.

Model                          Question Pair         Naver NER
DistilKoBERT            88.60  92.48          60.72  84.65      72.00  72.59
Dialog-KoELECTRA-Small  90.01  94.99          68.26  85.51      78.54  78.96

Train Data

         corpus name                 size
dialog   Aihub Korean dialog corpus  7GB
         NIKL Spoken corpus
         Korean chatbot data
written  NIKL Newspaper corpus       15GB


We applied morpheme analysis using huggingface_konlpy when building the vocabulary. In our experiments, this vocabulary performed better than one built without morpheme analysis.

vocabulary size  unused token size  limit alphabet  min frequency
40,000           500                6,000           3
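
How the vocabulary size, unused token count, and min frequency settings could interact is sketched below. This is a simplified illustration, not the actual vocabulary builder: it counts whole morphemes instead of learning subword pieces, it does not model limit alphabet, it takes already-segmented input rather than calling huggingface_konlpy, and the special token names and the choice to count unused tokens inside the vocabulary size are assumptions. `build_vocab` is a hypothetical helper.

```python
from collections import Counter


def build_vocab(morpheme_sentences, vocab_size=40_000, unused_tokens=500,
                min_frequency=3):
    """Toy vocabulary builder over sentences already split into morphemes."""
    # Assumed BERT-style special tokens; the real set may differ.
    special = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]
    # Reserved slots that can later be assigned to new domain tokens.
    unused = [f"[unused{i}]" for i in range(unused_tokens)]

    counts = Counter(tok for sent in morpheme_sentences for tok in sent)
    # Drop morphemes seen fewer than min_frequency times.
    frequent = [tok for tok, n in counts.most_common() if n >= min_frequency]

    # Assumption: special and unused tokens count toward vocab_size.
    budget = max(0, vocab_size - len(special) - len(unused))
    return special + unused + frequent[:budget]


sentences = [["나", "는", "밥"], ["나", "는", "물"], ["나", "는", "밥"]]
vocab = build_vocab(sentences, vocab_size=10, unused_tokens=2, min_frequency=3)
print(vocab)
```

In this toy input, "밥" appears twice and "물" once, so both fall below the frequency threshold and are left out of the vocabulary.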
