Dialog-KoELECTRA

GitHub: https://github.com/skplanet/Dialog-KoELECTRA

Introduction

Dialog-KoELECTRA is a language model specialized for dialogue. It was trained on 22GB of colloquial and written Korean text. The model is based on ELECTRA, a method for self-supervised language representation learning that can pre-train transformer networks with relatively little compute. ELECTRA models are trained to distinguish "real" input tokens from "fake" input tokens generated by another neural network, similar to the discriminator of a GAN. At small scale, ELECTRA achieves strong results even when trained on a single GPU.
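
As a rough illustration of the "real" vs "fake" objective, the sketch below builds the per-token labels a discriminator would be trained to predict. The function name `replaced_token_labels` and the example tokens are hypothetical. In actual ELECTRA pre-training the replacements are sampled from a small generator network, not written by hand.

```python
from typing import List


def replaced_token_labels(original: List[str], corrupted: List[str]) -> List[int]:
    """Return 1 where the corrupted token differs from the original, else 0."""
    if len(original) != len(corrupted):
        raise ValueError("sequences must have the same length")
    # A position counts as "fake" only if the token actually changed.
    return [int(o != c) for o, c in zip(original, corrupted)]


# Hypothetical example: the third token has been replaced.
original = ["오늘", "날씨", "좋네요", "."]
corrupted = ["오늘", "날씨", "춥네요", "."]
labels = replaced_token_labels(original, corrupted)
print(labels)  # [0, 0, 1, 0]
```

Comparing tokens instead of tracking which positions were sampled means a replacement that happens to equal the original token is labeled as real.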

Released Models

We are initially releasing the small version of the pre-trained model, which was trained on Korean text. We hope to release other versions, such as base and large models, in the future.

Model	Layers	Hidden Size	Params	Max Seq Len	Learning Rate	Batch Size	Train Steps
Dialog-KoELECTRA-Small	12	256	14M	128	1e-4	512	700K
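
To put the training settings in the table in perspective, the short calculation below estimates an upper bound on the number of token positions processed during pre-training. It assumes that the batch size counts sequences and that every sequence is filled to the maximum length. Neither assumption is stated in the source, so the result is only a ceiling.

```python
# Values taken from the Released Models table.
batch_size = 512       # assumed to be sequences per training step
max_seq_len = 128      # tokens per sequence
train_steps = 700_000  # 700K steps

# Upper bound: every sequence is assumed to be full length (no padding).
token_positions = batch_size * max_seq_len * train_steps
print(f"{token_positions:,}")  # 45,875,200,000
```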

Model Performance

Dialog-KoELECTRA-Small performs well on conversational downstream tasks, scoring higher than DistilKoBERT on all six benchmarks below.

Model	NSMC (acc)	Question Pair (acc)	Korean-Hate-Speech (F1)	Naver NER (F1)	KorNLI (acc)	KorSTS (Spearman)
DistilKoBERT	88.60	92.48	60.72	84.65	72.00	72.59
Dialog-KoELECTRA-Small	90.01	94.99	68.26	85.51	78.54	78.96
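
The per-task gap between the two models can be read off the table with a few lines of Python. The scores are copied from the table. The variable names are illustrative only.

```python
tasks = ["NSMC", "Question Pair", "Korean-Hate-Speech", "Naver NER", "KorNLI", "KorSTS"]
distil_kobert = [88.60, 92.48, 60.72, 84.65, 72.00, 72.59]
dialog_koelectra_small = [90.01, 94.99, 68.26, 85.51, 78.54, 78.96]

# Difference in points (Dialog-KoELECTRA-Small minus DistilKoBERT), rounded to 2 decimals.
gains = {
    task: round(ours - base, 2)
    for task, ours, base in zip(tasks, dialog_koelectra_small, distil_kobert)
}

for task, gain in gains.items():
    print(f"{task}: +{gain:.2f}")
```

The largest gaps are on Korean-Hate-Speech (+7.54), KorNLI (+6.54) and KorSTS (+6.37). The tasks use different metrics, so the differences are not directly comparable across columns.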

Train Data

category	corpus name	size (category total)
dialog	Aihub Korean dialog corpus	7GB
	NIKL Spoken corpus
	Korean chatbot data
	KcBERT
written	NIKL Newspaper corpus	15GB
	namuwikitext

Vocabulary

We applied morpheme analysis with huggingface_konlpy when building the vocabulary. In our experiments, this vocabulary performed better than one built without morpheme analysis.

vocabulary size	unused token size	limit alphabet	min frequency
40,000	500	6,000	3
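
The toy sketch below shows how three of these settings interact: the total vocabulary size, the reserved unused tokens, and the minimum frequency. It is not WordPiece training and does not use the huggingface_konlpy API. The function `build_vocab`, the special-token list, and the hand-split morphemes are all hypothetical. The limit alphabet setting, which caps the number of distinct characters kept in the initial alphabet, is not modeled here.

```python
from collections import Counter
from typing import Iterable, List


def build_vocab(
    tokens: Iterable[str],
    vocab_size: int,
    unused_tokens: int,
    min_frequency: int,
) -> List[str]:
    """Toy builder: special tokens, reserved slots, then frequent tokens."""
    special = ["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"]  # assumed BERT-style set
    reserved = [f"[unused{i}]" for i in range(unused_tokens)]
    counts = Counter(tokens)
    # Keep tokens seen at least min_frequency times, most frequent first.
    frequent = [tok for tok, n in counts.most_common() if n >= min_frequency]
    budget = max(vocab_size - len(special) - len(reserved), 0)
    return special + reserved + frequent[:budget]


# Hypothetical morpheme-split input (split by hand, not by an analyzer).
tokens = ["밥", "먹", "었", "어", "밥", "먹", "자", "밥", "먹", "었", "니"]
vocab = build_vocab(tokens, vocab_size=10, unused_tokens=2, min_frequency=3)
print(vocab)
```

With the settings in the table, the same layout would reserve 500 `[unused…]` slots out of 40,000 entries and drop any candidate seen fewer than 3 times.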