
TUNiB-Electra

We release several new versions of the ELECTRA model, which we name TUNiB-Electra. There are two motivations. First, all existing pre-trained Korean encoder models are monolingual; that is, they have knowledge of Korean only. Our bilingual models are based on balanced corpora of Korean and English. Second, we want new off-the-shelf models trained on much more text. To this end, we collected a large amount of Korean text from various sources, such as blog posts, comments, news, and web novels, totaling 100 GB.

How to use

You can use this model directly with the transformers library:

from transformers import AutoModel, AutoTokenizer

# Base Model (Korean-English bilingual model)
tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
model = AutoModel.from_pretrained('tunib/electra-ko-en-base')
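Building on the loading snippet above, the following is a minimal sketch of one common way to turn the encoder output into fixed-size sentence vectors. Mean pooling over non-padding tokens is an assumption made here for illustration, not a method described in this model card, and the helper names `mean_pool` and `embed` are hypothetical. It assumes PyTorch is installed.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "tunib/electra-ko-en-base"


def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    Hypothetical helper: mean pooling is an illustrative choice,
    not something prescribed by the model card.
    """
    # (batch, seq_len) -> (batch, seq_len, 1) so it broadcasts over the hidden dim
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed(sentences, model_name=MODEL_NAME):
    """Return one pooled vector per input sentence (downloads the model on first use)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic inference
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)
    return mean_pool(outputs.last_hidden_state, batch["attention_mask"])
```

Calling `embed(["tunib is a natural language processing tech startup."])` would return a tensor with one row per sentence. For downstream tasks such as those in the results below, fine-tuning a task-specific head is the more usual route.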

Tokenizer example

>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained('tunib/electra-ko-en-base')
>>> tokenizer.tokenize("tunib is a natural language processing tech startup.")
['tun', '##ib', 'is', 'a', 'natural', 'language', 'processing', 'tech', 'startup', '.']
>>> tokenizer.tokenize("튜닙은 자연어처리 테크 스타트업입니다.")
['튜', '##닙', '##은', '자연', '##어', '##처리', '테크', '스타트업', '##입니다', '.']
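In the tokenizer output above, pieces that start with `##` continue the preceding token, following the BERT-style WordPiece convention. As a rough illustration, the sketch below merges such pieces back into whole words; `merge_wordpieces` is a hypothetical helper written for this example, not part of the transformers library.

```python
def merge_wordpieces(tokens):
    """Join '##'-prefixed continuation pieces onto the preceding token.

    Hypothetical helper for illustration; assumes the BERT-style
    convention in which '##' marks a word-internal piece.
    """
    words = []
    for token in tokens:
        if token.startswith("##") and words:
            # Continuation piece: append it to the previous word without the marker
            words[-1] += token[2:]
        else:
            # Start of a new word (or punctuation)
            words.append(token)
    return words


tokens = ['tun', '##ib', 'is', 'a', 'natural', 'language',
          'processing', 'tech', 'startup', '.']
print(merge_wordpieces(tokens))
# ['tunib', 'is', 'a', 'natural', 'language', 'processing', 'tech', 'startup', '.']
```

For BERT-style tokenizers, the library's own `tokenizer.convert_tokens_to_string(tokens)` performs a similar merge and returns a single string.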

Results on Korean downstream tasks

| Model | # Params | Avg. | NSMC (acc) | Naver NER (F1) | PAWS (acc) | KorNLI (acc) | KorSTS (spearman) | Question Pair (acc) | KorQuAD (Dev) (EM/F1) | Korean-Hate-Speech (Dev) (F1) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-base | 110M | 85.99 | 90.95 | 87.63 | 84.65 | 82.27 | 85.00 | 95.77 | 64.01 / 90.32 | 71.40 |
| TUNiB-Electra-ko-en-base | 133M | 85.34 | 90.59 | 87.25 | 84.90 | 80.43 | 83.81 | 94.85 | 83.09 / 92.06 | 68.83 |
| KoELECTRA-base-v3 | 110M | 85.92 | 90.63 | 88.11 | 84.45 | 82.24 | 85.53 | 95.25 | 84.83 / 93.45 | 67.61 |
| KcELECTRA-base | 124M | 84.75 | 91.71 | 86.90 | 74.80 | 81.65 | 82.65 | 95.78 | 70.60 / 90.11 | 74.49 |
| KoBERT-base | 90M | 84.17 | 89.63 | 86.11 | 80.65 | 79.00 | 79.64 | 93.93 | 52.81 / 80.27 | 66.21 |
| KcBERT-base | 110M | 81.37 | 89.62 | 84.34 | 66.95 | 74.85 | 75.57 | 93.93 | 60.25 / 84.39 | 68.77 |
| XLM-Roberta-base | 280M | 85.74 | 89.49 | 86.26 | 82.95 | 79.92 | 79.09 | 93.53 | 64.70 / 88.94 | 64.06 |

Results on English downstream tasks

| Model | # Params | Avg. | CoLA (MCC) | SST (Acc) | MRPC (Acc) | STS (Spearman) | QQP (Acc) | MNLI (Acc) | QNLI (Acc) | RTE (Acc) |
|---|---|---|---|---|---|---|---|---|---|---|
| TUNiB-Electra-ko-en-base | 133M | 85.2 | 65.36 | 92.09 | 88.97 | 90.61 | 90.91 | 85.32 | 91.51 | 76.53 |
| ELECTRA-base | 110M | 85.7 | 64.6 | 96.0 | 88.1 | 90.2 | 89.5 | 88.5 | 93.1 | 75.2 |
| BERT-base | 110M | 80.8 | 52.1 | 93.5 | 84.8 | 85.8 | 89.2 | 84.6 | 90.5 | 66.4 |
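The model card does not state how the Avg. column is computed. As a quick sanity check, and under the assumption that it is an unweighted mean of the eight English task scores, the TUNiB-Electra-ko-en-base row can be reproduced as follows (the scores are copied from the table above; the variable names are illustrative):

```python
from statistics import mean

# English task scores for TUNiB-Electra-ko-en-base, copied from the results table
scores = {
    "CoLA": 65.36,
    "SST": 92.09,
    "MRPC": 88.97,
    "STS": 90.61,
    "QQP": 90.91,
    "MNLI": 85.32,
    "QNLI": 91.51,
    "RTE": 76.53,
}

# Assumption: Avg. is the unweighted mean over the eight tasks
average = mean(scores.values())
print(round(average, 1))  # 85.2
```

This matches the reported 85.2 for this row; the other rows were not re-derived here.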