## KoRean based ELECTRA (KR-ELECTRA)

This is a release of a Korean-specific ELECTRA model, developed by the Computational Linguistics Lab at Seoul National University, that performs comparably to or better than existing Korean pre-trained models. It is particularly strong on tasks involving informal text, such as review documents, while remaining competitive on other kinds of tasks.

### Released Model

We pre-trained KR-ELECTRA following the base-scale configuration of [ELECTRA](https://github.com/google-research/electra), using TensorFlow v1 on a v3-8 TPU on Google Cloud Platform.

#### Model Details

We followed the training parameters of the base-scale [ELECTRA](https://github.com/google-research/electra) model.

##### Hyperparameters

| model | # of layers | embedding size | hidden size | # of heads |
| ------: | ----------: | -------------: | ----------: | ---------: |
| Discriminator | 12 | 768 | 768 | 12 |
| Generator | 12 | 768 | 256 | 4 |

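In `transformers` terms, the discriminator column corresponds roughly to the `ElectraConfig` sketched below. This is for illustration only; the released checkpoint on Hugging Face ships its own `config.json`.

```python
from transformers import ElectraConfig

# Illustrative reconstruction of the discriminator settings in the table above.
# The released "snunlp/KR-ELECTRA-discriminator" checkpoint already includes its
# own config.json, so this sketch is for exposition, not for loading weights.
config = ElectraConfig(
    vocab_size=30000,        # see the Vocabulary section below
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
)
print(config)
```
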
##### Pretraining

| batch size | train steps | learning rate | max sequence length | generator size |
| ---------: | ----------: | ------------: | ------------------: | -------------: |
| 256 | 700,000 | 2e-4 | 128 | 0.33333 |

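These settings map onto the kind of `--hparams` overrides accepted by the upstream ELECTRA repository's `run_pretraining.py`. The sketch below is a reconstruction for illustration; the key names follow upstream `configure_pretraining.py`, and our actual invocation may have differed.

```python
import json

# Hypothetical hparams mirroring the table above, in the style of
# google-research/electra; reconstructed for illustration only.
hparams = {
    "model_size": "base",
    "train_batch_size": 256,
    "num_train_steps": 700000,
    "learning_rate": 2e-4,
    "max_seq_length": 128,
    "generator_hidden_size": 0.33333,  # generator size relative to the discriminator
    "vocab_size": 30000,
}

# Upstream usage is roughly:
#   python3 run_pretraining.py --data-dir <data_dir> \
#       --model-name KR-ELECTRA --hparams '<json string>'
print(json.dumps(hparams))
```
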
#### Training Dataset

34 GB of Korean text, including Wikipedia documents, news articles, legal texts, news comments, and product reviews. The corpus is balanced, consisting of equal proportions of written and spoken data.

#### Vocabulary

Vocabulary size: 30,000. We used morpheme-based unit tokens, built with the [Mecab-Ko](https://bitbucket.org/eunjeon/mecab-ko-dic/src/master/) morpheme analyzer.

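As a quick sanity check, the released tokenizer (see the download links below) exposes this vocabulary directly; the example sentence is ours, and the exact segmentation may differ:

```python
from transformers import ElectraTokenizer

# Tokenizer for the released discriminator checkpoint (introduced below).
tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")

print(tokenizer.vocab_size)  # expected: 30000
# Inspect how the morpheme-based vocabulary segments an informal sentence.
print(tokenizer.tokenize("리뷰 데이터에 강한 모델입니다."))
```
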
#### Download Link

* TensorFlow v1 model ([download](https://drive.google.com/file/d/1L_yKEDaXM_yDLwHm5QrXAncQZiMN3BBU/view?usp=sharing))

* PyTorch models on Hugging Face:

```python
from transformers import ElectraModel, ElectraTokenizer

# Load the discriminator checkpoint and its matching tokenizer.
model = ElectraModel.from_pretrained("snunlp/KR-ELECTRA-discriminator")
tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
```

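Once loaded, the checkpoint behaves like any `transformers` encoder. A minimal usage sketch (the sentence is illustrative):

```python
import torch

# Encode a sentence and inspect the discriminator's contextual representations.
inputs = tokenizer("이 영화 정말 재미있어요!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Shape: (batch_size, sequence_length, hidden_size=768)
print(outputs.last_hidden_state.shape)
```
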
### Finetuning

We used the fine-tuning code from [KoELECTRA](https://github.com/monologg/KoELECTRA), slightly edited and with additionally adjusted hyperparameters. You can download the code and config files that we used for our model. A minimal sketch of the general recipe is shown below.

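The sketch illustrates the idea with plain `transformers` on an NSMC-style binary sentiment task; it is not the KoELECTRA-based scripts we actually used, and the texts, labels, and hyperparameters are placeholders.

```python
import torch
from transformers import ElectraForSequenceClassification, ElectraTokenizer

# Binary sentiment classification in the style of NSMC (placeholder data).
tokenizer = ElectraTokenizer.from_pretrained("snunlp/KR-ELECTRA-discriminator")
model = ElectraForSequenceClassification.from_pretrained(
    "snunlp/KR-ELECTRA-discriminator", num_labels=2
)

batch = tokenizer(
    ["정말 최고의 영화였다.", "시간이 아까운 영화."],
    padding=True, truncation=True, max_length=128, return_tensors="pt",
)
labels = torch.tensor([1, 0])  # 1 = positive, 0 = negative

# A single illustrative optimization step.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
```
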
#### Experimental Results

| | **NSMC**<br/>(acc) | **Naver NER**<br/>(F1) | **PAWS**<br/>(acc) | **KorNLI**<br/>(acc) | **KorSTS**<br/>(spearman) | **Question Pair**<br/>(acc) | **KorQuAD (Dev)**<br/>(EM/F1) | **Korean-Hate-Speech (Dev)**<br/>(F1) |
| :-------------------- | :----------------: | :--------------------: | :----------------: | :------------------: | :-----------------------: | :-------------------------: | :---------------------------: | :-----------------------------------: |
| KoBERT | 89.59 | 87.92 | 81.25 | 79.62 | 81.59 | 94.85 | 51.75 / 79.15 | 66.21 |
| XLM-Roberta-Base | 89.03 | 86.65 | 82.80 | 80.23 | 78.45 | 93.80 | 64.70 / 88.94 | 64.06 |
| HanBERT | 90.06 | 87.70 | 82.95 | 80.32 | 82.73 | 94.72 | 78.74 / 92.02 | **68.32** |
| KoELECTRA-Base | 90.33 | 87.18 | 81.70 | 80.64 | 82.00 | 93.54 | 60.86 / 89.28 | 66.09 |
| KoELECTRA-Base-v2 | 89.56 | 87.16 | 80.70 | 80.72 | 82.30 | 94.85 | 84.01 / 92.40 | 67.45 |
| KoELECTRA-Base-v3 | 90.63 | **88.11** | **84.45** | 82.24 | **85.53** | 95.25 | 84.83 / **93.45** | 67.61 |
| **KR-ELECTRA (ours)** | **91.168** | 87.90 | 82.05 | **82.51** | 85.41 | **95.51** | **84.93** / 93.04 | **74.50** |

The baseline results are taken from [KoELECTRA](https://github.com/monologg/KoELECTRA).

### Citation

```bibtex
@misc{kr-electra,
  author = {Lee, Sangah and Shin, Hyopil},
  title = {KR-ELECTRA: a KoRean-based ELECTRA model},
  year = {2022},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/snunlp/KR-ELECTRA}}
}
```