Ezi committed on
Commit
da54619
1 Parent(s): 68b30cd

Update README.md


Hi! This PR has some optional additions to your model card, based on the format we are using as part of our effort to standardise model cards at Hugging Face. Your additional input regarding KoELECTRA v3 uses (direct and indirect: Misuse, Malicious Use, and Out-of-Scope Use) would be appreciated.
Feel free to merge if you are OK with the changes! (cc @Marissa @Meg)

Files changed (1):
  1. README.md +115 -4
README.md CHANGED
@@ -1,17 +1,38 @@
 ---
 language: ko
 license: apache-2.0
+ datasets:
+ - wordpiece
+ - everyones-corpus
 tags:
 - korean
 ---
 
- # KoELECTRA v3 (Base Discriminator)
-
- Pretrained ELECTRA Language Model for Korean (`koelectra-base-v3-discriminator`)
-
- For more detail, please see [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).
-
- ## Usage
+ # KoELECTRA v3 (Base Discriminator)
+
+ ## Table of Contents
+ 1. [Model Details](#model-details)
+ 2. [How To Get Started With the Model](#how-to-get-started-with-the-model)
+ 3. [Uses](#uses)
+ 4. [Limitations](#limitations)
+ 5. [Training](#training)
+ 6. [Evaluation Results](#evaluation-results)
+ 7. [Environmental Impact](#environmental-impact)
+ 8. [Citation Information](#citation-information)
+
+ ## Model Details
+ * **Model Description:**
+ KoELECTRA v3 (Base Discriminator) is a pretrained ELECTRA Language Model for Korean (`koelectra-base-v3-discriminator`). [ELECTRA](https://openreview.net/pdf?id=r1xMH1BtvB) uses Replaced Token Detection; in other words, the discriminator learns by looking at each token produced by the generator and determining whether it is a "real" token or a "fake" (replaced) token. This method allows training on all input tokens, which gives results competitive with other pretrained language models (BERT, etc.).
+ * **Developed by:** Jangwon Park
+ * **Model type:**
+ * **Language(s):** Korean
+ * **License:** Apache 2.0
+ * **Related Models:**
+ * **Resources for more information:** For more detail, please see the [original repository](https://github.com/monologg/KoELECTRA/blob/master/README_EN.md).
+
+ ## How to Get Started with the Model
 
 ### Load model and tokenizer
 
@@ -53,3 +74,93 @@ predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)
 
 print(list(zip(fake_tokens, predictions.tolist()[1:-1])))
 ```
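The middle of the usage example (between the two hunks above) is collapsed in this diff view. For reference, a minimal sketch of loading the model and tokenizer and running replaced token detection with the `transformers` library, consistent with the `predictions` and `print` lines shown above, might look like the following (the Korean sentence is an illustrative placeholder, not necessarily the one in the card):

```python
import torch
from transformers import ElectraForPreTraining, ElectraTokenizer

# Load the KoELECTRA v3 discriminator and its tokenizer from the Hugging Face Hub
model_name = "monologg/koelectra-base-v3-discriminator"
discriminator = ElectraForPreTraining.from_pretrained(model_name)
tokenizer = ElectraTokenizer.from_pretrained(model_name)

# Illustrative sentence in which one token has been replaced ("내일" = "tomorrow")
fake_sentence = "나는 내일 밥을 먹었다."

fake_tokens = tokenizer.tokenize(fake_sentence)
fake_inputs = tokenizer.encode(fake_sentence, return_tensors="pt")

# The discriminator emits one logit per position; a positive logit means it judges
# that token to have been replaced ("fake")
discriminator_outputs = discriminator(fake_inputs)
predictions = torch.round((torch.sign(discriminator_outputs[0]) + 1) / 2)

# Drop the batch dimension and the [CLS]/[SEP] positions before pairing with tokens
print(list(zip(fake_tokens, predictions[0].tolist()[1:-1])))
```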
+
+ ## Uses
+
+ #### Direct Use
+
+ #### Misuse, Malicious Use, and Out-of-Scope Use
+
+
+ ## Limitations and Bias
+
+ #### Limitations
+
+ #### Bias
+
+
+ ## Training
+ KoELECTRA is trained on 34 GB of Korean text.
+ KoELECTRA uses a [WordPiece](https://github.com/monologg/KoELECTRA/blob/master/docs/wordpiece_vocab_EN.md) vocabulary, and the model is uploaded to S3.
+
+ ### Training Data
+
+ * **Layers:** 12
+ * **Embedding Size:** 768
+ * **Hidden Size:** 768
+ * **Number of heads:** 12
+
+ Vocabulary: a "WordPiece" vocabulary was used
+
+ |    | Vocab Length | Do Lower Case |
+ |:-:|:-------------:|:----------------:|
+ | V3 | 35000 | False |
+
+ For v3, a 20 GB corpus from Everyone's Corpus (newspaper, written, spoken, messenger, and web text) was additionally used.
+
+ ### Training Procedure
+
+ #### Pretraining
+
+ * **Batch Size:** 256
+ * **Training Steps:** 1.5M
+ * **LR:** 2e-4
+ * **Max Sequence Length:** 512
+ * **Training Time:** 14 days
+
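As a rough illustration only, here is how the architecture and vocabulary numbers quoted above map onto the standard `transformers.ElectraConfig` fields; the checkpoint's own `config.json` on the Hub remains the authoritative source, so treat this as a sketch rather than the published configuration:

```python
from transformers import ElectraConfig

# Sketch of a config matching the bullets above: 12 layers, 768 embedding/hidden size,
# 12 attention heads, a 35000-token WordPiece vocabulary, and 512 max sequence length.
sketch_config = ElectraConfig(
    vocab_size=35000,
    embedding_size=768,
    hidden_size=768,
    num_hidden_layers=12,
    num_attention_heads=12,
    max_position_embeddings=512,
)

# The published checkpoint ships its own configuration, which can be compared directly
hub_config = ElectraConfig.from_pretrained("monologg/koelectra-base-v3-discriminator")
print(hub_config.num_hidden_layers, hub_config.hidden_size, hub_config.vocab_size)
```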
+
+ ## Evaluation
+
+ #### Results
+ The model developer discusses the fine-tuning results for v3 in comparison to other base models, e.g. XLM-RoBERTa-Base, [in their git repository](https://github.com/monologg/KoELECTRA/blob/master/finetune/README_EN.md).
+
+ These are the results of running with the config as-is; additional hyperparameter tuning may yield better performance.
+
+ * **Size:** 421M
+ * **NSMC (acc):** 90.63
+ * **Naver NER (F1):** 88.11
+ * **PAWS (acc):** 84.45
+ * **KorNLI (acc):** 82.24
+ * **KorSTS (spearman):** 85.53
+ * **Question Pair (acc):** 95.25
+ * **KorQuAD (Dev) (EM/F1):** 84.83/93.45
+ * **Korean-Hate-Speech (Dev) (F1):** 67.61
+
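The scores above come from fine-tuning the discriminator on each downstream task. A minimal, hedged sketch of how such a fine-tuning setup typically starts with `transformers` (dataset loading and the training loop are omitted; `num_labels=2` is chosen here to match a binary task such as NSMC sentiment classification, and the example review is an illustrative placeholder):

```python
from transformers import ElectraForSequenceClassification, ElectraTokenizer

model_name = "monologg/koelectra-base-v3-discriminator"
tokenizer = ElectraTokenizer.from_pretrained(model_name)

# A classification head is placed on top of the pretrained encoder; the head itself
# is randomly initialised and must be trained on the downstream task.
model = ElectraForSequenceClassification.from_pretrained(model_name, num_labels=2)

# One illustrative input; real fine-tuning batches the full dataset and runs an
# optimiser (for example via transformers.Trainer) with task-specific hyperparameters.
inputs = tokenizer("이 영화 정말 재밌어요", return_tensors="pt")
outputs = model(**inputs)
print(outputs.logits.shape)  # torch.Size([1, 2])
```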
+
+ ### KoELECTRA v3 (Base Discriminator) Estimated Emissions
+
+ You can estimate carbon emissions using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
+
+ * **Hardware Type:** TPU v3-8
+ * **Hours used:** 336 hours (14 days)
+ * **Cloud Provider:** GCP (Google Cloud Platform)
+ * **Compute Region:** europe-west4-a
+ * **Carbon Emitted (power consumption × time × carbon intensity of the power grid at the compute location):** 54.2 kg of CO2eq
+
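The emissions figure is the product described in the bullet above. The sketch below only illustrates that arithmetic; the power draw and grid carbon intensity are placeholder variables introduced for this sketch, not numbers reported in the card (the linked calculator supplies the real, hardware- and region-specific factors):

```python
# Illustration of "power consumption x time x carbon intensity of the power grid".
# Both assumed_* values are placeholders, NOT figures from the card; the ML Impact
# calculator linked above provides hardware- and region-specific numbers.
assumed_power_kw = 0.3          # placeholder average power draw, in kW
assumed_kg_co2_per_kwh = 0.5    # placeholder grid carbon intensity for the region

hours_used = 336                # from the card: 14 days on a TPU v3-8

energy_kwh = assumed_power_kw * hours_used
emissions_kg = energy_kwh * assumed_kg_co2_per_kwh
print(f"{energy_kwh:.1f} kWh -> {emissions_kg:.1f} kg CO2eq (the card reports 54.2 kg)")
```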
+ ## Citation
+ ```bibtex
+ @misc{park2020koelectra,
+   author = {Park, Jangwon},
+   title = {KoELECTRA: Pretrained ELECTRA Model for Korean},
+   year = {2020},
+   publisher = {GitHub},
+   journal = {GitHub repository},
+   howpublished = {\url{https://github.com/monologg/KoELECTRA}}
+ }
+ ```