---
language: de
license: mit
thumbnail: "https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/german-electra-logo.png"
tags:
- electra
- commoncrawl
- uncased
- umlaute
- umlauts
- german
- deutsch
---

# German Electra Uncased
<img width="300px" src="https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/german-electra-logo.png">
[¹]

# Version 2 Release
We released an improved version of this model. Version 1 was trained for 766,000 steps. For version 2 we continued the training for an additional 734,000 steps, so version 2 was trained for a total of 1,500,000 steps. See "Evaluation of Version 2: GermEval18 Coarse" below for details.

# Model Info
This model is suitable for training on many downstream tasks in German (Q&A, sentiment analysis, etc.).

It can be used as a drop-in replacement for **BERT** in most downstream tasks (**ELECTRA** is even implemented as an extended **BERT** class).

At the time of release (August 2020) this model was the best-performing publicly available German NLP model on various German evaluation tasks (CoNLL03-DE, GermEval18 Coarse, GermEval18 Fine). For GermEval18 Coarse results see below. More will be published soon.
# Installation
This model has the special feature that it is **uncased** but does **not strip accents**.
This option was added by us with [PR #6280](https://github.com/huggingface/transformers/pull/6280).
To use it you need Transformers version 3.1.0 or newer.

```bash
pip install transformers -U
```
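
The snippet below is a minimal usage sketch. The repository ID `german-nlp-group/electra-base-german-uncased` is an assumption (the card itself does not spell out the Hub model ID); replace it with the actual name of this model if it differs.

```python
# Minimal usage sketch -- the repo ID below is an assumed placeholder.
from transformers import AutoModel, AutoTokenizer

model_id = "german-nlp-group/electra-base-german-uncased"

# The tokenizer of this model lowercases text but keeps umlauts,
# i.e. it behaves like do_lower_case=True with strip_accents=False.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

inputs = tokenizer("Das ist ein Beispielsatz über Straßenbäume.", return_tensors="pt")
outputs = model(**inputs)
print(outputs[0].shape)  # (1, sequence_length, 768) discriminator hidden states
```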

# Uncase and Umlauts ('Ö', 'Ä', 'Ü')
This model is uncased. This helps especially for domains where colloquial terms with incorrect capitalization are often used.

The special characters 'ö', 'ü', 'ä' are preserved through the `strip_accents=False` option, as this leads to improved precision.
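
To illustrate the effect, the sketch below contrasts the model's default behavior with forced accent stripping. The explicit `strip_accents` override is the option added in PR #6280; the repo ID is again an assumed placeholder, and the tokens in the comments are only indicative, since the exact wordpiece split depends on the vocabulary.

```python
# Compare tokenization with umlauts kept (model default) vs. accents stripped.
from transformers import AutoTokenizer

model_id = "german-nlp-group/electra-base-german-uncased"  # assumed placeholder

keep_umlauts = AutoTokenizer.from_pretrained(model_id)                       # strip_accents=False
strip_umlauts = AutoTokenizer.from_pretrained(model_id, strip_accents=True)  # accents removed

text = "Schöne Grüße aus München"
print(keep_umlauts.tokenize(text))   # lowercased, umlauts kept, e.g. 'schöne', 'grüße', 'münchen'
print(strip_umlauts.tokenize(text))  # lowercased, umlauts stripped, e.g. 'schone', 'gruße', 'munchen'
```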

# Creators
This model was trained and open-sourced in conjunction with the [**German NLP Group**](https://github.com/German-NLP-Group) in equal parts by:
- [**Philip May**](https://May.la) - [T-Systems on site services GmbH](https://www.t-systems-onsite.de/)
- [**Philipp Reißel**](https://www.reissel.eu) - [ambeRoad](https://amberoad.de/)

# Evaluation of Version 2: GermEval18 Coarse
We evaluated all language models on GermEval18 with the F1 macro score. For each model we ran an extensive automated hyperparameter search. With the best hyperparameters we fit each model multiple times on GermEval18. This is done to cancel out random effects and obtain statistically meaningful results.
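
For reference, here is a minimal sketch of how the reported mean, median, and standard deviation can be aggregated; the helper names are illustrative, and the per-run predictions are assumed to come from your own fine-tuning loop.

```python
# Aggregate macro F1 over repeated fine-tuning runs (illustrative helpers).
from statistics import mean, median, stdev

from sklearn.metrics import f1_score


def macro_f1(y_true, y_pred):
    # GermEval18 coarse task has two labels: OFFENSE and OTHER.
    return f1_score(y_true, y_pred, average="macro")


def summarize(run_scores):
    # run_scores: one macro-F1 value per repeated fine-tuning run (different seeds)
    return {
        "mean": mean(run_scores),
        "median": median(run_scores),
        "std": stdev(run_scores),
    }
```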

![GermEval18 Coarse Model Evaluation for Version 2](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/model-eval-v2.png)

# Evaluation: GermEval18 Coarse

| Model Name | F1 macro<br/>Mean | F1 macro<br/>Median | F1 macro<br/>Std |
|---|---|---|---|
| dbmdz-bert-base-german-europeana-cased | 0.727 | 0.729 | 0.00674 |
| dbmdz-bert-base-german-europeana-uncased | 0.736 | 0.737 | 0.00476 |
| dbmdz/electra-base-german-europeana-cased-discriminator | 0.745 | 0.745 | 0.00498 |
| distilbert-base-german-cased | 0.752 | 0.752 | 0.00341 |
| bert-base-german-cased | 0.762 | 0.761 | 0.00597 |
| dbmdz/bert-base-german-cased | 0.765 | 0.765 | 0.00523 |
| dbmdz/bert-base-german-uncased | 0.770 | 0.770 | 0.00572 |
| **ELECTRA-base-german-uncased (this model)** | **0.778** | **0.778** | **0.00392** |

- (1): Hyperparameters taken from the [FARM project](https://farm.deepset.ai/) "[germEval18Coarse_config.json](https://github.com/deepset-ai/FARM/blob/master/experiments/german-bert2.0-eval/germEval18Coarse_config.json)"

![GermEval18 Coarse Model Evaluation](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/model_eval.png)

# Checkpoint evaluation
Since it is not guaranteed that the last checkpoint is the best, we evaluated the checkpoints on GermEval18. We found that the last checkpoint is indeed the best. The training was stable and did not overfit the text corpus. Below is a boxplot chart showing the different checkpoints.

![Checkpoint Evaluation on GermEval18](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/checkpoint_eval.png)

# Pre-training details

## Data
- Cleaned Common Crawl Corpus 2019-09 German: [CC_net](https://github.com/facebookresearch/cc_net) (only the head corpus, filtered for language_score > 0.98) - 62 GB
- German Wikipedia Article Pages Dump (20200701) - 5.5 GB
- German Wikipedia Talk Pages Dump (20200620) - 1.1 GB
- Subtitles - 823 MB
- News 2018 - 4.1 GB

The sentences were split with [SoMaJo](https://github.com/tsproisl/SoMaJo). We included the German Wikipedia Article Pages Dump three times to oversample it; a similar approach was used for GPT-3 (Table 2.2).
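
Below is a sketch of the sentence-splitting step, assuming the SoMaJo v2-style API (`SoMaJo("de_CMC")` and `tokenize_text`); the input handling is simplified for illustration and is not our exact preprocessing script.

```python
# Illustrative sentence splitting with SoMaJo (v2-style API assumed).
from somajo import SoMaJo

sentence_splitter = SoMaJo("de_CMC", split_sentences=True)

paragraphs = [
    "Das ist der erste Satz. Und hier folgt direkt noch ein zweiter Satz.",
]

# tokenize_text yields one list of tokens per detected sentence
for sentence in sentence_splitter.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))
```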

More details can be found at [Preparing Datasets for German Electra on GitHub](https://github.com/German-NLP-Group/german-transformer-training).

## Electra Branch no_strip_accents
Because we do not want to strip accents in our training data, we made a change to ELECTRA and used this repo [Electra no_strip_accents](https://github.com/PhilipMay/electra/tree/no_strip_accents) (branch `no_strip_accents`). We then created the TF dataset with:

```bash
python build_pretraining_dataset.py \
  --corpus-dir <corpus_dir> \
  --vocab-file <dir>/vocab.txt \
  --output-dir ./tf_data \
  --max-seq-length 512 \
  --num-processes 8 \
  --do-lower-case \
  --no-strip-accents
```

## The training
The training itself can be performed with the original ELECTRA repo (no special changes are needed for this).
We ran it with the following config:

<details>
<summary>The exact Training Config</summary>
<br/>debug False
<br/>disallow_correct False
<br/>disc_weight 50.0
<br/>do_eval False
<br/>do_lower_case True
<br/>do_train True
<br/>electra_objective True
<br/>embedding_size 768
<br/>eval_batch_size 128
<br/>gcp_project None
<br/>gen_weight 1.0
<br/>generator_hidden_size 0.33333
<br/>generator_layers 1.0
<br/>iterations_per_loop 200
<br/>keep_checkpoint_max 0
<br/>learning_rate 0.0002
<br/>lr_decay_power 1.0
<br/>mask_prob 0.15
<br/>max_predictions_per_seq 79
<br/>max_seq_length 512
<br/>model_dir gs://XXX
<br/>model_hparam_overrides {}
<br/>model_name 02_Electra_Checkpoints_32k_766k_Combined
<br/>model_size base
<br/>num_eval_steps 100
<br/>num_tpu_cores 8
<br/>num_train_steps 766000
<br/>num_warmup_steps 10000
<br/>pretrain_tfrecords gs://XXX
<br/>results_pkl gs://XXX
<br/>results_txt gs://XXX
<br/>save_checkpoints_steps 5000
<br/>temperature 1.0
<br/>tpu_job_name None
<br/>tpu_name electrav5
<br/>tpu_zone None
<br/>train_batch_size 256
<br/>uniform_generator False
<br/>untied_generator True
<br/>untied_generator_embeddings False
<br/>use_tpu True
<br/>vocab_file gs://XXX
<br/>vocab_size 32767
<br/>weight_decay_rate 0.01
</details>

![Training Loss](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/loss.png)

Please note: *due to the GAN-like structure of ELECTRA, the loss is not that meaningful.*

Training took about 7 days on a preemptible TPU v3-8. In total, the model went through approximately 10 epochs. For automatic recreation of preempted TPUs we used [tpunicorn](https://github.com/shawwn/tpunicorn). The total cost of training summed up to about $450 for one run. The data preprocessing and vocab creation needed approximately 500-1,000 CPU hours. Servers were fully provided by [T-Systems on site services GmbH](https://www.t-systems-onsite.de/) and [ambeRoad](https://amberoad.de/).
Special thanks to [Stefan Schweter](https://github.com/stefan-it) for his feedback and for providing parts of the text corpus.

[¹]: Source of the picture: [Pinterest](https://www.pinterest.cl/pin/371828512984142193/)

# Negative Results
We tried the following approaches and found that they had no positive influence:

- **Increased vocab size**: leads to more parameters and thus fewer examples/sec, while no visible performance gains were measured.
- **Decreased batch size**: the original ELECTRA was trained with a batch size of 16 per TPU core, whereas this model was trained with 32 per TPU core. We found that a batch size of 32 leads to better results when comparing metrics over computation time.