PhilipMay committed on
Commit
cf60caf
1 Parent(s): 645b85a

add license details

Files changed (1)
  1. README.md +23 -13
README.md CHANGED
@@ -16,17 +16,17 @@ tags:
  <img width="300px" src="https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/german-electra-logo.png">
  [¹]
 
- # Version 2 Release
+ ## Version 2 Release
  We released an improved version of this model. Version 1 was trained for 766,000 steps. For this new version we continued the training for an additional 734,000 steps, so version 2 was trained for a total of 1,500,000 steps. See "Evaluation of Version 2: GermEval18 Coarse" below for details.
 
- # Model Info
+ ## Model Info
  This model is suitable for fine-tuning on many downstream tasks in German (Q&A, sentiment analysis, etc.).
 
  It can be used as a drop-in replacement for **BERT** in most downstream tasks (**ELECTRA** is even implemented as an extended **BERT** class).
 
  At the time of release (August 2020) this model was the best-performing publicly available German NLP model on various German evaluation metrics (CoNLL03-DE, GermEval18 Coarse, GermEval18 Fine). For GermEval18 Coarse results see below. More will be published soon.
 
- # Installation
+ ## Installation
  This model has the special feature that it is **uncased** but does **not strip accents**.
  This possibility was added by us with [PR #6280](https://github.com/huggingface/transformers/pull/6280).
  To use it you need Transformers version 3.1.0 or newer.
@@ -35,23 +35,23 @@ To use it you need Transformers version 3.1.0 or newer.
  pip install transformers -U
  ```
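Below is a minimal, illustrative loading sketch (not part of the original model card). The repository id `german-nlp-group/electra-base-german-uncased` and the two-label setup are assumptions - adjust them to this model's actual Hub id and to your task.

```python
# Illustrative sketch: load the uncased, accent-preserving German ELECTRA
# as a drop-in BERT replacement for sequence classification.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_id = "german-nlp-group/electra-base-german-uncased"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
logits = model(**inputs)[0]  # tuple-style indexing works across Transformers versions
print(logits.shape)          # torch.Size([1, 2])
```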
 
- # Uncase and Umlauts ('Ö', 'Ä', 'Ü')
+ ## Uncase and Umlauts ('Ö', 'Ä', 'Ü')
  This model is uncased. This helps especially in domains where colloquial terms with incorrect capitalization are often used.
 
  The special characters 'ö', 'ü', 'ä' are kept through the `strip_accents=False` option, as this leads to improved precision.
 
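A quick, illustrative check of this behaviour (the exact subword splits depend on the vocabulary, so they are not reproduced here):

```python
from transformers import AutoTokenizer

# Assumed repo id - see the loading sketch in the Installation section.
tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")

tokens = tokenizer.tokenize("Die Bäume am Flussufer sind GRÜN.")
print(tokens)
# Expected: everything is lowercased, but 'ä' and 'ü' survive
# (they are not reduced to 'a' and 'u'), because strip_accents=False.
```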
 
- # Creators
+ ## Creators
  This model was trained and open sourced in conjunction with the [**German NLP Group**](https://github.com/German-NLP-Group), in equal parts by:
  - [**Philip May**](https://May.la) - [T-Systems on site services GmbH](https://www.t-systems-onsite.de/)
  - [**Philipp Reißel**](https://www.reissel.eu) - [ambeRoad](https://amberoad.de/)
 
- # Evaluation of Version 2: GermEval18 Coarse
+ ## Evaluation of Version 2: GermEval18 Coarse
  We evaluated all language models on GermEval18 with the F1 macro score. For each model we ran an extensive automated hyperparameter search. With the best hyperparameters we fit the model multiple times on GermEval18 to cancel out random effects and obtain statistically meaningful results.
 
  ![GermEval18 Coarse Model Evaluation for Version 2](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/model-eval-v2.png)
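For illustration only (this is not the original evaluation script), the mean, median and standard deviation reported below can be aggregated from such repeated fine-tuning runs as follows; the label/prediction pairs are placeholders:

```python
import numpy as np
from sklearn.metrics import f1_score

# Placeholder data: one (y_true, y_pred) pair per fine-tuning run with a different seed.
runs = [
    ([0, 1, 1, 0, 1], [0, 1, 0, 0, 1]),
    ([0, 1, 1, 0, 1], [0, 1, 1, 0, 1]),
    ([0, 1, 1, 0, 1], [1, 1, 1, 0, 1]),
]

scores = [f1_score(y_true, y_pred, average="macro") for y_true, y_pred in runs]
print(f"mean={np.mean(scores):.4f}  median={np.median(scores):.4f}  std={np.std(scores):.4f}")
```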
 
 
- # Evaluation: GermEval18 Coarse
+ ## Evaluation: GermEval18 Coarse
 
  | Model Name | F1 macro<br/>Mean | F1 macro<br/>Median | F1 macro<br/>Std |
  |---|---|---|---|
@@ -68,14 +68,14 @@ We evaluated all language models on GermEval18 with the F1 macro score. For each
 
  ![GermEval18 Coarse Model Evaluation](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/model_eval.png)
 
- # Checkpoint evaluation
+ ## Checkpoint evaluation
  Since it is not guaranteed that the last checkpoint is the best, we evaluated the checkpoints on GermEval18. We found that the last checkpoint is indeed the best. The training was stable and did not overfit the text corpus. Below is a boxplot chart showing the different checkpoints.
 
  ![Checkpoint Evaluation on GermEval18](https://raw.githubusercontent.com/German-NLP-Group/german-transformer-training/master/model_cards/checkpoint_eval.png)
 
- # Pre-training details
+ ## Pre-training details
 
- ## Data
+ ### Data
  - Cleaned Common Crawl Corpus 2019-09 German: [CC_net](https://github.com/facebookresearch/cc_net) (only the head corpus, filtered for language_score > 0.98) - 62 GB
  - German Wikipedia Article Pages Dump (20200701) - 5.5 GB
  - German Wikipedia Talk Pages Dump (20200620) - 1.1 GB
@@ -86,14 +86,14 @@ The sentences were split with [SoMaJo](https://github.com/tsproisl/SoMaJo). We t
 
  More details can be found at [Preparing Datasets for German Electra (GitHub)](https://github.com/German-NLP-Group/german-transformer-training)
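As an illustrative preprocessing sketch (not the original pipeline): keep only CC_net documents with `language_score > 0.98` and split the remaining text into sentences with SoMaJo. The file name, the JSON field names and the SoMaJo model `de_CMC` are assumptions.

```python
import gzip
import json

from somajo import SoMaJo

# German model; sentence splitting is enabled by default.
sentence_splitter = SoMaJo("de_CMC", split_camel_case=True)

kept_texts = []
with gzip.open("cc_net_shard.json.gz", "rt", encoding="utf-8") as f:  # assumed file name
    for line in f:
        doc = json.loads(line)  # field names as in CC_net's JSON output (assumption)
        if doc.get("language") == "de" and doc.get("language_score", 0.0) > 0.98:
            kept_texts.append(doc["raw_content"])

for sentence in sentence_splitter.tokenize_text(kept_texts):
    # Each sentence is a list of SoMaJo Token objects.
    print(" ".join(token.text for token in sentence))
```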
 
- ## Electra Branch no_strip_accents
+ ### Electra Branch no_strip_accents
  Because we do not want to strip accents in our training data, we made a change to Electra and used this repo: [Electra no_strip_accents](https://github.com/PhilipMay/electra/tree/no_strip_accents) (branch `no_strip_accents`). We then created the TF dataset with:
 
  ```bash
  python build_pretraining_dataset.py --corpus-dir <corpus_dir> --vocab-file <dir>/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents
  ```
 
- ## The training
+ ### The training
  The training itself can be performed with the original Electra repo (no special changes are needed for this).
  We ran it with the following config:
 
@@ -154,8 +154,18 @@ Special thanks to [Stefan Schweter](https://github.com/stefan-it) for your feedb
 
  [¹]: Source for the picture [Pinterest](https://www.pinterest.cl/pin/371828512984142193/)
 
- # Negative Results
+ ### Negative Results
  We tried the following approaches, which we found had no positive influence:
 
  - **Increased Vocab Size**: leads to more parameters and thus fewer examples/sec, while no visible performance gains were measured
  - **Decreased Batch Size**: the original Electra was trained with a batch size of 16 per TPU core, whereas this model was trained with 32 per TPU core. We found that a batch size of 32 leads to better results when comparing metrics over computation time
+
+ ## License - The MIT License
+ Copyright 2020-2021 Philip May
+ Copyright 2020-2021 Philipp Reissel
+
+ Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
+
+ The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
+
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.