How to use this model directly from the 🤗/transformers library:
```python
from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")

model = AutoModelWithLMHead.from_pretrained("german-nlp-group/electra-base-german-uncased")
```
This model is suitable for training on many downstream tasks in German (Q&A, sentiment analysis, etc.).

It can be used as a drop-in replacement for BERT in most downstream tasks (ELECTRA is even implemented as an extended BERT class).
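Fine-tuning for a downstream task therefore only requires picking the matching auto class, exactly as with BERT. A minimal sketch (the label count and example sentence are placeholders, not from this card), assuming PyTorch is installed:

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "german-nlp-group/electra-base-german-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# The classification head is freshly initialized and still has to be fine-tuned.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

inputs = tokenizer("Das Produkt ist richtig gut!", return_tensors="pt")
outputs = model(**inputs)  # untrained logits of shape (1, 2)
```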
At the time of release (August 2020) this model is the best-performing publicly available German NLP model on various German evaluation benchmarks (CoNLL03-DE, GermEval18 Coarse, GermEval18 Fine). For GermEval18 Coarse results see below. More will be published soon.

This model has the special feature that it is uncased but does not strip accents. We added this option to Transformers with PR #6280. To use it you need Transformers version 3.1.0 or newer, which you can install with:
```bash
pip install transformers -U
```
This model is uncased, which helps especially in domains where colloquial terms with incorrect capitalization are often used.

The special characters 'ö', 'ü', 'ä' are kept through the `strip_accents=False` option, as this leads to improved precision.
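A quick illustration of this behavior (the example sentence is our own; the comment indicates the expected kind of output, not verbatim tokens):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")
# Input is lowercased, but the umlauts survive because accents are not stripped.
print(tokenizer.tokenize("Schöne GRÜSSE aus München"))
# -> lowercased tokens that still contain 'ö' and 'ü'
```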
This model was trained and open-sourced in conjunction with the German NLP Group, in equal parts by:
| Model Name | F1 macro (mean) | F1 macro (median) | F1 macro (std) |
|---|---|---|---|
| ELECTRA-base-german-uncased (this model) | 0.778 | 0.778 | 0.00392 |
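For illustration only, the mean/median/standard-deviation columns can be computed from the scores of several fine-tuning runs like this (the scores below are placeholders, not our results):

```python
from statistics import mean, median, stdev

# Placeholder F1 macro scores from hypothetical repeated fine-tuning runs.
f1_macro_runs = [0.774, 0.777, 0.778, 0.779, 0.782]
print(f"mean={mean(f1_macro_runs):.3f} "
      f"median={median(f1_macro_runs):.3f} "
      f"std={stdev(f1_macro_runs):.5f}")
```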
Since it is not guaranteed that the last checkpoint is the best, we evaluated the checkpoints on GermEval18. We found that the last checkpoint is indeed the best. The training was stable and did not overfit the text corpus. Below is a boxplot chart showing the performance of the different checkpoints.
The sentences were split with SoMaJo. We included the German Wikipedia article pages dump three times to oversample it. A similar oversampling approach was also used for GPT-3 (Table 2.2).
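For illustration, sentence splitting with SoMaJo looks roughly like this (based on the SoMaJo 2.x API; a sketch, not our exact preprocessing script):

```python
from somajo import SoMaJo

# SoMaJo tokenizes paragraphs and can split them into sentences at the same time.
sentence_splitter = SoMaJo("de_CMC", split_sentences=True)
paragraphs = ["Das ist ein Satz. Das ist noch einer."]
for sentence in sentence_splitter.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))
```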
More details can be found in the Preparing Datasets for German ELECTRA GitHub repository.
Because we do not want to strip accents in our training data, we made a change to ELECTRA and used this repo: Electra no_strip_accents (branch `no_strip_accents`). We then created the TF dataset with:
```bash
python build_pretraining_dataset.py \
  --corpus-dir <corpus_dir> \
  --vocab-file <dir>/vocab.txt \
  --output-dir ./tf_data \
  --max-seq-length 512 \
  --num-processes 8 \
  --do-lower-case \
  --no-strip-accents
```
The training itself can be performed with the original ELECTRA repo (no special changes are needed for this). We ran it with the following config:
Please note: due to the GAN-like structure of ELECTRA, the loss is not that meaningful.
It took about 7 days on a preemptible TPU v3-8. In total, the model went through approximately 10 epochs. To automatically recreate preempted TPUs we used tpunicorn. The total cost of training summed up to about $450 for one run. The data preprocessing and vocab creation needed approximately 500-1,000 CPU hours. Servers were fully provided by T-Systems on site services GmbH and ambeRoad. Special thanks to Stefan Schweter for his feedback and for providing parts of the text corpus.
[¹]: Source for the picture: Pinterest
We tried the following approaches, which we found had no positive influence: