How to use this model directly from the 🤗/transformers library:

from transformers import AutoTokenizer, AutoModelWithLMHead

tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")
model = AutoModelWithLMHead.from_pretrained("german-nlp-group/electra-base-german-uncased")

German Electra Uncased


Model Info

This model is suitable for training on many downstream tasks in German (Q&A, sentiment analysis, etc.).

It can be used as a drop-in replacement for BERT in most downstream tasks (ELECTRA is even implemented as an extended BERT class).

At the time of release (August 2020) this model is the best performing publicly available German NLP model on various German evaluation benchmarks (CoNLL03-DE, GermEval18 Coarse, GermEval18 Fine). For the GermEval18 Coarse results see below. More will be published soon.
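
Because the checkpoint behaves like a BERT checkpoint in the Transformers API, loading it for a downstream task is a one-line change. The snippet below is only a minimal sketch, assuming a recent Transformers version and a hypothetical two-label task; it is not the exact fine-tuning setup used for the evaluation below.

# Minimal sketch (not the exact fine-tuning setup): load the checkpoint like a
# BERT model for a hypothetical two-label classification task.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "german-nlp-group/electra-base-german-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Forward pass on a sample sentence; the classification head is freshly
# initialized and still needs fine-tuning on task data.
inputs = tokenizer("Das ist ein Beispielsatz.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])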


This model has the special feature that it is uncased but does not strip accents. This option was added by us with PR #6280. To use it you need Transformers version 3.1.0 or newer.

pip install transformers -U

Uncase and Umlauts ('Ö', 'Ä', 'Ü')

This model is uncased. This helps especially for domains where colloquial terms with incorrect capitalization are often used.

The special characters 'ö', 'ü', 'ä' are preserved through the strip_accents=False option, as this leads to improved precision.
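
A minimal way to see both effects is to tokenize a mixed-case input with umlauts; this is just a sketch, and the exact subword split depends on the trained vocabulary:

# Sketch: the tokenizer lower-cases the input but keeps 'ö', 'ä', 'ü'
# because accent stripping is disabled in the tokenizer config.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("german-nlp-group/electra-base-german-uncased")
print(tokenizer.tokenize("Die GRÖSSE der Übung"))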


This model was trained and open sourced in equal parts by members of the German NLP Group.

Evaluation: GermEval18 Coarse

Model Name   F1 macro (mean)   F1 macro (median)   F1 macro (std. dev.)
dbmdz-bert-base-german-europeana-cased 0.727 0.729 0.00674
dbmdz-bert-base-german-europeana-uncased 0.736 0.737 0.00476
dbmdz/electra-base-german-europeana-cased-discriminator 0.745 0.745 0.00498
distilbert-base-german-cased 0.752 0.752 0.00341
bert-base-german-cased 0.762 0.761 0.00597
dbmdz/bert-base-german-cased 0.765 0.765 0.00523
dbmdz/bert-base-german-uncased 0.770 0.770 0.00572
ELECTRA-base-german-uncased (this model) 0.778 0.778 0.00392

GermEval18 Coarse Model Evaluation

Checkpoint evaluation

Since it is not guaranteed that the last checkpoint is the best, we evaluated the checkpoints on GermEval18. We found that the last checkpoint is indeed the best. The training was stable and did not overfit the text corpus. Below is a boxplot chart showing the different checkpoints.

Checkpoint Evaluation on GermEval18

Pre-training details


The following data was used as the text corpus:

  • Cleaned Common Crawl Corpus 2019-09 German: CC_net (only the head corpus, filtered for language_score > 0.98) - 62 GB
  • German Wikipedia Article Pages Dump (20200701) - 5.5 GB
  • German Wikipedia Talk Pages Dump (20200620) - 1.1 GB
  • Subtitles - 823 MB
  • News 2018 - 4.1 GB

The sentences were split with SoMaJo. We included the German Wikipedia Article Pages Dump three times to oversample it. A similar approach was used in GPT-3 (Table 2.2).
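
For illustration, sentence splitting with SoMaJo looks roughly like the sketch below (assuming SoMaJo 2.x; this is not the exact preprocessing script, which is linked below):

# Rough sketch of sentence splitting with SoMaJo 2.x (not the exact preprocessing code).
from somajo import SoMaJo

sentence_splitter = SoMaJo("de_CMC", split_sentences=True)
paragraphs = ["Das ist ein Satz. Das ist noch ein Satz."]

# tokenize_text() yields one token list per detected sentence.
for sentence in sentence_splitter.tokenize_text(paragraphs):
    print(" ".join(token.text for token in sentence))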

More details can be found here: Preparing Datasets for German Electra (GitHub).

Electra Branch no_strip_accents

Because we do not want to strip accents in our training data, we made a change to ELECTRA and used this repo: Electra no_strip_accents (branch no_strip_accents). We then created the TF dataset with:

python build_pretraining_dataset.py --corpus-dir <corpus_dir> --vocab-file <dir>/vocab.txt --output-dir ./tf_data --max-seq-length 512 --num-processes 8 --do-lower-case --no-strip-accents

The training

The training itself can be performed with the original ELECTRA repo (no special changes are needed for this). We ran it with the config below; a sketch of the corresponding invocation follows the config.

The exact training config:
debug False
disallow_correct False
disc_weight 50.0
do_eval False
do_lower_case True
do_train True
electra_objective True
embedding_size 768
eval_batch_size 128
gcp_project None
gen_weight 1.0
generator_hidden_size 0.33333
generator_layers 1.0
iterations_per_loop 200
keep_checkpoint_max 0
learning_rate 0.0002
lr_decay_power 1.0
mask_prob 0.15
max_predictions_per_seq 79
max_seq_length 512
model_dir gs://XXX
model_hparam_overrides {}
model_name 02_Electra_Checkpoints_32k_766k_Combined
model_size base
num_eval_steps 100
num_tpu_cores 8
num_train_steps 766000
num_warmup_steps 10000
pretrain_tfrecords gs://XXX
results_pkl gs://XXX
results_txt gs://XXX
save_checkpoints_steps 5000
temperature 1.0
tpu_job_name None
tpu_name electrav5
tpu_zone None
train_batch_size 256
uniform_generator False
untied_generator True
untied_generator_embeddings False
use_tpu True
vocab_file gs://XXX
vocab_size 32767
weight_decay_rate 0.01
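
For reference, values like the above are passed to the original ELECTRA run_pretraining.py as hyperparameter overrides. The line below is only a sketch of such an invocation; the bucket path and the selection of overrides are placeholders, not necessarily the exact command that was used.

python run_pretraining.py --data-dir gs://<bucket> --model-name 02_Electra_Checkpoints_32k_766k_Combined --hparams '{"model_size": "base", "vocab_size": 32767, "embedding_size": 768, "max_seq_length": 512, "train_batch_size": 256, "num_train_steps": 766000, "do_lower_case": true, "use_tpu": true, "num_tpu_cores": 8}'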

Training Loss

Please note: due to the GAN-like structure of ELECTRA, the loss is not very meaningful.
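
For reference, the pre-training objective from the ELECTRA paper combines the generator (MLM) loss and the discriminator loss; with gen_weight 1.0 and disc_weight 50.0 from the config above, the logged curve mixes two very differently scaled terms:

\min_{\theta_G, \theta_D} \sum_{x \in \mathcal{X}} \mathcal{L}_{\mathrm{MLM}}(x, \theta_G) + \lambda \, \mathcal{L}_{\mathrm{Disc}}(x, \theta_D), \qquad \lambda = 50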

It took about 7 days on a preemptible TPU v3-8. In total, the model went through approximately 10 epochs. For the automatic recreation of preempted TPUs we used tpunicorn. The total cost of training summed up to about $450 for one run. The data preprocessing and vocab creation needed approximately 500-1000 CPU hours. Servers were fully provided by T-Systems on site services GmbH and ambeRoad. Special thanks to Stefan Schweter for his feedback and for providing parts of the text corpus.


Negative Results

We tried the following approaches which we found had no positive influence:

  • Increased vocab size: leads to more parameters and thus reduced examples/sec, while no visible performance gains were measured
  • Decreased batch size: the original ELECTRA was trained with a batch size of 16 per TPU core, whereas this model was trained with 32 per TPU core. We found that a batch size of 32 leads to better results when you compare metrics over computation time