This German BERT model is based on bert-base-german-dbmdz-cased and has been adapted to the domain of literary texts by fine-tuning it on the language modeling task with the Corpus of German-Language Fiction. Afterwards, the model was fine-tuned for named entity recognition on the DROC corpus, so you can use it to recognize protagonists in German novels.
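As a rough illustration of the use case described above, the following sketch loads the model through the `transformers` token-classification pipeline and counts the character mentions it predicts. The `MODEL_ID` value is a hypothetical placeholder (substitute the model's actual Hub ID or a local path), and the helper names `load_tagger` and `count_mentions` as well as the `min_score` threshold are assumptions made for this example, not part of the model.

```python
from collections import Counter

# Hypothetical placeholder: replace with the model's actual Hub ID or a local path.
MODEL_ID = "path/to/literary-german-bert"


def load_tagger(model_id=MODEL_ID):
    """Build a token-classification pipeline for the given model."""
    # Imported lazily so that count_mentions() works without transformers installed.
    from transformers import pipeline

    # aggregation_strategy="simple" merges word pieces into whole entity spans.
    return pipeline(
        "token-classification",
        model=model_id,
        aggregation_strategy="simple",
    )


def count_mentions(predictions, min_score=0.5):
    """Count predicted mentions by (label, surface form), skipping low-confidence ones.

    `predictions` is the list of dicts returned by an aggregated
    token-classification pipeline (keys: entity_group, word, score, start, end).
    """
    counts = Counter()
    for pred in predictions:
        if pred["score"] >= min_score:
            counts[(pred["entity_group"], pred["word"])] += 1
    return counts


# Example usage (requires the model weights to be available):
# tagger = load_tagger()
# mentions = count_mentions(tagger("Effi Briest ging in den Garten."))
# print(mentions.most_common(5))
```

Counting by surface form is only a first approximation of "protagonists": it does not resolve different references to the same character.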
The Corpus of German-Language Fiction consists of 3,194 documents with 203,516,988 tokens and 1,520,855 types. The publication years of the texts range from the 18th to the 20th century:
After one epoch:
The provided model was also fine-tuned for two epochs on 10,799 training sentences, validated on 547 sentences, and tested on 1,845 sentences, with three labels:
The model has also been evaluated using 10-fold cross validation and compared with a classic Conditional Random Field baseline described in Jannidis et al. (2015):
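For readers unfamiliar with the procedure, 10-fold cross validation can be sketched as follows. This is a generic illustration only: the function name `k_fold_indices` is hypothetical, the folds here are contiguous and unshuffled, and it is an assumption that items are split at this level at all, since the text does not say whether the folds were drawn by sentence or by document.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train_idx, test_idx) index lists for k contiguous, near-equal folds."""
    # Distribute the remainder so that fold sizes differ by at most one.
    fold_sizes = [n_items // k + (1 if i < n_items % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        # Everything outside the current test fold is used for training.
        train_idx = list(range(0, start)) + list(range(start + size, n_items))
        yield train_idx, test_idx
        start += size


# Each item lands in exactly one test fold; scores are averaged over the k runs.
for fold, (train_idx, test_idx) in enumerate(k_fold_indices(25, k=10)):
    print(fold, len(train_idx), len(test_idx))
```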
Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, Description of a Corpus of Character References in German Novels, 2018.
Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, Automatische Erkennung von Figuren in deutschsprachigen Romanen, 2015.