julien-c (HF staff) committed on
Commit
0d556cb
1 Parent(s): 38e187e

Migrate model card from transformers-repo

Read announcement at https://discuss.huggingface.co/t/announcement-all-model-cards-will-be-migrated-to-hf-co-model-repos/2755
Original file history: https://github.com/huggingface/transformers/commits/master/model_cards/severinsimmler/literary-german-bert/README.md

Files changed (1)
  1. README.md +51 -0
README.md ADDED
@@ -0,0 +1,51 @@
+ ---
+ language: de
+ thumbnail: kfold.png
+ ---
+
+ # German BERT for literary texts
+
+ This German BERT is based on `bert-base-german-dbmdz-cased` and has been adapted to the domain of literary texts by fine-tuning on the language modeling task with the [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1). The model was then fine-tuned for named entity recognition on the [DROC](https://gitlab2.informatik.uni-wuerzburg.de/kallimachos/DROC-Release) corpus, so you can use it to recognize protagonists in German novels.
+
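A minimal usage sketch, assuming the model is published on the Hub as `severinsimmler/literary-german-bert` (the ID is inferred from the original model card path, not stated in the card itself) and that the `transformers` library is installed. The helper name `tag_persons` is illustrative.

```python
# Assumption: Hub ID inferred from the model card path, not stated in the card.
MODEL_ID = "severinsimmler/literary-german-bert"


def tag_persons(text, model_id=MODEL_ID):
    """Return the token-level predictions of the NER pipeline for `text`."""
    # Imported lazily so this module loads even where transformers is missing.
    from transformers import pipeline

    # "ner" is the token-classification pipeline; the first call downloads the weights.
    tagger = pipeline("ner", model=model_id, tokenizer=model_id)
    return tagger(text)
```

Calling `tag_persons("Effi Briest wartete im Garten auf ihre Mutter.")` should return a list of dictionaries, one per predicted token, with keys such as `word`, `entity` and `score`. The sentence is an illustrative example, not taken from DROC.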
+
+ # Stats
+
+ ## Language modeling
+
+ The [Corpus of German-Language Fiction](https://figshare.com/articles/Corpus_of_German-Language_Fiction_txt_/4524680/1) consists of 3,194 documents with 203,516,988 tokens and 1,520,855 types. The publication years of the texts range from the 18th to the 20th century:
+
+ ![years](prosa-jahre.png)
+
+
+ ### Results
+
+ Perplexity after one epoch of fine-tuning:
+
+ | Model           | Perplexity |
+ | --------------- | ---------- |
+ | Vanilla BERT    | 6.82       |
+ | Fine-tuned BERT | 4.98       |
+
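The card does not say how perplexity was computed. A common convention in language-model fine-tuning scripts, assumed here, is the exponential of the mean cross-entropy loss (in nats) on held-out text. Under that assumption, the sketch below converts between the two quantities; `perplexity` and `mean_loss` are hypothetical helper names.

```python
import math


def perplexity(token_losses):
    """Perplexity as exp of the mean per-token cross-entropy (natural log)."""
    if not token_losses:
        raise ValueError("need at least one token loss")
    return math.exp(sum(token_losses) / len(token_losses))


def mean_loss(ppl):
    """Inverse mapping: the mean cross-entropy implied by a perplexity value."""
    return math.log(ppl)


# Perplexities reported in the table above.
vanilla, fine_tuned = 6.82, 4.98
relative_drop = (vanilla - fine_tuned) / vanilla
print(f"implied loss: {mean_loss(vanilla):.3f} -> {mean_loss(fine_tuned):.3f}")
print(f"relative perplexity reduction: {relative_drop:.1%}")
```

Read this way, domain adaptation lowers perplexity by roughly a quarter relative to the unadapted model.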
+
+ ## Named entity recognition
+
+ The provided model was also fine-tuned for named entity recognition for two epochs on 10,799 training sentences, validated on 547 sentences and tested on 1,845 sentences, using three labels: `B-PER`, `I-PER` and `O`.
+
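With only `B-PER`, `I-PER` and `O`, token-level predictions can be merged into mentions by a single pass over the label sequence. The sketch below assumes the usual BIO convention (a `B-PER` opens a mention, following `I-PER` tags extend it); the function name and the example sentence are illustrative and not taken from DROC.

```python
def decode_persons(tokens, labels):
    """Group BIO-tagged tokens into person mentions (tokens joined by spaces)."""
    if len(tokens) != len(labels):
        raise ValueError("tokens and labels must have the same length")
    mentions, current = [], []
    for token, label in zip(tokens, labels):
        if label == "B-PER":
            # A new mention starts; close any mention that is still open.
            if current:
                mentions.append(" ".join(current))
            current = [token]
        elif label == "I-PER" and current:
            current.append(token)
        else:
            # "O", or a stray I-PER without a preceding B-PER, ends the mention.
            if current:
                mentions.append(" ".join(current))
            current = []
    if current:
        mentions.append(" ".join(current))
    return mentions


tokens = ["Der", "alte", "Herr", "von", "Briest", "rief", "Effi", "."]
labels = ["O", "O", "B-PER", "I-PER", "I-PER", "O", "B-PER", "O"]
print(decode_persons(tokens, labels))  # ['Herr von Briest', 'Effi']
```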
+
+ ### Results
+
+ | Dataset | Precision | Recall | F1   |
+ | ------- | --------- | ------ | ---- |
+ | Dev     | 96.4      | 87.3   | 91.6 |
+ | Test    | 92.8      | 94.9   | 93.8 |
+
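As a consistency check on the table above, F1 is the harmonic mean of precision and recall. The short sketch below recomputes it from the reported precision and recall; the function name is illustrative.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the table above.
reported = {"Dev": (96.4, 87.3), "Test": (92.8, 94.9)}
for split, (p, r) in reported.items():
    print(split, round(f1_score(p, r), 1))
```

Both recomputed values round to the F1 column of the table (91.6 and 93.8).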
+ The model has also been evaluated using 10-fold cross validation and compared with a classic Conditional Random Field baseline described in [Jannidis et al.](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf) (2015):
+
+ ![kfold](kfold.png)
+
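The card does not describe how the folds were built. As a rough illustration of 10-fold cross validation, the sketch below partitions item indices into ten disjoint test folds and trains on the remainder each time. Splitting by shuffled index, the seed, and the function name are assumptions; the original evaluation may have split by document instead.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Yield (train, test) index lists so that every item is tested exactly once."""
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    for fold in range(k):
        # Every k-th shuffled index goes to this fold's test set.
        test = indices[fold::k]
        held_out = set(test)
        train = [i for i in indices if i not in held_out]
        yield train, test


folds = list(k_fold_indices(25, k=10))
print([len(test) for _, test in folds])
```

Averaging the per-fold scores gives a less split-dependent estimate than a single dev/test partition, which is presumably why it is used for the comparison with the CRF baseline.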
+
+ # References
+
+ Markus Krug, Lukas Weimer, Isabella Reger, Luisa Macharowsky, Stephan Feldhaus, Frank Puppe, Fotis Jannidis, [Description of a Corpus of Character References in German Novels](http://webdoc.sub.gwdg.de/pub/mon/dariah-de/dwp-2018-27.pdf), 2018.
+
+ Fotis Jannidis, Isabella Reger, Lukas Weimer, Markus Krug, Martin Toepfer, Frank Puppe, [Automatische Erkennung von Figuren in deutschsprachigen Romanen](https://opus.bibliothek.uni-wuerzburg.de/opus4-wuerzburg/frontdoor/deliver/index/docId/14333/file/Jannidis_Figurenerkennung_Roman.pdf), 2015.