Commit 80f4495 (1 parent: baaa6e7) by stefan-it

readme: add initial version
README.md ADDED

# Language Model for Historic Dutch

In this repository we open source a language model for Historic Dutch, trained on the
[Delpher Corpus](https://www.delpher.nl/over-delpher/delpher-open-krantenarchief/download-teksten-kranten-1618-1879),
which includes digitized texts from Dutch newspapers ranging from 1618 to 1879.

# Changelog

* 13.12.2021: Initial version of this repository.

# Model Zoo

The following models for Historic Dutch are available on the Hugging Face Model Hub:

| Model identifier                       | Model Hub link
| -------------------------------------- | -------------------------------------------------------------------
| `dbmdz/bert-base-historic-dutch-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-dutch-cased)
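
The model can be loaded with the [Transformers](https://github.com/huggingface/transformers) library. The following minimal sketch is not part of the original model card, and the example sentence is purely illustrative:

```bash
# Install Transformers and query the cased model with a fill-mask pipeline.
pip install transformers torch

python3 -c "from transformers import pipeline; \
fill_mask = pipeline('fill-mask', model='dbmdz/bert-base-historic-dutch-cased'); \
print(fill_mask('Amsterdam is de [MASK] van Nederland.'))"
```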

# Stats

The download URLs for all archives can be found [here](delpher-corpus.urls).

We then used the awesome `alto-tools` from [this](https://github.com/cneud/alto-tools)
repository to extract plain text (a rough download and extraction sketch is shown after the table below).
The following table shows the extracted plain text size per year range:

| Period    | Extracted plain text size
| --------- | -------------------------:
| 1618-1699 | 170MB
| 1700-1709 | 103MB
| 1710-1719 | 65MB
| 1720-1729 | 137MB
| 1730-1739 | 144MB
| 1740-1749 | 188MB
| 1750-1759 | 171MB
| 1760-1769 | 235MB
| 1770-1779 | 271MB
| 1780-1789 | 414MB
| 1790-1799 | 614MB
| 1800-1809 | 734MB
| 1810-1819 | 807MB
| 1820-1829 | 987MB
| 1830-1839 | 1.7GB
| 1840-1849 | 2.2GB
| 1850-1854 | 1.3GB
| 1855-1859 | 1.7GB
| 1860-1864 | 2.0GB
| 1865-1869 | 2.3GB
| 1870-1874 | 1.9GB
| 1875-1876 | 867MB
| 1877-1879 | 1.9GB
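
As a rough guide, and not part of the original repository, fetching the archives and extracting plain text could look like this; the exact `alto-tools` invocation is an assumption, so check its `--help`:

```bash
# Hedged sketch: download all archives listed in delpher-corpus.urls,
# unpack them and extract plain text from the contained ALTO XML files.
wget --content-disposition --input-file delpher-corpus.urls

mkdir -p extracted
for archive in *.zip; do
    unzip -q "$archive" -d extracted/
done

# The --text flag of alto-tools (https://github.com/cneud/alto-tools) is assumed here.
find extracted/ -name '*.xml' -exec alto-tools --text {} \; > corpus.txt
```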

The total training corpus consists of 427,181,269 sentences and 3,509,581,683 tokens (counted via `wc`),
resulting in a total corpus size of 21GB.
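
These counts can be reproduced with standard command-line tools, assuming one sentence per line in the extracted plain text (`corpus.txt` is the placeholder name from the sketch above):

```bash
wc -l corpus.txt   # number of sentences (one sentence per line)
wc -w corpus.txt   # number of whitespace-separated tokens
du -h corpus.txt   # total corpus size on disk
```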

The following figure shows the distribution of characters per year:

![Delpher Corpus Stats](/figures/delpher_corpus_stats.png)

# Language Model Pretraining

We use the official [BERT](https://github.com/google-research/bert) implementation and train the model
with the following command:

```bash
python3 run_pretraining.py --input_file gs://delpher-bert/tfrecords/*.tfrecord \
--output_dir gs://delpher-bert/bert-base-historic-dutch-cased \
--bert_config_file ./config.json \
--max_seq_length=512 \
--max_predictions_per_seq=75 \
--do_train=True \
--train_batch_size=128 \
--num_train_steps=3000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=20 \
--use_tpu=True \
--tpu_name=electra-2 \
--num_tpu_cores=32
```
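
The tfrecords referenced in `--input_file` are typically produced beforehand with BERT's `create_pretraining_data.py`. The following is only a hedged sketch; the input shards, vocab file and the masking/duplication settings are placeholders, not the documented values for this model:

```bash
# Hedged sketch: create masked-LM tfrecords with the official BERT tooling.
python3 create_pretraining_data.py \
  --input_file=./corpus/shard-*.txt \
  --output_file=gs://delpher-bert/tfrecords/shard.tfrecord \
  --vocab_file=./vocab.txt \
  --do_lower_case=False \
  --max_seq_length=512 \
  --max_predictions_per_seq=75 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5
```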

We train the model for 3M steps using a total batch size of 128 on a v3-32 TPU. The pretraining loss curve can be seen
in the next figure:

![Delpher Pretraining Loss Curve](/figures/training_loss.png)

# Evaluation

We evaluate our model on the preprocessed Europeana NER dataset for Dutch, which was presented in the
["Data Centric Domain Adaptation for Historical Text with OCR Errors"](https://github.com/stefan-it/historic-domain-adaptation-icdar) paper.

The data is available in their repository. We perform a hyper-parameter search over:

* Batch sizes: `[4, 8]`
* Learning rates: `[3e-5, 5e-5]`
* Number of epochs: `[5, 10]`

and report the averaged F1-Score over 5 runs with different seeds (a sketch of this search is shown below). We also include [hmBERT](https://github.com/stefan-it/clef-hipe/blob/main/hlms.md) as a baseline model.
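
As a shell loop, the search could look like the following sketch; `run_ner.py` only stands in for a token classification fine-tuning script (for example the one shipped with Transformers), and the arguments are assumptions rather than the exact setup used:

```bash
# Hedged sketch of the hyper-parameter search; script name and arguments are placeholders.
for bs in 4 8; do
  for lr in 3e-5 5e-5; do
    for epochs in 5 10; do
      for seed in 1 2 3 4 5; do
        python3 run_ner.py \
          --model_name_or_path dbmdz/bert-base-historic-dutch-cased \
          --per_device_train_batch_size "$bs" \
          --learning_rate "$lr" \
          --num_train_epochs "$epochs" \
          --seed "$seed"
      done
    done
  done
done
```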

Results:

| Model               | F1-Score (Dev / Test)
| ------------------- | ---------------------
| hmBERT              | (82.73) / 81.34
| Maerz et al. (2021) | - / 84.2
| Ours                | (89.73) / 87.45

# License

All models are licensed under [MIT](LICENSE).

# Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗
delpher-corpus.urls ADDED

https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_180x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_181x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_174x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1875-6.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_175x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_16xx.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_170x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1870-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_183x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_173x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_172x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1855-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_184x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_176x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1865-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1860-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_182x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1877-9.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_177x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_1850-4.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_178x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_171x.zip
https://resolver.kb.nl/resolve?urn=DATA:kranten:kranten_pd_179x.zip
figures/delpher_corpus_stats.png ADDED
figures/training_loss.png ADDED