---
language: multilingual
license: mit
widget:
- text: "and I cannot conceive the reafon why [MASK] hath"
- text: "Täkäläinen sanomalehdistö [MASK] erit - täin"
- text: "Det vore [MASK] häller nödvändigt att be"
- text: "Comme, à cette époque [MASK] était celle de la"
- text: "In [MASK] an atmosphärischen Nahrungsmitteln"
---

# Historic Language Models (HLMs)

## Languages

Our Historic Language Model Zoo supports the following languages, including their training data sources:

| Language | Training data | Size
| -------- | ------------- | ----
| German   | [Europeana](http://www.europeana-newspapers.eu/) | 13-28GB (filtered)
| French   | [Europeana](http://www.europeana-newspapers.eu/) | 11-31GB (filtered)
| English  | [British Library](https://data.bl.uk/digbks/db14.html) | 24GB (year filtered)
| Finnish  | [Europeana](http://www.europeana-newspapers.eu/) | 1.2GB
| Swedish  | [Europeana](http://www.europeana-newspapers.eu/) | 1.1GB

## Models

At the moment, the following models are available on the Model Hub:

| Model identifier                               | Model Hub link
| ---------------------------------------------- | ---------------------------------------------------------------------------
| `dbmdz/bert-base-historic-multilingual-cased`  | [here](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased)
| `dbmdz/bert-base-historic-english-cased`       | [here](https://huggingface.co/dbmdz/bert-base-historic-english-cased)
| `dbmdz/bert-base-finnish-europeana-cased`      | [here](https://huggingface.co/dbmdz/bert-base-finnish-europeana-cased)
| `dbmdz/bert-base-swedish-europeana-cased`      | [here](https://huggingface.co/dbmdz/bert-base-swedish-europeana-cased)

We have also released smaller versions of the multilingual model:

| Model identifier                                | Model Hub link
| ----------------------------------------------- | -----------------------------------------------------------------------------
| `dbmdz/bert-tiny-historic-multilingual-cased`   | [here](https://huggingface.co/dbmdz/bert-tiny-historic-multilingual-cased)
| `dbmdz/bert-mini-historic-multilingual-cased`   | [here](https://huggingface.co/dbmdz/bert-mini-historic-multilingual-cased)
| `dbmdz/bert-small-historic-multilingual-cased`  | [here](https://huggingface.co/dbmdz/bert-small-historic-multilingual-cased)
| `dbmdz/bert-medium-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-medium-historic-multilingual-cased)

**Notice**: We previously released language models for Historic German and French that were trained on noisier data - see
[this repo](https://github.com/stefan-it/europeana-bert) for more information:

| Model identifier                          | Model Hub link
| ----------------------------------------- | -----------------------------------------------------------------------
| `dbmdz/bert-base-german-europeana-cased`  | [here](https://huggingface.co/dbmdz/bert-base-german-europeana-cased)
| `dbmdz/bert-base-french-europeana-cased`  | [here](https://huggingface.co/dbmdz/bert-base-french-europeana-cased)
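
All models can be used directly with the 🤗 Transformers library. A minimal fill-mask sketch for the multilingual base model (the example sentence is taken from the widget prompts above):

```python
from transformers import pipeline

# Load the multilingual historic BERT model from the Hugging Face Hub
fill_mask = pipeline("fill-mask", model="dbmdz/bert-base-historic-multilingual-cased")

# Historic English example (note the long-s spelling "reafon")
print(fill_mask("and I cannot conceive the reafon why [MASK] hath"))
```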

# Corpora Stats

## German Europeana Corpus

We provide statistics for different OCR confidence thresholds, which allow us to shrink the corpus size
and use less noisy data:

| OCR confidence | Size
| -------------- | ----
| **0.60**       | 28GB
| 0.65           | 18GB
| 0.70           | 13GB

For the final corpus we use an OCR confidence of 0.6 (28GB). The following plot shows the tokens-per-year distribution:

![German Europeana Corpus Stats](stats/figures/german_europeana_corpus_stats.png)

## French Europeana Corpus

As for German, we use different OCR confidence thresholds:

| OCR confidence | Size
| -------------- | ----
| 0.60           | 31GB
| 0.65           | 27GB
| **0.70**       | 27GB
| 0.75           | 23GB
| 0.80           | 11GB

For the final corpus we use an OCR confidence of 0.7 (27GB). The following plot shows the tokens-per-year distribution:

![French Europeana Corpus Stats](stats/figures/french_europeana_corpus_stats.png)

## British Library Corpus

Metadata is taken from [here](https://data.bl.uk/digbks/DB21.html). Statistics, including year filtering:

| Years             | Size
| ----------------- | ----
| ALL               | 24GB
| >= 1800 && < 1900 | 24GB

We use the year-filtered variant. The following plot shows the tokens-per-year distribution:

![British Library Corpus Stats](stats/figures/bl_corpus_stats.png)
99
+
100
+ ## Finnish Europeana Corpus
101
+
102
+ | OCR confidence | Size
103
+ | -------------- | ----
104
+ | 0.60 | 1.2GB
105
+
106
+ The following plot shows a tokens per year distribution:
107
+
108
+ ![Finnish Europeana Corpus Stats](stats/figures/finnish_europeana_corpus_stats.png)
109
+
110
+ ## Swedish Europeana Corpus
111
+
112
+ | OCR confidence | Size
113
+ | -------------- | ----
114
+ | 0.60 | 1.1GB
115
+
116
+ The following plot shows a tokens per year distribution:
117
+
118
+ ![Swedish Europeana Corpus Stats](stats/figures/swedish_europeana_corpus_stats.png)
119
+
120
+ ## All Corpora
121
+
122
+ The following plot shows a tokens per year distribution of the complete training corpus:
123
+
124
+ ![All Corpora Stats](stats/figures/all_corpus_stats.png)
125
+
126
+ # Multilingual Vocab generation
127
+
128
+ For the first attempt, we use the first 10GB of each pretraining corpus. We upsample both Finnish and Swedish to ~10GB.
129
+ The following tables shows the exact size that is used for generating a 32k and 64k subword vocabs:
130
+
131
+ | Language | Size
132
+ | -------- | ----
133
+ | German | 10GB
134
+ | French | 10GB
135
+ | English | 10GB
136
+ | Finnish | 9.5GB
137
+ | Swedish | 9.7GB
138
+
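
Such WordPiece vocabs can be generated with the 🤗 Tokenizers library. A minimal sketch for a cased 32k vocab - the file names are hypothetical placeholders for the per-language corpus slices listed above, and this is not necessarily the exact command we used:

```python
from tokenizers import BertWordPieceTokenizer

# Cased WordPiece tokenizer (no lowercasing, keep accents)
tokenizer = BertWordPieceTokenizer(lowercase=False, strip_accents=False)

# Hypothetical file names for the ~10GB per-language slices
files = [
    "german_10GB.txt", "french_10GB.txt", "english_10GB.txt",
    "finnish_upsampled.txt", "swedish_upsampled.txt",
]
tokenizer.train(files=files, vocab_size=32_000)
tokenizer.save_model(".")  # writes vocab.txt
```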

We then calculate the subword fertility rate and the portion of `[UNK]`s over the following NER corpora:

| Language | NER corpora
| -------- | ------------------
| German   | CLEF-HIPE, NewsEye
| French   | CLEF-HIPE, NewsEye
| English  | CLEF-HIPE
| Finnish  | NewsEye
| Swedish  | NewsEye

Breakdown of subword fertility rate and unknown portion per language for the 32k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ----------------- | ---------------
| German   | 1.43              | 0.0004
| French   | 1.25              | 0.0001
| English  | 1.25              | 0.0
| Finnish  | 1.69              | 0.0007
| Swedish  | 1.43              | 0.0

Breakdown of subword fertility rate and unknown portion per language for the 64k vocab:

| Language | Subword fertility | Unknown portion
| -------- | ----------------- | ---------------
| German   | 1.31              | 0.0004
| French   | 1.16              | 0.0001
| English  | 1.17              | 0.0
| Finnish  | 1.54              | 0.0007
| Swedish  | 1.32              | 0.0
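
Both metrics can be computed with the released tokenizers. The sketch below shows the calculation for a list of pre-tokenized words; reading the words from the actual CLEF-HIPE/NewsEye CoNLL files is left out, and the example word list is purely illustrative:

```python
from transformers import AutoTokenizer

def fertility_and_unk_portion(tokenizer, words):
    """Subword fertility = subwords per word; unknown portion = share of [UNK] subwords."""
    n_subwords = 0
    n_unk = 0
    for word in words:
        subwords = tokenizer.tokenize(word)
        n_subwords += len(subwords)
        n_unk += sum(1 for s in subwords if s == tokenizer.unk_token)
    return n_subwords / len(words), n_unk / n_subwords

tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-historic-multilingual-cased")
words = ["Täkäläinen", "sanomalehdistö", "on", "erittäin", "tyytyväinen"]
print(fertility_and_unk_portion(tokenizer, words))
```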

# Final pretraining corpora

We upsample Swedish and Finnish to ~27GB. The final stats for all pretraining corpora are:

| Language | Size
| -------- | ----
| German   | 28GB
| French   | 27GB
| English  | 24GB
| Finnish  | 27GB
| Swedish  | 27GB

The total size is around 130GB.
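
One simple way to upsample the smaller corpora is to repeat them until they reach roughly the target size. The sketch below illustrates this idea; the file names are hypothetical, and this is not necessarily the exact procedure we used:

```python
import os

def upsample(path: str, target_bytes: int, out_path: str) -> None:
    """Repeat a text corpus until it reaches roughly the target size."""
    repeats = max(1, round(target_bytes / os.path.getsize(path)))
    with open(out_path, "w", encoding="utf-8") as out:
        for _ in range(repeats):
            with open(path, encoding="utf-8") as f:
                for line in f:
                    out.write(line)

# Hypothetical usage: bring the Swedish corpus up to ~27GB
upsample("extracted_content_Swedish_0.6.txt", 27 * 1024**3, "swedish_upsampled.txt")
```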

# Smaller multilingual models

Inspired by the ["Well-Read Students Learn Better: On the Importance of Pre-training Compact Models"](https://arxiv.org/abs/1908.08962)
paper, we train smaller models (with different numbers of layers and hidden sizes) and report the number of parameters and pre-training costs:

| Model (Layer / Hidden size) | Parameters | Pre-Training time
| --------------------------- | ---------: | ----------------------:
| hmBERT Tiny   ( 2/128)      | 4.58M      | 4.3 sec / 1,000 steps
| hmBERT Mini   ( 4/256)      | 11.55M     | 10.5 sec / 1,000 steps
| hmBERT Small  ( 4/512)      | 29.52M     | 20.7 sec / 1,000 steps
| hmBERT Medium ( 8/512)      | 42.13M     | 35.0 sec / 1,000 steps
| hmBERT Base   (12/768)      | 110.62M    | 80.0 sec / 1,000 steps
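
As an illustration, such a compact configuration can be expressed with the `transformers` library as shown below. The number of attention heads and the intermediate size follow the standard BERT scaling used in the compact-models paper and are assumptions, not values read from the released config files:

```python
from transformers import BertConfig, BertForMaskedLM

# Sketch of an hmBERT Tiny-like configuration (2 layers / 128 hidden, 32k vocab)
config = BertConfig(
    vocab_size=32_000,
    num_hidden_layers=2,
    hidden_size=128,
    num_attention_heads=2,
    intermediate_size=512,  # 4 x hidden size (assumption)
)
model = BertForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.2f}M parameters")  # roughly in line with the 4.58M above
```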

We then perform downstream evaluations on the multilingual [NewsEye](https://zenodo.org/record/4573313#.Ya3oVr-ZNzU) dataset:

![NewsEye hmBERT Evaluation](stats/figures/newseye-hmbert-evaluation.png)

# Pretraining

## Multilingual model - hmBERT Base

We train a multilingual BERT model using the 32k vocab with the official BERT implementation
on a v3-32 TPU, with the following parameters:

```bash
python3 run_pretraining.py --input_file gs://histolectra/historic-multilingual-tfrecords/*.tfrecord \
--output_dir gs://histolectra/bert-base-historic-multilingual-cased \
--bert_config_file ./config.json \
--max_seq_length=512 \
--max_predictions_per_seq=75 \
--do_train=True \
--train_batch_size=128 \
--num_train_steps=3000000 \
--learning_rate=1e-4 \
--save_checkpoints_steps=100000 \
--keep_checkpoint_max=20 \
--use_tpu=True \
--tpu_name=electra-2 \
--num_tpu_cores=32
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_historic-multilingual.png)

## Smaller multilingual models

We use the same parameters as for training the base model.

### hmBERT Tiny

The following plot shows the pretraining loss curve for the tiny model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-tiny.png)

### hmBERT Mini

The following plot shows the pretraining loss curve for the mini model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-mini.png)

### hmBERT Small

The following plot shows the pretraining loss curve for the small model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-small.png)

### hmBERT Medium

The following plot shows the pretraining loss curve for the medium model:

![Training loss curve](stats/figures/pretraining_loss_hmbert-medium.png)

## English model

The English BERT model - with texts from the British Library corpus - was trained with the Hugging Face
JAX/FLAX implementation for 10 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-historic-english-cased/ \
--tokenizer_name /mnt/datasets/bert-base-historic-english-cased/ \
--train_file /mnt/datasets/bl-corpus/bl_1800-1900_extracted.txt \
--validation_file /mnt/datasets/bl-corpus/english_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 10 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-historic-english-cased-512-noadafactor-10e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_historic_english.png)

## Finnish model

The BERT model - with texts from the Finnish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
--tokenizer_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
--train_file /mnt/datasets/hlms/extracted_content_Finnish_0.6.txt \
--validation_file /mnt/datasets/hlms/finnish_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 40 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-finnish-europeana-cased-512-dupe1-noadafactor-40e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_finnish_europeana.png)

## Swedish model

The BERT model - with texts from the Swedish part of Europeana - was trained with the Hugging Face
JAX/FLAX implementation for 40 epochs (approx. 660K steps) on a v3-8 TPU, using the following command:

```bash
python3 run_mlm_flax.py --model_type bert \
--config_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
--tokenizer_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
--train_file /mnt/datasets/hlms/extracted_content_Swedish_0.6.txt \
--validation_file /mnt/datasets/hlms/swedish_validation.txt \
--max_seq_length 512 \
--per_device_train_batch_size 16 \
--learning_rate 1e-4 \
--num_train_epochs 40 \
--preprocessing_num_workers 96 \
--output_dir /mnt/datasets/bert-base-swedish-europeana-cased-512-dupe1-noadafactor-40e \
--save_steps 2500 \
--eval_steps 2500 \
--warmup_steps 10000 \
--line_by_line \
--pad_to_max_length
```

The following plot shows the pretraining loss curve:

![Training loss curve](stats/figures/pretraining_loss_swedish_europeana.png)

# Acknowledgments

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) program, previously known as
TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️

Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
it is possible to download both cased and uncased models from their S3 storage 🤗