stefan-it committed
Commit 514a9e7
1 Parent(s): e14d20d

readme: update

Files changed (1): README.md (+101 -2)
README.md CHANGED
@@ -11,6 +11,8 @@ widget:
 
 # Historic Language Models (HLMs)
 
+ ## Languages
+
 Our Historic Language Models Zoo contains support for the following languages - incl. their training data source:
 
 | Language | Training data | Size
@@ -21,6 +23,17 @@ Our Historic Language Models Zoo contains support for the following languages -
 | Finnish | [Europeana](http://www.europeana-newspapers.eu/) | 1.2GB
 | Swedish | [Europeana](http://www.europeana-newspapers.eu/) | 1.1GB
 
+ ## Models
+
+ At the moment, the following models are available on the model hub:
+
+ | Model identifier                              | Model Hub link
+ | --------------------------------------------- | --------------------------------------------------------------------------
+ | `dbmdz/bert-base-historic-multilingual-cased` | [here](https://huggingface.co/dbmdz/bert-base-historic-multilingual-cased)
+ | `dbmdz/bert-base-historic-english-cased`      | [here](https://huggingface.co/dbmdz/bert-base-historic-english-cased)
+ | `dbmdz/bert-base-finnish-europeana-cased`     | [here](https://huggingface.co/dbmdz/bert-base-finnish-europeana-cased)
+ | `dbmdz/bert-base-swedish-europeana-cased`     | [here](https://huggingface.co/dbmdz/bert-base-swedish-europeana-cased)
+
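Each of these checkpoints can be loaded directly with the `transformers` library. A minimal usage sketch, assuming a recent `transformers` installation with the fill-mask pipeline (the example sentence is purely illustrative):

```python
from transformers import pipeline

# Any identifier from the table above works the same way;
# the multilingual checkpoint is used here as an example.
fill_mask = pipeline(
    "fill-mask",
    model="dbmdz/bert-base-historic-multilingual-cased",
)

# BERT-style models use the [MASK] token for the blank position.
print(fill_mask("In the year 1865 the [MASK] was opened to the public."))
```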
 # Corpora Stats
 
 ## German Europeana Corpus
@@ -152,6 +165,8 @@ Total size is 130GB.
 
 # Pretraining
 
+ ## Multilingual model
+
 We train a multilingual BERT model using the 32k vocab with the official BERT implementation
 on a v3-32 TPU using the following parameters:
 
@@ -174,7 +189,91 @@ python3 run_pretraining.py --input_file gs://histolectra/historic-multilingual-t
 
 The following plot shows the pretraining loss curve:
 
- ![Training loss curve](stats/figures/pretraining_loss.png)
+ ![Training loss curve](stats/figures/pretraining_loss_historic-multilingual.png)
+
+ ## English model
+
+ The English BERT model - with texts from the British Library corpus - was trained with the Hugging Face
+ JAX/FLAX implementation for 10 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:
+
+ ```bash
+ python3 run_mlm_flax.py --model_type bert \
+ --config_name /mnt/datasets/bert-base-historic-english-cased/ \
+ --tokenizer_name /mnt/datasets/bert-base-historic-english-cased/ \
+ --train_file /mnt/datasets/bl-corpus/bl_1800-1900_extracted.txt \
+ --validation_file /mnt/datasets/bl-corpus/english_validation.txt \
+ --max_seq_length 512 \
+ --per_device_train_batch_size 16 \
+ --learning_rate 1e-4 \
+ --num_train_epochs 10 \
+ --preprocessing_num_workers 96 \
+ --output_dir /mnt/datasets/bert-base-historic-english-cased-512-noadafactor-10e \
+ --save_steps 2500 \
+ --eval_steps 2500 \
+ --warmup_steps 10000 \
+ --line_by_line \
+ --pad_to_max_length
+ ```
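The `--output_dir` above receives a Flax checkpoint. A minimal sketch of loading it with the PyTorch classes and re-saving it for regular use, assuming a `transformers` build with both Flax and PyTorch installed (the tokenizer path simply reuses `--tokenizer_name` from the command above):

```python
from transformers import BertForMaskedLM, BertTokenizerFast

# Directory written by run_mlm_flax.py (contains config.json and flax_model.msgpack).
ckpt_dir = "/mnt/datasets/bert-base-historic-english-cased-512-noadafactor-10e"

# Load the Flax weights into the PyTorch model class ...
model = BertForMaskedLM.from_pretrained(ckpt_dir, from_flax=True)
tokenizer = BertTokenizerFast.from_pretrained("/mnt/datasets/bert-base-historic-english-cased/")

# ... and write them back as a PyTorch checkpoint next to the Flax one.
model.save_pretrained(ckpt_dir)
tokenizer.save_pretrained(ckpt_dir)
```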
+
+ The following plot shows the pretraining loss curve:
+
+ ![Training loss curve](stats/figures/pretraining_loss_historic_english.png)
+
+ ## Finnish model
+
+ The BERT model - with texts from the Finnish part of Europeana - was trained with the Hugging Face
+ JAX/FLAX implementation for 40 epochs (approx. 1M steps) on a v3-8 TPU, using the following command:
+
+ ```bash
+ python3 run_mlm_flax.py --model_type bert \
+ --config_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
+ --tokenizer_name /mnt/datasets/bert-base-finnish-europeana-cased/ \
+ --train_file /mnt/datasets/hlms/extracted_content_Finnish_0.6.txt \
+ --validation_file /mnt/datasets/hlms/finnish_validation.txt \
+ --max_seq_length 512 \
+ --per_device_train_batch_size 16 \
+ --learning_rate 1e-4 \
+ --num_train_epochs 40 \
+ --preprocessing_num_workers 96 \
+ --output_dir /mnt/datasets/bert-base-finnish-europeana-cased-512-dupe1-noadafactor-40e \
+ --save_steps 2500 \
+ --eval_steps 2500 \
+ --warmup_steps 10000 \
+ --line_by_line \
+ --pad_to_max_length
+ ```
+
+ The following plot shows the pretraining loss curve:
+
+ ![Training loss curve](stats/figures/pretraining_loss_finnish_europeana.png)
+
+ ## Swedish model
+
+ The BERT model - with texts from the Swedish part of Europeana - was trained with the Hugging Face
+ JAX/FLAX implementation for 40 epochs (approx. 660K steps) on a v3-8 TPU, using the following command:
+
+ ```bash
+ python3 run_mlm_flax.py --model_type bert \
+ --config_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
+ --tokenizer_name /mnt/datasets/bert-base-swedish-europeana-cased/ \
+ --train_file /mnt/datasets/hlms/extracted_content_Swedish_0.6.txt \
+ --validation_file /mnt/datasets/hlms/swedish_validation.txt \
+ --max_seq_length 512 \
+ --per_device_train_batch_size 16 \
+ --learning_rate 1e-4 \
+ --num_train_epochs 40 \
+ --preprocessing_num_workers 96 \
+ --output_dir /mnt/datasets/bert-base-swedish-europeana-cased-512-dupe1-noadafactor-40e \
+ --save_steps 2500 \
+ --eval_steps 2500 \
+ --warmup_steps 10000 \
+ --line_by_line \
+ --pad_to_max_length
+ ```
+
+ The following plot shows the pretraining loss curve:
+
+ ![Training loss curve](stats/figures/pretraining_loss_swedish_europeana.png)
 
 # Acknowledgments
 
@@ -182,4 +281,4 @@ Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC) progra
 TensorFlow Research Cloud (TFRC). Many thanks for providing access to the TRC ❤️
 
 Thanks to the generous support from the [Hugging Face](https://huggingface.co/) team,
- it is possible to download both cased and uncased models from their S3 storage 🤗
+ it is possible to download both cased and uncased models from their S3 storage 🤗