language: de
license: mit
tags:
- historic german
dbmdz BERT models
In this repository, the MDZ Digital Library team (dbmdz) at the Bavarian State Library open-sources German Europeana BERT models.
German Europeana BERT
We use the open-source Europeana newspapers that were provided by The European Library. The final training corpus is 51 GB in size and consists of 8,035,986,369 tokens.
Detailed information about the data and pretraining steps can be found in this repository.
Model weights
Currently, only PyTorch-Transformers-compatible weights are available. If you need access to TensorFlow checkpoints, please raise an issue!
Model | Downloads |
---|---|
dbmdz/bert-base-german-europeana-cased | config.json • pytorch_model.bin • vocab.txt |
Results
For results on Historic NER, please refer to this repository.
Usage
With Transformers >= 2.3, our German Europeana BERT model can be loaded as follows:
from transformers import AutoModel, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("dbmdz/bert-base-german-europeana-cased")
model = AutoModel.from_pretrained("dbmdz/bert-base-german-europeana-cased")
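Once the model is loaded, a common next step is to turn sentences into fixed-size vectors. The following is a minimal sketch of one way to do that, not a method prescribed by this repository. It assumes PyTorch and a more recent Transformers release (3.0 or later, where the tokenizer can be called directly on a list of strings). The helper names `mean_pool` and `embed_sentences` are hypothetical, and mean pooling over token vectors is only one of several reasonable pooling choices.

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_NAME = "dbmdz/bert-base-german-europeana-cased"


def mean_pool(hidden_states, attention_mask):
    """Average token vectors per sentence, ignoring padding positions.

    Hypothetical helper for illustration; not part of Transformers.
    """
    # Expand the mask to (batch, seq_len, 1) so it broadcasts over the hidden size.
    mask = attention_mask.unsqueeze(-1).to(hidden_states.dtype)
    summed = (hidden_states * mask).sum(dim=1)
    # Guard against division by zero for an all-padding row.
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


def embed_sentences(sentences, model_name=MODEL_NAME):
    """Return one mean-pooled vector per input sentence (hypothetical helper)."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic inference

    # Assumes Transformers >= 3.0, where calling the tokenizer pads and batches.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**batch)

    # outputs[0] is the last hidden state: (batch, seq_len, hidden_size).
    return mean_pool(outputs[0], batch["attention_mask"])
```

Calling `embed_sentences(["Die Zeitung erschien täglich in München."])` downloads the weights on first use and returns a tensor with one row per sentence.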
Hugging Face model hub
All models are available on the Hugging Face model hub.
Contact (Bugs, Feedback, Contribution and more)
For questions about our BERT models, just open an issue here.
Acknowledgments
Research supported with Cloud TPUs from Google's TensorFlow Research Cloud (TFRC). Thanks for providing access to the TFRC.
Thanks to the generous support from the Hugging Face team, it is possible to download both cased and uncased models from their S3 storage.