
Model Details

This is a fine-tuned version of the multilingual BERT model on medieval texts. The model is intended to serve as a foundation for downstream NLP and HTR (handwritten text recognition) tasks.
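
A minimal usage sketch with the 🤗 Transformers library, assuming the checkpoint is published under `magistermilitum/bert_medieval_multilingual` and exposes a masked-language-modeling head (the example sentence is purely illustrative):

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

# Load the model as a foundation for downstream NLP/HTR tasks.
# Assumption: the checkpoint includes an MLM head, as is usual for BERT.
model_id = "magistermilitum/bert_medieval_multilingual"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForMaskedLM.from_pretrained(model_id)

# Masked-token prediction on a (hypothetical) medieval Latin phrase.
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
for pred in fill_mask("In nomine [MASK] amen."):
    print(pred["token_str"], round(pred["score"], 3))
```

For sequence labeling or classification, the same checkpoint can be loaded with `AutoModelForTokenClassification` or `AutoModelForSequenceClassification` and fine-tuned further.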

The training dataset comprises 650M tokens drawn from texts in Classical and Medieval Latin, Old French, and Old Spanish, covering a period from the 5th century BC to the 16th century.

Several large corpora were cleaned and transformed for use during training:

| Dataset             | Size                    | Lang.  | Dates               |
|---------------------|-------------------------|--------|---------------------|
| CC100 [1]           | 3.2 GB                  | la     | 5th c. BC - 18th c. |
| Corpus Corporum [2] | 3.0 GB                  | la     | 5th c. BC - 16th c. |
| CEMA [3]            | 320 MB                  | la+fro | 9th-15th c.         |
| HOME-Alcar [4]      | 38 MB                   | la+fro | 12th-15th c.        |
| BFM [5]             | 34 MB                   | fro    | 13th-15th c.        |
| AND [6]             | 19 MB                   | fro    | 13th-15th c.        |
| CODEA [7]           | 13 MB                   | spa    | 12th-16th c.        |
| **Total (raw)**     | ~6.5 GB                 |        |                     |
| **After cleaning**  | 650M tokens (4.5 GB)\*  |        |                     |
\* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic filler text ("Lorem ipsum dolorem...") was iteratively removed.
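
The exact cleaning pipeline is not documented here; the sketch below shows one plausible approach consistent with the note above (exact-duplicate removal by hashing normalized paragraphs, plus a filter for "Lorem ipsum" filler). The function name and normalization choices are hypothetical.

```python
import hashlib

def dedup_paragraphs(paragraphs):
    """Drop exact duplicates and synthetic filler (illustrative sketch only)."""
    seen, kept = set(), []
    for p in paragraphs:
        # Normalize whitespace and case before hashing so trivial
        # formatting differences do not hide duplicates.
        key = hashlib.sha1(" ".join(p.lower().split()).encode("utf-8")).hexdigest()
        if key not in seen and "lorem ipsum" not in p.lower():
            seen.add(key)
            kept.append(p)
    return kept

corpus = ["In nomine Domini amen.",
          "In  nomine   domini amen.",   # whitespace/case variant of the first
          "Lorem ipsum dolorem ..."]     # synthetic filler
print(dedup_paragraphs(corpus))  # -> ['In nomine Domini amen.']
```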

[1] CC100 (CC-Net repository): https://huggingface.co/datasets/cc100

[2] Corpus Corporum, repositorium operum Latinorum apud universitatem Turicensem (repository of Latin works at the University of Zurich): https://mlat.uzh.ch/

[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/

[4] HOME-Alcar (History of Medieval Europe): https://doi.org/10.5281/zenodo.5600884

[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/

[6] Anglo-Norman Dictionary: https://anglo-norman.net/

[7] CODEA, Corpus de Documentos Españoles Anteriores a 1900: https://www.corpuscodea.es/

