---
license: mit
widget:
- text: Universis presentes [MASK] inspecturis
- text: eandem [MASK] per omnia parati observare
- text: yo [MASK] rey de Galicia, de las Indias
- text: en avant contre les choses [MASK] contenues
datasets:
- cc100
- bigscience-historical-texts/Open_Medieval_French
- latinwikipedia
language:
- la
- fr
- es
---
## Model Details
This is a RoBERTa model trained from scratch on medieval texts. It is intended as a foundation for downstream machine-learning tasks in NLP and HTR (handwritten text recognition) settings.
The training dataset comprises 650M tokens drawn from texts in Classical and Medieval Latin, Old French, and Old Spanish, covering a period from the 5th century BC to the 16th century.
Several large corpora were cleaned and transformed for use in training:
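
As a starting point, the masked-language-model head can be queried through the `transformers` fill-mask pipeline. The following is a minimal sketch, not an official usage recipe: `MODEL_ID` is a hypothetical placeholder (this card does not state the repository name), and the helper names are illustrative. The widget examples use `[MASK]`, so the sketch reads the mask token from the tokenizer instead of assuming one.

```python
# Hypothetical placeholder: replace with the actual repository name of this model.
MODEL_ID = "your-namespace/your-medieval-roberta"


def with_mask(template: str, mask_token: str) -> str:
    """Swap the generic [MASK] marker for the tokenizer's own mask token."""
    return template.replace("[MASK]", mask_token)


def top_predictions(template: str, model_id: str = MODEL_ID, k: int = 5):
    """Return (token, score) pairs for the k most likely fillers."""
    # Imported lazily so with_mask() can be used without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=model_id)
    prompt = with_mask(template, fill_mask.tokenizer.mask_token)
    return [(p["token_str"], p["score"]) for p in fill_mask(prompt, top_k=k)]


# Example call (downloads the model, so it is left commented out):
# for token, score in top_predictions("Universis presentes [MASK] inspecturis"):
#     print(f"{token!r}: {score:.3f}")
```

Each template should contain exactly one `[MASK]`; with several masks the pipeline returns a nested list and the list comprehension above would need adapting.
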
| Dataset | Size | Language | Dates (century) |
| ------------- |:-------------:| -----:|-----:|
| CC100 [1] | 3.2 GB | la | 5th BC - 18th |
| Corpus Corporum [2] | 3.0 GB | la | 5th BC - 16th |
| CEMA [3] | 320 MB | la+fro | 9th - 15th |
| HOME-Alcar [4] | 38 MB | la+fro | 12th - 15th |
| BFM [5] | 34 MB | fro | 13th - 15th |
| AND [6] | 19 MB | fro | 13th - 15th |
| CODEA [7] | 13 MB | spa | 12th - 16th |
| Total | ~6.5 GB | | |
| Total after cleaning | 650M tokens (4.5 GB)* | | |
* A significant amount of overlapping text was detected across the corpora, especially in the medieval collections. In addition, synthetic text ("Lorem ipsum dolorem...") was iteratively removed.
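
The card does not describe how the overlap and the placeholder text were removed. Purely as an illustration, the sketch below drops lines that are exact duplicates after case and whitespace normalization, and lines containing "lorem ipsum". The function names and the line-level, hash-based strategy are assumptions, not the authors' pipeline.

```python
import hashlib
import re

# Matches the usual filler-text opener regardless of case or spacing.
LOREM = re.compile(r"lorem\s+ipsum", re.IGNORECASE)


def normalize(line: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(line.lower().split())


def clean_corpus(lines):
    """Keep the first occurrence of each normalized line; drop filler and blanks."""
    seen = set()
    kept = []
    for line in lines:
        key = normalize(line)
        if not key or LOREM.search(key):
            continue  # empty line or synthetic placeholder text
        digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
        if digest in seen:
            continue  # already seen in this or another corpus
        seen.add(digest)
        kept.append(line.strip())
    return kept
```

Exact matching of this kind misses near-duplicates such as differing editions or transcriptions of the same charter; catching those would need a fuzzy method (for example, shingling with MinHash), which is outside this sketch.
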
[1] CC-NET Repository: https://huggingface.co/datasets/cc100
[2] Repositorium operum Latinorum apud universitatem Turicensem: https://mlat.uzh.ch/
[3] Cartae Europae Medii Aevi (5th-15th c.): https://cema.lamop.fr/
[4] History of Medieval Europe: https://doi.org/10.5281/zenodo.5600884
[5] Base de Français Médiéval: https://txm-bfm.huma-num.fr/txm/
[6] Anglo-Norman Dictionary: https://anglo-norman.net/
[7] Corpus de Documentos Españoles Anteriores a 1900: https://www.corpuscodea.es/