RoBERTa Latin model, version 3 (model card not finished yet)
This is a Latin RoBERTa-based LM, version 3.
The intention behind this Transformer-based LM is twofold: on the one hand, it will be used to evaluate HTR results; on the other, it will serve as the decoder in a TrOCR architecture.
The training data differs from that used for RoBERTa Base Latin Cased V1 and V2, and therefore also from that used by Bamman and Burns (2020). We exclusively used the text from the Corpus Corporum, collected and maintained by the University of Zurich.
The overall corpus contains 1.5 GB of text data (three times as much as was used for V2, and very likely of better quality).
Preprocessing
I undertook the following preprocessing steps:
- Normalisation of all lines with CLTK, including sentence splitting.
- Language identification with langid.
- Retention of only the Latin lines.
The result is a corpus of ~232 million tokens.
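The filtering step above can be sketched as follows. This is a minimal, self-contained illustration, not the actual preprocessing code: `identify_language()` is a toy stand-in for `langid.classify()` (which returns a `(language_code, score)` tuple), and the real pipeline first normalises each line with CLTK before classification.

```python
# Toy set of common Latin function words, used only by the stand-in
# classifier below; the real pipeline relies on langid's trained model.
LATIN_FUNCTION_WORDS = {"et", "in", "est", "non", "ad", "cum", "ut", "qui", "quae", "sed"}


def identify_language(line):
    """Stand-in for langid.classify(): returns a (language_code, score)
    tuple, flagging a line as Latin if it contains common Latin function
    words. Purely illustrative, not a real language identifier."""
    tokens = line.lower().split()
    if not tokens:
        return ("unk", 0.0)
    hits = sum(1 for t in tokens if t in LATIN_FUNCTION_WORDS)
    score = hits / len(tokens)
    return ("la", score) if hits else ("unk", score)


def filter_latin(lines):
    """Keep only the lines identified as Latin, mirroring the final
    preprocessing step."""
    return [line for line in lines if identify_language(line)[0] == "la"]


lines = [
    "Gallia est omnis divisa in partes tres",
    "This line is English and should be dropped",
]
print(filter_latin(lines))  # only the Latin line survives
```

With `langid` installed, the stand-in would be replaced by `langid.classify(line)` applied to the CLTK-normalised text.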
The dataset used to train this model will be made available on Hugging Face later HERE (link does not work yet).
Contact
To get in touch, reach out to Phillip Ströbel via e-mail or via Twitter.
How to cite
If you use this model, please cite it as:
@online{stroebel-roberta-base-latin-cased3,
author = {Ströbel, Phillip Benjamin},
title = {RoBERTa Base Latin Cased V3},
year = 2022,
url = {https://huggingface.co/pstroe/roberta-base-latin-cased3},
urldate = {YYYY-MM-DD}
}