
RoBERTa Latin model, version 2 (model card not finished yet)

This is a Latin RoBERTa-based language model, version 2.
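
As a plain RoBERTa model, it can be used for masked-token prediction out of the box. Below is a minimal usage sketch with the Hugging Face pipeline API; the example sentence is arbitrary and the predictions depend entirely on the model weights.

```python
from transformers import pipeline

# Minimal fill-mask sketch; assumes the transformers library is installed.
fill_mask = pipeline("fill-mask", model="pstroe/roberta-base-latin-cased2")

# RoBERTa tokenizers use "<mask>" as the mask token.
print(fill_mask("Gallia est omnis divisa in partes <mask>."))
```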

The intention behind this Transformer-based language model is twofold: on the one hand, it will be used to evaluate HTR results; on the other, it should serve as the decoder in a TrOCR architecture.
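
For the second use case, the model can in principle be plugged into a vision encoder-decoder setup as the text decoder. The sketch below is only an illustration under assumptions: the image-encoder checkpoint is a placeholder and does not reflect the exact TrOCR configuration used here.

```python
from transformers import VisionEncoderDecoderModel

# Illustrative sketch: pair an image encoder with this RoBERTa model as decoder.
# The encoder checkpoint is an assumed placeholder, not the actual TrOCR encoder used.
model = VisionEncoderDecoderModel.from_encoder_decoder_pretrained(
    "google/vit-base-patch16-224-in21k",   # assumed image encoder
    "pstroe/roberta-base-latin-cased2",    # this model as text decoder
)

# Before fine-tuning or generation, the decoder start and pad token ids
# usually have to be set from the decoder's tokenizer.
```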

The training data is largely the same as that used by Bamman and Burns (2020), although more heavily filtered (see below). It includes several born-digital texts from online Latin archives; other Latin texts were crawled by Bamman and Smith and therefore contain many OCR errors.

The overall downsampled corpus contains 577 MB of text data.

Preprocessing

I undertook the following preprocessing steps (a code sketch follows the list):

  • Normalisation of all lines with CLTK, including sentence splitting.
  • Language identification with langid.
  • Computation of the ratio of Latin vocabulary in each sentence (against the born-digital vocabulary of the corpus).
  • Retention of only those sentences with a Latin vocabulary ratio of more than 85%.
  • Exclusion of all lines containing '^', as this hints at the presence of OCR errors.
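
The sketch below illustrates how such a sentence filter could look. The helper names, the whitespace tokenisation, and the exact combination of checks are my assumptions; the CLTK normalisation and sentence splitting are not shown.

```python
import langid

def latin_vocab_ratio(sentence, latin_vocab):
    """Share of tokens that occur in the born-digital Latin vocabulary."""
    tokens = sentence.split()  # assumed whitespace tokenisation
    if not tokens:
        return 0.0
    hits = sum(1 for tok in tokens if tok.lower() in latin_vocab)
    return hits / len(tokens)

def keep_sentence(sentence, latin_vocab, min_ratio=0.85):
    # Exclude lines containing '^' (hints at OCR errors).
    if "^" in sentence:
        return False
    # Language identification with langid ("la" = Latin).
    lang, _score = langid.classify(sentence)
    if lang != "la":
        return False
    # Retain only sentences with a Latin vocabulary ratio above the threshold.
    return latin_vocab_ratio(sentence, latin_vocab) > min_ratio
```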

The result is a corpus of ~100 million tokens.

The dataset used to train this model will be made available on Hugging Face later HERE (link does not work yet).

Contact

Reach out to Phillip Ströbel via email or via Twitter.

How to cite

If you use this model, please cite it as:

@online{stroebel-roberta-base-latin-cased2,
    author = {Ströbel, Phillip Benjamin},
    title = {RoBERTa Base Latin Cased V2},
    year = 2022,
    url = {https://huggingface.co/pstroe/roberta-base-latin-cased2},
    urldate = {YYYY-MM-DD}
}