BERTislav

Baseline fill-mask model based on ruBERT and fine-tuned on a 10M-word corpus of mixed Old Church Slavonic, (Later) Church Slavonic, Old East Slavic, Middle Russian, and Medieval Serbian texts.

Overview

Model Name: BERTislav
Task: Fill-mask
Base Model: ai-forever/ruBert-base
Languages: orv (Old East Slavic, Middle Russian), cu (Old Church Slavonic, Church Slavonic)
Developed by: Nilo Pedrazzini

Input Format

A string containing one or more [MASK] tokens.

Output Format

The predicted tokens, each with its confidence score.
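The input and output formats described above can be tried out with the Hugging Face `transformers` fill-mask pipeline. The following is only a minimal sketch, not an official usage example: the repository id in `MODEL_ID` is an assumption (this card does not state it), and the helper names `check_input`, `top_predictions`, and `format_predictions` are hypothetical. It assumes an input with a single [MASK] token, for which the pipeline returns a flat list of dictionaries with `token_str` and `score` keys.

```python
from typing import Dict, List

# Assumption: the Hub repository id is not stated in this card; replace as needed.
MODEL_ID = "npedrazzini/BERTislav"
MASK = "[MASK]"


def check_input(text: str) -> str:
    """Return the text unchanged if it contains at least one [MASK] token."""
    if MASK not in text:
        raise ValueError(f"input must contain at least one {MASK} token")
    return text


def top_predictions(text: str, top_k: int = 5) -> List[Dict]:
    """Run the fill-mask pipeline on a single-mask input and return its predictions."""
    # Imported lazily so the other helpers work without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    return fill_mask(check_input(text), top_k=top_k)


def format_predictions(predictions: List[Dict]) -> List[str]:
    """Render each prediction as 'token<TAB>score', highest score first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [f"{p['token_str']}\t{p['score']:.3f}" for p in ranked]
```

Under these assumptions, `format_predictions(top_predictions(text))` would list the candidate tokens for the masked position in `text`. With more than one [MASK] token the pipeline returns one list of predictions per mask, so the formatting helper would need to be applied to each list separately.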

Examples

Example 1:

COMING SOON

Uses

The model can be used as a baseline for further fine-tuning on specific downstream tasks (e.g. linguistic annotation).

Bias, Risks, and Limitations

The model should only be considered a baseline and should not be evaluated on its own. Further testing is needed to establish whether it improves the performance of language models fine-tuned for specific tasks.

Training Details

The texts used as training data are from the following sources:

Fundamental Digital Library Russian Literature & Folklore (FEB-web)
Puškinskij Dom's Библиотека литературы Древней Руси
Cyrillomethodiana
Parts of the Bdinski Sbornik, as digitized in Obdurodon
Tromsø Old Russian and Old Church Slavonic Treebank (TOROT)

NB: The training texts were heavily normalized, and anyone planning to use the model is advised to normalize their input in the same way for best results. Use the provided normalization script, customizing it as needed.
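To illustrate the kind of preprocessing meant here, the sketch below shows a generic normalization step. It is not the normalization script provided with the model, and the specific choices (removing combining diacritics, lowercasing, collapsing whitespace) are assumptions made for illustration only; the actual script should be preferred. The function names are hypothetical.

```python
import re
import unicodedata

MASK = "[MASK]"


def _strip_marks(text: str) -> str:
    """Remove combining marks (Unicode category Mn), e.g. accents and titla."""
    # Decompose so that diacritics become separate combining characters.
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # Note: this also reduces letters such as й to и, since the breve is a combining mark.
    return unicodedata.normalize("NFC", kept)


def normalize(text: str) -> str:
    """Illustrative normalization that leaves [MASK] tokens untouched."""
    # Process the text around each mask separately so the mask is not lowercased.
    parts = [_strip_marks(part).lower() for part in text.split(MASK)]
    # Rejoin and collapse runs of whitespace into single spaces.
    return re.sub(r"\s+", " ", MASK.join(parts)).strip()
```

Splitting on the mask token before lowercasing keeps `[MASK]` intact, so the function can be applied to text that has already been masked.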

Model Card Authors

Nilo Pedrazzini

Model Card Contact

npedrazzini@turing.ac.uk

How to use the model

COMING SOON