---
license: apache-2.0
metrics:
- perplexity
pipeline_tag: fill-mask
language:
- orv
- cu
tags:
- roberta-based
- old church slavonic
- old east slavic
- old russian
- middle russian
- early slavic
widget:
- text: >-
    моли непрестанно о всѣхъ [MASK], честную память твою присно въ пѣснехъ почитающихъ
  example_title: Example 1
- text: >-
    да испишеть имѧна ваша. [MASK] возмуть мѣсѧчное свое съли слебное
  example_title: Example 2
---

# BERTislav

Baseline fill-mask model based on ruBERT and fine-tuned on a 10M-word corpus of mixed Old Church Slavonic, (Later) Church Slavonic, Old East Slavic, Middle Russian, and Medieval Serbian texts.

# Overview

- **Model Name:** BERTislav
- **Task:** Fill-mask
- **Base Model:** [ai-forever/ruBert-base](https://huggingface.co/ai-forever/ruBert-base)
- **Languages:** orv (Old East Slavic, Middle Russian), cu (Old Church Slavonic, Church Slavonic)
- **Developed by:** [Nilo Pedrazzini](https://huggingface.co/npedrazzini)

# Input Format

A string (`str`) in which each token to be predicted is replaced by `[MASK]`.

# Output Format

The predicted tokens for each `[MASK]` position, each with its confidence score.

# Examples

### Example 1:

COMING SOON

# Uses

The model can serve as a baseline for further fine-tuning on specific downstream tasks (e.g. linguistic annotation).
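
As a rough illustration of the fine-tuning use case, the sketch below loads the checkpoint with a token-classification head via the `transformers` library. It assumes the repository id `npedrazzini/BERTislav` (taken from the normalization-script link in this card). The label set is hypothetical and stands in for whatever annotation scheme the downstream task uses. The classification head is freshly initialized, so the model still needs training on labelled data before its predictions mean anything.

```python
# Hypothetical label set for illustration only; replace with your own scheme.
EXAMPLE_LABELS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "X"]


def build_label_maps(labels):
    """Create the id<->label dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: idx for idx, label in id2label.items()}
    return id2label, label2id


def load_for_token_classification(labels, model_id="npedrazzini/BERTislav"):
    """Load tokenizer and encoder, adding an untrained classification head."""
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = build_label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_token_classification(EXAMPLE_LABELS)` downloads the weights on first use. The returned objects can then be passed to a standard training loop or to the `Trainer` API.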

# Bias, Risks, and Limitations

The model should only be considered a baseline, and should **not** be evaluated on its own.
Further testing is needed to establish whether it improves the performance of language models fine-tuned for specific tasks.

# Training Details

The texts used as training data are from the following sources:

- [Fundamental Digital Library Russian Literature & Folklore](https://feb-web.ru/indexen.htm) (FEB-web)
- Puškinskij Dom's [*Библиотека литературы Древней Руси*](http://lib.pushkinskijdom.ru/Default.aspx?tabid=2070)
- [Cyrillomethodiana](https://histdict.uni-sofia.bg/)
- Parts of the Bdinski Sbornik, as digitized in [Obdurodon](http://bdinski.obdurodon.org/)
- [Tromsø Old Russian and Old Church Slavonic Treebank](https://torottreebank.github.io/) (TOROT)

**NB: The training texts were heavily normalized, and anyone planning to use the model is advised to normalize their input in the same way for best results.
Use the [provided normalization script](https://huggingface.co/npedrazzini/BERTislav/blob/main/normalize.py), customizing it as needed.**
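
One possible way to obtain the script programmatically is sketched below. It assumes the `huggingface_hub` package is installed and that `normalize.py` can be imported as a module; neither is stated in this card. Because the card does not document which functions the script defines, the sketch only fetches and loads the file and leaves inspection to the reader. Importing a file by path executes its top-level code, so it is worth reading the script first.

```python
import importlib.util


def download_normalizer(repo_id="npedrazzini/BERTislav", filename="normalize.py"):
    """Fetch the script from the Hub and return its local cache path."""
    # Imported lazily so load_module_from_path works without huggingface_hub.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(repo_id=repo_id, filename=filename)


def load_module_from_path(path, name="bertislav_normalize"):
    """Import a Python file by path (this runs the file's top-level code)."""
    spec = importlib.util.spec_from_file_location(name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module
```

A typical call would be `normalizer = load_module_from_path(download_normalizer())`, followed by `dir(normalizer)` to see which names the script actually exposes.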

# Model Card Authors

Nilo Pedrazzini

# Model Card Contact

npedrazzini@turing.ac.uk

# How to use the model
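
A minimal sketch of querying the model through the `transformers` fill-mask pipeline follows. It assumes the repository id `npedrazzini/BERTislav` (inferred from the normalization-script link in this card) and an input containing exactly one `[MASK]`. For inputs with several masks the pipeline returns one list per mask, which the helper below does not handle. Input text should be normalized beforehand, as advised under Training Details.

```python
def load_fill_mask(model_id="npedrazzini/BERTislav"):
    """Build a fill-mask pipeline (downloads the model on first call)."""
    # Imported lazily so top_predictions works without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=model_id)


def top_predictions(results, k=3):
    """Return the k best (token, score) pairs for a single-[MASK] input."""
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    return [(r["token_str"], round(r["score"], 4)) for r in ranked[:k]]


def demo():
    """Run the first widget example from this card and print the candidates."""
    unmasker = load_fill_mask()
    text = (
        "моли непрестанно о всѣхъ [MASK], "
        "честную память твою присно въ пѣснехъ почитающихъ"
    )
    for token, score in top_predictions(unmasker(text)):
        print(f"{token}\t{score}")
```

Calling `demo()` prints the three highest-scoring candidates for the masked position. Each raw pipeline result is a dictionary that also carries the token id (`token`) and the completed sentence (`sequence`).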