metadata
license: cc-by-sa-4.0
language:
- hr
- sl
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
XLM-R-SloBERTić
This model was produced by pre-training XLM-Roberta-large for a further 48k steps on South Slavic languages, using the XLM-R-BERTić dataset.
Benchmarking
Three tasks were chosen for model evaluation:
- Named Entity Recognition (NER)
- Sentiment regression
- COPA (Choice of Plausible Alternatives)
In all cases, this model was fine-tuned for the specific downstream task.
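As a starting point for such fine-tuning, the checkpoint can be loaded with a task-specific head through the Hugging Face `transformers` library. The snippet below is a minimal sketch for the NER case, not the setup used for the benchmarks reported here. The hub ID `classla/xlm-r-slobertic` is an assumption inferred from the dataset namespace, and the tag set is a hypothetical placeholder; both should be checked against the actual model page and the corpus being used.

```python
# Assumption: hub ID inferred from the dataset namespace; verify before use.
MODEL_ID = "classla/xlm-r-slobertic"

# Hypothetical BIO tag set for illustration; replace with the tags of your corpus.
NER_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def label_maps(labels):
    """Build the id<->label dictionaries stored in the model config."""
    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    return id2label, label2id


def load_for_ner(model_id=MODEL_ID, labels=NER_LABELS):
    """Load the tokenizer and the encoder with a new token-classification head."""
    # Imported lazily so label_maps() works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label, label2id = label_maps(labels)
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_ner()` downloads the weights and returns a model whose classification head is randomly initialised, so it still has to be trained on labelled data before it produces useful predictions.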
NER
Mean F1 scores were used to evaluate performance. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | hr500k | 0.927 |
BERTić | hr500k | 0.925 |
XLM-R-SloBERTić | hr500k | 0.923 |
XLM-Roberta-Large | hr500k | 0.919 |
crosloengual-bert | hr500k | 0.918 |
XLM-Roberta-Base | hr500k | 0.903 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | ReLDI-hr | 0.812 |
XLM-R-BERTić | ReLDI-hr | 0.809 |
crosloengual-bert | ReLDI-hr | 0.794 |
BERTić | ReLDI-hr | 0.792 |
XLM-Roberta-Large | ReLDI-hr | 0.791 |
XLM-Roberta-Base | ReLDI-hr | 0.763 |
system | dataset | F1 score |
---|---|---|
XLM-R-SloBERTić | SETimes.SR | 0.949 |
XLM-R-BERTić | SETimes.SR | 0.940 |
BERTić | SETimes.SR | 0.936 |
XLM-Roberta-Large | SETimes.SR | 0.933 |
crosloengual-bert | SETimes.SR | 0.922 |
XLM-Roberta-Base | SETimes.SR | 0.914 |
system | dataset | F1 score |
---|---|---|
XLM-R-BERTić | ReLDI-sr | 0.841 |
XLM-R-SloBERTić | ReLDI-sr | 0.824 |
BERTić | ReLDI-sr | 0.798 |
XLM-Roberta-Large | ReLDI-sr | 0.774 |
crosloengual-bert | ReLDI-sr | 0.751 |
XLM-Roberta-Base | ReLDI-sr | 0.734 |
Sentiment regression
The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated benchmarking repository.
system | train | test | R² |
---|---|---|---|
xlm-r-parlasent | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
XLM-R-SloBERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
XLM-Roberta-Large | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
XLM-R-BERTić | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
crosloengual-bert | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
XLM-Roberta-Base | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |
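The negative score of the dummy baseline follows from how the coefficient of determination is defined: it compares the squared error of the predictions with the squared error of the test-set mean. The sketch below assumes that the dummy system predicts the mean of the training labels for every test instance (the table does not state this explicitly) and uses toy numbers rather than ParlaSent data.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy labels, not taken from ParlaSent.
train_labels = [1.0, 2.0, 3.0]
test_labels = [2.0, 3.0, 4.0]

# Assumed dummy baseline: always predict the mean of the training labels.
train_mean = sum(train_labels) / len(train_labels)
dummy_predictions = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_predictions)
perfect_r2 = r2_score(test_labels, test_labels)
```

Because the training mean differs from the test mean, the constant prediction does worse than the test mean itself, and the score drops below zero. A constant predictor can reach at most 0, and only when it matches the test mean exactly.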
COPA
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-SR | 0.689 |
XLM-R-SloBERTić | Copa-SR | 0.665 |
XLM-R-BERTić | Copa-SR | 0.637 |
crosloengual-bert | Copa-SR | 0.607 |
XLM-Roberta-Base | Copa-SR | 0.573 |
XLM-Roberta-Large | Copa-SR | 0.570 |
system | dataset | Accuracy score |
---|---|---|
BERTić | Copa-HR | 0.669 |
crosloengual-bert | Copa-HR | 0.669 |
XLM-R-BERTić | Copa-HR | 0.635 |
XLM-R-SloBERTić | Copa-HR | 0.628 |
XLM-Roberta-Base | Copa-HR | 0.585 |
XLM-Roberta-Large | Copa-HR | 0.571 |
Citation
The following paper has been submitted for review:
@misc{ljubesic2024language,
author = "Ljube\v{s}i\'{c}, Nikola and Suchomel, Vit and Rupnik, Peter and Kuzman, Taja and van Noord, Rik",
title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
howpublished = "Submitted for review",
year = "2024",
}