
XLM-R-BERTić

This model was produced by additionally pre-training XLM-Roberta-large for 48k steps on South Slavic languages, using the XLM-R-BERTić dataset.

Benchmarking

Three tasks were chosen for model evaluation:

  • Named Entity Recognition (NER)
  • Sentiment regression
  • Choice of Plausible Alternatives (COPA)

In all cases, the model was fine-tuned for the specific downstream task before evaluation.
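
As a rough illustration of what fine-tuning for a downstream task starts from, the sketch below loads the checkpoint with a fresh, untrained token-classification head using the Hugging Face transformers library. The Hub identifier `classla/xlm-r-bertic`, the helper name `load_for_ner`, and the label list are assumptions made for this example; they are not taken from the benchmarking setup.

```python
# Assumed Hub identifier for this model; adjust if the checkpoint lives elsewhere.
MODEL_ID = "classla/xlm-r-bertic"

# Hypothetical label set for illustration only; use the tag set of your dataset.
NER_LABELS = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]


def load_for_ner(model_id=MODEL_ID, labels=NER_LABELS):
    """Load the tokenizer and the encoder with an untrained token-classification head."""
    # Imported lazily so this module can be read and imported without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    id2label = dict(enumerate(labels))
    label2id = {label: i for i, label in id2label.items()}
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForTokenClassification.from_pretrained(
        model_id,
        num_labels=len(labels),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

Calling `load_for_ner()` downloads the weights and returns a model whose classification head still has to be trained on labelled data; the hyperparameters used for the scores reported here are not given in this card.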

NER

Mean F1 scores were used to evaluate performance. Datasets used: hr500k, ReLDI-sr, ReLDI-hr, and SETimes.SR.

system              dataset     F1 score
XLM-R-BERTić        hr500k      0.927
BERTić              hr500k      0.925
XLM-R-SloBERTić     hr500k      0.923
XLM-Roberta-Large   hr500k      0.919
crosloengual-bert   hr500k      0.918
XLM-Roberta-Base    hr500k      0.903

system              dataset     F1 score
XLM-R-SloBERTić     ReLDI-hr    0.812
XLM-R-BERTić        ReLDI-hr    0.809
crosloengual-bert   ReLDI-hr    0.794
BERTić              ReLDI-hr    0.792
XLM-Roberta-Large   ReLDI-hr    0.791
XLM-Roberta-Base    ReLDI-hr    0.763

system              dataset     F1 score
XLM-R-SloBERTić     SETimes.SR  0.949
XLM-R-BERTić        SETimes.SR  0.940
BERTić              SETimes.SR  0.936
XLM-Roberta-Large   SETimes.SR  0.933
crosloengual-bert   SETimes.SR  0.922
XLM-Roberta-Base    SETimes.SR  0.914

system              dataset     F1 score
XLM-R-BERTić        ReLDI-sr    0.841
XLM-R-SloBERTić     ReLDI-sr    0.824
BERTić              ReLDI-sr    0.798
XLM-Roberta-Large   ReLDI-sr    0.774
crosloengual-bert   ReLDI-sr    0.751
XLM-Roberta-Base    ReLDI-sr    0.734
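
The card does not spell out what the mean is taken over. As a minimal sketch, assuming it refers to averaging the F1 score over several fine-tuning runs, the computation could look like the following. The function names and the counts in the usage note are made up for illustration.

```python
from statistics import mean


def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_f1(run_counts):
    """Average the F1 score over runs, each given as a (tp, fp, fn) tuple."""
    return mean(f1_score(tp, fp, fn) for tp, fp, fn in run_counts)
```

For example, `mean_f1([(8, 2, 2), (9, 1, 1)])` averages the F1 scores of two hypothetical runs.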

Sentiment regression

The ParlaSent dataset was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated benchmarking repository.

system              train                test                      r^2
xlm-r-parlasent     ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.615
BERTić              ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.612
XLM-R-SloBERTić     ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.607
XLM-Roberta-Large   ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.605
XLM-R-BERTić        ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.601
crosloengual-bert   ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.537
XLM-Roberta-Base    ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  0.500
dummy (mean)        ParlaSent_BCS.jsonl  ParlaSent_BCS_test.jsonl  -0.12
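
A negative r^2 for the dummy baseline can look surprising. The sketch below shows how it can arise, assuming the baseline predicts the mean of the training labels for every test instance (the card does not state this explicitly). With r^2 defined as 1 - SS_res / SS_tot, any constant prediction other than the test mean scores below zero. All label values here are made up.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up labels; the train and test means differ (2.5 vs 3.5).
train_labels = [1.0, 2.0, 3.0, 4.0]
test_labels = [2.0, 3.0, 4.0, 5.0]

# Dummy baseline: predict the training mean for every test instance.
train_mean = sum(train_labels) / len(train_labels)
dummy_preds = [train_mean] * len(test_labels)

dummy_r2 = r2_score(test_labels, dummy_preds)  # negative, since 2.5 != test mean
```

The further the training mean lies from the test mean, the more negative the dummy score becomes, so a small negative value suggests the two splits have similar label distributions.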

COPA

system              dataset   Accuracy score
BERTić              Copa-SR   0.689
XLM-R-SloBERTić     Copa-SR   0.665
XLM-R-BERTić        Copa-SR   0.637
crosloengual-bert   Copa-SR   0.607
XLM-Roberta-Base    Copa-SR   0.573
XLM-Roberta-Large   Copa-SR   0.570

system              dataset   Accuracy score
BERTić              Copa-HR   0.669
XLM-R-SloBERTić     Copa-HR   0.628
XLM-R-BERTić        Copa-HR   0.635
crosloengual-bert   Copa-HR   0.669
XLM-Roberta-Base    Copa-HR   0.585
XLM-Roberta-Large   Copa-HR   0.571
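
To put the accuracy figures in context: each COPA item offers two alternatives, so random guessing yields about 0.5. The sketch below shows one common way to score the task, assuming a fine-tuned model assigns a plausibility score to each alternative and the higher one is chosen. The function names and the scores are hypothetical and do not come from the benchmarking code.

```python
def copa_predict(score_first, score_second):
    """Return the index (0 or 1) of the alternative with the higher score."""
    return 0 if score_first >= score_second else 1


def accuracy(predictions, gold):
    """Fraction of items where the predicted alternative matches the gold one."""
    correct = sum(p == g for p, g in zip(predictions, gold))
    return correct / len(gold)


# Made-up plausibility scores for three items, as (first, second) pairs.
scores = [(0.9, 0.1), (0.3, 0.7), (0.6, 0.4)]
gold = [0, 1, 1]

predictions = [copa_predict(a, b) for a, b in scores]
copa_accuracy = accuracy(predictions, gold)  # two of three items correct
```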

Citation

Please cite the following paper:

@article{Ljubešić_Suchomel_Rupnik_Kuzman_van_Noord_2024,
title={Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining},
url={http://arxiv.org/abs/2404.05428},
DOI={10.48550/arXiv.2404.05428},
abstractNote={The world of language models is going through turbulent times, better and ever larger models are coming out at an unprecedented speed. However, we argue that, especially for the scientific community, encoder models of up to 1 billion parameters are still very much needed, their primary usage being in enriching large collections of data with metadata necessary for downstream research. We investigate the best way to ensure the existence of such encoder models on the set of very closely related languages - Croatian, Serbian, Bosnian and Montenegrin, by setting up a diverse benchmark for these languages, and comparing the trained-from-scratch models with the new models constructed via additional pretraining of existing multilingual models. We show that comparable performance to dedicated from-scratch models can be obtained by additionally pretraining available multilingual models even with a limited amount of computation. We also show that neighboring languages, in our case Slovenian, can be included in the additional pretraining with little to no loss in the performance of the final model.},
note={arXiv:2404.05428 [cs]},
 number={arXiv:2404.05428},
publisher={arXiv},
author={Ljubešić, Nikola and Suchomel, Vít and Rupnik, Peter and Kuzman, Taja and van Noord, Rik},
year={2024},
month=apr
}