---
license: cc-by-sa-4.0
language:
- hr
- sl
- bs
- sr
datasets:
- classla/xlm-r-bertic-data
---

# XLM-R-SloBERTić

This model was produced by pre-training [XLM-Roberta-large](https://huggingface.co/xlm-roberta-large) for 48k steps on South Slavic languages using the [XLM-R-BERTić dataset](https://huggingface.co/datasets/classla/xlm-r-bertic-data).

# Benchmarking

Three tasks were chosen for model evaluation:

* Named Entity Recognition (NER)
* Sentiment regression
* COPA (Choice of Plausible Alternatives)

In all cases, this model was fine-tuned for the specific downstream task.

## NER

Mean F1 scores were used to evaluate performance. Datasets used: [hr500k](https://huggingface.co/datasets/classla/hr500k), [ReLDI-sr](https://huggingface.co/datasets/classla/reldi_sr), [ReLDI-hr](https://huggingface.co/datasets/classla/reldi_hr), and [SETimes.SR](https://huggingface.co/datasets/classla/setimes_sr).

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:--------|---------:|
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | hr500k | 0.927 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | hr500k | 0.925 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | hr500k | 0.923 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | hr500k | 0.919 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | hr500k | 0.918 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | hr500k | 0.903 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | ReLDI-hr | 0.812 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | ReLDI-hr | 0.809 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-hr | 0.794 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-hr | 0.792 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ReLDI-hr | 0.791 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ReLDI-hr | 0.763 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:-----------|---------:|
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | SETimes.SR | 0.949 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | SETimes.SR | 0.940 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | SETimes.SR | 0.936 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | SETimes.SR | 0.933 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | SETimes.SR | 0.922 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | SETimes.SR | 0.914 |

| system | dataset | F1 score |
|:-----------------------------------------------------------------------|:---------|---------:|
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | ReLDI-sr | 0.841 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | ReLDI-sr | 0.824 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ReLDI-sr | 0.798 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ReLDI-sr | 0.774 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ReLDI-sr | 0.751 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ReLDI-sr | 0.734 |

## Sentiment regression

The [ParlaSent dataset](https://huggingface.co/datasets/classla/ParlaSent) was used to evaluate sentiment regression for Bosnian, Croatian, and Serbian. The procedure is explained in greater detail in the dedicated [benchmarking repository](https://github.com/clarinsi/benchich/tree/main/sentiment).
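The regression models are compared by the coefficient of determination (r^2), with a "dummy" baseline that always predicts the mean of the training targets; such a baseline scores at or below zero on a held-out test set whose mean differs from the training mean. A minimal sketch of the metric and the dummy baseline (the numbers are made up for illustration, not actual ParlaSent data):

```python
# Sketch of the r^2 metric used for sentiment regression, plus the
# "dummy (mean)" baseline. All numbers here are illustrative only.

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Hypothetical gold sentiment scores for train and test splits.
train_y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
test_y = [1.0, 2.0, 4.0, 5.0]

# Dummy baseline: always predict the training-set mean. Because the
# test-set mean differs from the train-set mean, r^2 drops below zero.
train_mean = sum(train_y) / len(train_y)
dummy_pred = [train_mean] * len(test_y)
print(round(r_squared(test_y, dummy_pred), 3))  # negative, as in the table
```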
| system | train | test | r^2 |
|:-----------------------------------------------------------------------|:--------------------|:-------------------------|------:|
| [xlm-r-parlasent](https://huggingface.co/classla/xlm-r-parlasent) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.615 |
| [BERTić](https://huggingface.co/classla/bcms-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.612 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.607 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.605 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.601 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.537 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | 0.500 |
| dummy (mean) | ParlaSent_BCS.jsonl | ParlaSent_BCS_test.jsonl | -0.12 |

## COPA

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-SR | 0.689 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | Copa-SR | 0.665 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | Copa-SR | 0.637 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-SR | 0.607 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | Copa-SR | 0.573 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | Copa-SR | 0.570 |

| system | dataset | Accuracy score |
|:-----------------------------------------------------------------------|:--------|---------------:|
| [BERTić](https://huggingface.co/classla/bcms-bertic) | Copa-HR | 0.669 |
| [crosloengual-bert](https://huggingface.co/EMBEDDIA/crosloengual-bert) | Copa-HR | 0.669 |
| [XLM-R-BERTić](https://huggingface.co/classla/xlm-r-bertic) | Copa-HR | 0.635 |
| [**XLM-R-SloBERTić**](https://huggingface.co/classla/xlm-r-slobertic) | Copa-HR | 0.628 |
| [XLM-Roberta-Base](https://huggingface.co/xlm-roberta-base) | Copa-HR | 0.585 |
| [XLM-Roberta-Large](https://huggingface.co/xlm-roberta-large) | Copa-HR | 0.571 |

# Citation

The following paper has been submitted for review:

```
@misc{ljubesic2024language,
  author = "Ljube\v{s}i\'{c}, Nikola and Suchomel, Vit and Rupnik, Peter and Kuzman, Taja and van Noord, Rik",
  title = "Language Models on a Diet: Cost-Efficient Development of Encoders for Closely-Related Languages via Additional Pretraining",
  howpublished = "Submitted for review",
  year = "2024",
}
```
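For context on the COPA numbers: each item pairs a premise with two alternatives, the fine-tuned model picks the alternative it scores as more plausible, and accuracy is the fraction of items where the gold alternative is chosen. A minimal sketch of that evaluation loop, with a toy `plausibility` scorer standing in for the fine-tuned model (both the scorer and the examples are hypothetical, not from the COPA data):

```python
# Sketch of COPA-style evaluation: pick the higher-scoring of two
# alternatives, then measure accuracy against the gold choices.
# `plausibility` is a toy stand-in for a fine-tuned model's scorer.

def plausibility(premise: str, alternative: str) -> float:
    # Toy heuristic: count words shared with the premise. A real system
    # would score the (premise, alternative) pair with the encoder.
    shared = set(premise.lower().split()) & set(alternative.lower().split())
    return float(len(shared))

def copa_accuracy(examples):
    correct = 0
    for premise, alt1, alt2, gold in examples:
        scores = [plausibility(premise, alt1), plausibility(premise, alt2)]
        predicted = scores.index(max(scores))  # 0 or 1
        correct += predicted == gold
    return correct / len(examples)

# Hypothetical English stand-ins for COPA items (gold 0 = first alternative).
examples = [
    ("the man lost the race", "the man was slow in the race", "the man ate lunch", 0),
    ("the glass fell", "the glass broke on the floor", "the sun rose", 0),
]
print(copa_accuracy(examples))  # → 1.0
```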