---
language:
  - hr
  - bs
  - sr
  - cnr
  - hbs
tags:
  - fill-mask
license: apache-2.0
---

BERTić [bert-ich] /bɜrtitʃ/ - A BERT model for Bosnian, Croatian, Montenegrin and Serbian

This ELECTRA model was trained on more than 6 billion tokens of Bosnian, Croatian, Montenegrin and Serbian text.

Compared with multilingual BERT (mBERT) and CroSloEngual BERT (cseBERT) on part-of-speech tagging, named entity recognition, geolocation prediction and choice of plausible alternatives, this model achieves the best results on every evaluation except the SETimes.SR dataset, as the tables below show.

Part-of-speech tagging

The evaluation metric is (seqeval) micro-F1. Reported values are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

| Dataset | Language | Variety | CLASSLA | mBERT | cseBERT | BERTić |
|---|---|---|---|---|---|---|
| hr500k | Croatian | standard | 93.87 | 94.60 | 95.74 | **95.81**\*\*\* |
| reldi-hr | Croatian | internet non-standard | - | 88.87 | 91.63 | **92.28**\*\*\* |
| SETimes.SR | Serbian | standard | 95.00 | 95.50 | **96.41** | 96.31 |
| reldi-sr | Serbian | internet non-standard | - | 91.26 | 93.54 | **93.90**\*\*\* |
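
The star notation used in these tables can be sketched as a simple threshold lookup on the p-value of the t-test. This is only an illustration: the function name `significance_stars` is hypothetical, and it assumes that four stars denote p<=0.0001, as in the named entity recognition table.

```python
def significance_stars(p_value: float) -> str:
    """Map a two-tailed p-value to the star notation used in the tables."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    # Thresholds ordered from strictest to loosest; the first match wins.
    # Assumption: four stars mark p <= 0.0001.
    thresholds = [(0.0001, "****"), (0.001, "***"), (0.01, "**"), (0.05, "*")]
    for threshold, stars in thresholds:
        if p_value <= threshold:
            return stars
    return ""  # not significant at the 0.05 level
```

For example, a p-value of 0.0005 would be reported with three stars, while 0.2 would get no marker.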

Named entity recognition

The evaluation metric is (seqeval) micro-F1. Reported values are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

| Dataset | Language | Variety | CLASSLA | mBERT | cseBERT | BERTić |
|---|---|---|---|---|---|---|
| hr500k | Croatian | standard | 80.13 | 85.67 | 88.98 | **89.21**\*\*\*\* |
| reldi-hr | Croatian | internet non-standard | - | 76.06 | 81.38 | **83.05**\*\*\*\* |
| SETimes.SR | Serbian | standard | 84.64 | **92.41** | 92.28 | 92.02 |
| reldi-sr | Serbian | internet non-standard | - | 81.29 | 82.76 | **87.92**\*\*\*\* |
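
To make the metric concrete, the following is a minimal standard-library sketch of span-level micro-F1 over BIO-tagged sequences. It is not the seqeval implementation used for the reported numbers; the function names and the lenient handling of stray `I-` tags are assumptions of this sketch.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int, int, str]  # (sentence index, start, end exclusive, label)


def extract_spans(sentences: List[List[str]]) -> Set[Span]:
    """Collect labelled spans from BIO-tagged sentences."""
    spans: Set[Span] = set()
    for s_idx, tags in enumerate(sentences):
        start: Optional[int] = None
        label: Optional[str] = None
        # A trailing "O" sentinel closes any span still open at sentence end.
        for i, tag in enumerate(tags + ["O"]):
            prefix, _, tag_label = tag.partition("-")
            continues = prefix == "I" and tag_label == label
            if start is not None and not continues:
                spans.add((s_idx, start, i, label))
                start, label = None, None
            # Assumption: an "I-" tag with no open span starts a new span.
            if prefix == "B" or (prefix == "I" and start is None):
                start, label = i, tag_label
    return spans


def micro_f1(gold: List[List[str]], pred: List[List[str]]) -> float:
    """Micro-averaged F1: counts are pooled over all labels before scoring."""
    gold_spans = extract_spans(gold)
    pred_spans = extract_spans(pred)
    true_pos = len(gold_spans & pred_spans)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred_spans)
    recall = true_pos / len(gold_spans)
    return 2 * precision * recall / (precision + recall)
```

A predicted span counts as correct only if its boundaries and its label both match a gold span exactly, which is why a single mislabelled entity lowers precision and recall at the same time.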

Geolocation prediction

The evaluation metrics are the median and mean distance between gold and predicted geolocations (lower is better). No statistical significance is computed due to the large test set (39,723 instances). The centroid baseline predicts that each text was created at the centroid of the training dataset.

| System | Median | Mean |
|---|---|---|
| centroid | 107.10 | 145.72 |
| mBERT | 42.25 | 82.05 |
| cseBERT | 40.76 | 81.88 |
| BERTić | 37.96 | 79.30 |
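
The two metrics and the centroid baseline could be computed roughly as in the sketch below. The distance function is not specified above, so the great-circle (haversine) distance in kilometres, the Earth radius constant, and all function names are assumptions made for illustration.

```python
import math
from statistics import mean, median
from typing import List, Tuple

Coord = Tuple[float, float]  # (latitude, longitude) in degrees

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(a: Coord, b: Coord) -> float:
    """Great-circle distance between two coordinates in kilometres."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    h = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(h)))


def distance_scores(gold: List[Coord], pred: List[Coord]) -> Tuple[float, float]:
    """Return (median, mean) of the per-instance distances."""
    distances = [haversine_km(g, p) for g, p in zip(gold, pred)]
    return median(distances), mean(distances)


def centroid(train: List[Coord]) -> Coord:
    """Arithmetic mean of the training coordinates (adequate for a small region)."""
    lats = [lat for lat, _ in train]
    lons = [lon for _, lon in train]
    return mean(lats), mean(lons)
```

The centroid baseline then amounts to calling `distance_scores(gold, [centroid(train)] * len(gold))`, that is, predicting the same point for every test instance.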

Choice Of Plausible Alternatives (translation to Croatian)

The evaluation metric is accuracy. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (* p<=0.05, ** p<=0.01, *** p<=0.001, **** p<=0.0001).

| System | Accuracy |
|---|---|
| random | 50.00 |
| mBERT | 54.12 |
| cseBERT | 61.80 |
| BERTić | **65.76**\*\* |