nljubesi committed
Commit a49c6fa
1 Parent(s): c4cd688

Update README.md

Files changed (1)
  1. README.md +20 -4
README.md CHANGED
@@ -17,9 +17,25 @@ This Electra model was trained on more than 8 billion tokens of Bosnian, Croatia

***new*** We have published a version of this model fine-tuned on the named entity recognition task ([bcms-bertic-ner](https://huggingface.co/CLASSLA/bcms-bertic-ner)).

+ If you use the model, please cite the following paper:
+
+ ```
+ @inproceedings{ljubesic-lauc-2021-bertic,
+     title = "{BERTić} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
+     author = "Ljube{\v{s}}i{\'c}, Nikola and
+       Lauc, Davor",
+     booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
+     year = "2021",
+     address = "Kiev, Ukraine",
+     publisher = "Association for Computational Linguistics"
+ }
+ ```
+
+ ## Benchmarking
+
Comparing this model to [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) and [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert) on the tasks of (1) part-of-speech tagging, (2) named entity recognition, (3) geolocation prediction, and (4) commonsense causal reasoning shows the BERTić model to be superior to the other two.

- ## Part-of-speech tagging
+ ### Part-of-speech tagging

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (&ast; p<=0.05, &ast;&ast; p<=0.01, &ast;&ast;&ast; p<=0.001, &ast;&ast;&ast;&ast; p<=0.0001).

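For context, here is a minimal sketch of how the (seqeval) microF1 referenced in the evaluation paragraphs is computed. The tag sequences are invented toy data, not taken from the benchmark:

```
# Minimal sketch of seqeval micro-F1; toy BIO tags, not benchmark data.
from seqeval.metrics import f1_score

# Gold and predicted label sequences for two toy sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "O"]]

# seqeval scores whole entity spans rather than single tokens,
# and micro-averages over all spans by default.
print(f1_score(y_true, y_pred))  # 0.8: both predicted spans correct, 2 of 3 gold spans found
```
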
@@ -30,7 +46,7 @@ reldi-hr | Croatian | internet non-standard | - | 88.87 | 91.63 | **92.28&ast;&a
SETimes.SR | Serbian | standard | 95.00 | 95.50 | **96.41** | 96.31
reldi-sr | Serbian | internet non-standard | - | 91.26 | 93.54 | **93.90&ast;&ast;&ast;**

- ## Named entity recognition
+ ### Named entity recognition

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (&ast; p<=0.05, &ast;&ast; p<=0.01, &ast;&ast;&ast; p<=0.001, &ast;&ast;&ast;&ast; p<=0.0001).

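As a usage sketch (not an excerpt from the model card), the fine-tuned bcms-bertic-ner model mentioned above should be loadable through the standard `transformers` token-classification pipeline, assuming a recent `transformers` version; the example sentence is ours:

```
# Sketch: running the fine-tuned NER model via the transformers pipeline.
# The example sentence is invented; output depends on the model version.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="CLASSLA/bcms-bertic-ner",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

for entity in ner("Nikola Tesla je rođen u Smiljanu."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```
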
@@ -42,7 +58,7 @@ SETimes.SR | Serbian | standard | 84.64 | **92.41** | 92.28 | 92.02
reldi-sr | Serbian | internet non-standard | - | 81.29 | 82.76 | **87.92&ast;&ast;&ast;&ast;**


- ## Geolocation prediction
+ ### Geolocation prediction

The dataset comes from the VarDial 2020 evaluation campaign's shared task on [Social Media variety Geolocation prediction](https://sites.google.com/view/vardial2020/evaluation-campaign). The task is to predict the latitude and longitude of a tweet given its text.

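The geolocation scores in the next hunk are presumably median and mean distance in kilometres between gold and predicted tweet locations (lower is better). Such distances are typically computed with the haversine formula; a sketch with invented coordinates:

```
# Sketch of the haversine great-circle distance (in km) such geolocation
# evaluations typically use; the coordinates below are invented.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Distance between a gold location (Zagreb) and a prediction (Belgrade).
print(round(haversine_km(45.815, 15.982, 44.787, 20.457), 1))  # ~368 km
```
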
@@ -55,7 +71,7 @@ mBERT | 42.25 | 82.05
cseBERT | 40.76 | 81.88
BERTić | **37.96** | **79.30**

- ## Choice Of Plausible Alternatives
+ ### Choice Of Plausible Alternatives

The dataset is a translation of the [COPA dataset](https://people.ict.usc.edu/~gordon/copa.html) into Croatian (to-be-released).

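For context, COPA casts commonsense causal reasoning as picking the more plausible of two alternatives given a premise. A hypothetical instance in the usual COPA format, shown in English for clarity (the benchmark uses the Croatian translation):

```
# Hypothetical COPA-style instance (English for clarity; the benchmark
# uses the Croatian translation). The task: pick the alternative that
# stands in the asked-for causal relation to the premise.
instance = {
    "premise": "The man broke his toe.",
    "question": "cause",  # ask for the cause of the premise
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "label": 2,  # the second alternative is the more plausible cause
}
```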