nljubesi committed
Commit a49c6fa
1 Parent(s): c4cd688

Update README.md

Files changed (1)
  1. README.md +20 -4
README.md CHANGED
@@ -17,9 +17,25 @@ This Electra model was trained on more than 8 billion tokens of Bosnian, Croatia

***new*** We have published a version of this model fine-tuned on the named entity recognition task ([bcms-bertic-ner](https://huggingface.co/CLASSLA/bcms-bertic-ner)).

+ If you use the model, please cite the following paper:
+
+ ```
+ @inproceedings{ljubesic-lauc-2021-bertic,
+     title = "{BERTić} - The Transformer Language Model for {B}osnian, {C}roatian, {M}ontenegrin and {S}erbian",
+     author = "Ljube{\v{s}}i{\'c}, Nikola and
+       Lauc, Davor",
+     booktitle = "Proceedings of the 8th Workshop on Balto-Slavic Natural Language Processing",
+     year = "2021",
+     address = "Kiev, Ukraine",
+     publisher = "Association for Computational Linguistics"
+ }
+ ```
+
+ ## Benchmarking
+
Comparing this model to [multilingual BERT](https://huggingface.co/bert-base-multilingual-cased) and [CroSloEngual BERT](https://huggingface.co/EMBEDDIA/crosloengual-bert) on the tasks of (1) part-of-speech tagging, (2) named entity recognition, (3) geolocation prediction, and (4) commonsense causal reasoning shows the BERTić model to be superior to the other two.

- ## Part-of-speech tagging
+ ### Part-of-speech tagging

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (&ast; p<=0.05, &ast;&ast; p<=0.01, &ast;&ast;&ast; p<=0.001, &ast;&ast;&ast;&ast; p<=0.0001).

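For context, here is a minimal sketch of how the (seqeval) microF1 referenced in the evaluation paragraphs is computed. The tag sequences are invented toy data, not taken from the benchmark:

```
# Minimal sketch of seqeval micro-F1; toy BIO tags, not benchmark data.
from seqeval.metrics import f1_score

# Gold and predicted label sequences for two toy sentences.
y_true = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "B-ORG"]]
y_pred = [["B-PER", "I-PER", "O", "B-LOC"], ["O", "O"]]

# seqeval scores whole entity spans rather than single tokens,
# and micro-averages over all spans by default.
print(f1_score(y_true, y_pred))  # 0.8: both predicted spans correct, 2 of 3 gold spans found
```
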
@@ -30,7 +46,7 @@ reldi-hr | Croatian | internet non-standard | - | 88.87 | 91.63 | **92.28&ast;&a
SETimes.SR | Serbian | standard | 95.00 | 95.50 | **96.41** | 96.31
reldi-sr | Serbian | internet non-standard | - | 91.26 | 93.54 | **93.90&ast;&ast;&ast;**

- ## Named entity recognition
+ ### Named entity recognition

Evaluation metric is (seqeval) microF1. Reported are means of five runs. Best results are presented in bold. Statistical significance is calculated between the two best-performing systems via a two-tailed t-test (&ast; p<=0.05, &ast;&ast; p<=0.01, &ast;&ast;&ast; p<=0.001, &ast;&ast;&ast;&ast; p<=0.0001).

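As a usage sketch (not an excerpt from the model card), the fine-tuned bcms-bertic-ner model mentioned above should be loadable through the standard `transformers` token-classification pipeline, assuming a recent `transformers` version; the example sentence is ours:

```
# Sketch: running the fine-tuned NER model via the transformers pipeline.
# The example sentence is invented; output depends on the model version.
from transformers import pipeline

ner = pipeline(
    "ner",
    model="CLASSLA/bcms-bertic-ner",
    aggregation_strategy="simple",  # merge word pieces into entity spans
)

for entity in ner("Nikola Tesla je rođen u Smiljanu."):
    print(entity["entity_group"], entity["word"], round(entity["score"], 3))
```
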
@@ -42,7 +58,7 @@ SETimes.SR | Serbian | standard | 84.64 | **92.41** | 92.28 | 92.02
reldi-sr | Serbian | internet non-standard | - | 81.29 | 82.76 | **87.92&ast;&ast;&ast;&ast;**


- ## Geolocation prediction
+ ### Geolocation prediction

The dataset comes from the VarDial 2020 evaluation campaign's shared task on [Social Media variety Geolocation prediction](https://sites.google.com/view/vardial2020/evaluation-campaign). The task is to predict the latitude and longitude of a tweet given its text.

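The geolocation scores in the next hunk are presumably median and mean distance in kilometres between gold and predicted tweet locations (lower is better). Such distances are typically computed with the haversine formula; a sketch with invented coordinates:

```
# Sketch of the haversine great-circle distance (in km) such geolocation
# evaluations typically use; the coordinates below are invented.
from math import asin, cos, radians, sin, sqrt

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))  # 6371 km: mean Earth radius

# Distance between a gold location (Zagreb) and a prediction (Belgrade).
print(round(haversine_km(45.815, 15.982, 44.787, 20.457), 1))  # ~368 km
```
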
@@ -55,7 +71,7 @@ mBERT | 42.25 | 82.05
cseBERT | 40.76 | 81.88
BERTić | **37.96** | **79.30**

- ## Choice Of Plausible Alternatives
+ ### Choice Of Plausible Alternatives

The dataset is a translation of the [COPA dataset](https://people.ict.usc.edu/~gordon/copa.html) into Croatian (to-be-released).

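For context, COPA casts commonsense causal reasoning as picking the more plausible of two alternatives given a premise. A hypothetical instance in the usual COPA format, shown in English for clarity (the benchmark uses the Croatian translation):

```
# Hypothetical COPA-style instance (English for clarity; the benchmark
# uses the Croatian translation). The task: pick the alternative that
# stands in the asked-for causal relation to the premise.
instance = {
    "premise": "The man broke his toe.",
    "question": "cause",  # ask for the cause of the premise
    "choice1": "He got a hole in his sock.",
    "choice2": "He dropped a hammer on his foot.",
    "label": 2,  # the second alternative is the more plausible cause
}
```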