---
license: cc0-1.0
language:
- bg
- mk
tags:
- BERTovski
---

# Model description

**BERTovski** is a large pre-trained language model trained on Bulgarian and Macedonian texts. It was trained from scratch using the RoBERTa architecture and was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

BERTovski was trained on 74GB of text, which is equal to just over 7 billion tokens. It was trained for 300,000 steps with a batch size of 2,048, which corresponds to approximately 30 epochs.

The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels). We aim to train this model for even longer, so keep an eye out for newer versions!

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/BERTovski")
model = AutoModel.from_pretrained("RVN/BERTovski")  # PyTorch
model = TFAutoModel.from_pretrained("RVN/BERTovski")  # TensorFlow
```
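
As a quick check that the model loads correctly, the snippet below (a minimal sketch; the Bulgarian example sentence is only an illustration) obtains contextual embeddings with the PyTorch model:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/BERTovski")
model = AutoModel.from_pretrained("RVN/BERTovski")  # PyTorch version

# Encode a short Bulgarian sentence ("Good morning!") and run a forward pass
inputs = tokenizer("Добро утро!", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Last hidden state has shape (batch_size, sequence_length, hidden_size)
print(outputs.last_hidden_state.shape)
```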

# Data

For training, we used all Bulgarian and Macedonian data that was present in the [MaCoCu](https://macocu.eu/), Oscar, mc4 and Wikipedia corpora. In a manual analysis we found that for Oscar and mc4, if the data did not come from the corresponding domain (.bg or .mk), it was often (badly) machine translated. Therefore, we opted to only use data that originally came from a .bg or .mk domain.
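
The filtering code itself is not included here; as a rough sketch (not our exact preprocessing script, and assuming each crawled document carries its source URL, with made-up example documents), domain-based filtering could look like this:

```python
from urllib.parse import urlparse

def from_bg_or_mk(doc: dict) -> bool:
    """Keep only documents whose source URL is on a .bg or .mk domain."""
    host = urlparse(doc["url"]).hostname or ""
    return host.endswith(".bg") or host.endswith(".mk")

# Hypothetical documents with their crawl URLs
docs = [
    {"url": "https://example.bg/novini", "text": "..."},
    {"url": "https://example.com/bg-translated", "text": "..."},
]
filtered = [d for d in docs if from_bg_or_mk(d)]  # keeps only the .bg document
```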

After de-duplicating the data, we were left with a total of 54.5 GB of Bulgarian and 9 GB of Macedonian text. Since there was quite a bit more Bulgarian data, we simply doubled the Macedonian data during training. We trained a shared vocabulary of 32,000 pieces on a subset of the data in which the Bulgarian/Macedonian split was 50/50.
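
As a rough illustration only (not our exact script; it assumes a RoBERTa-style byte-level BPE and uses placeholder file names), training such a shared 32,000-piece vocabulary with the `tokenizers` library looks roughly like this:

```python
from tokenizers import ByteLevelBPETokenizer

# Placeholder files: equal-sized Bulgarian and Macedonian samples, one sentence per line
files = ["bg_sample.txt", "mk_sample.txt"]

# RoBERTa-style byte-level BPE with a shared 32,000-piece vocabulary
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=files,
    vocab_size=32_000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
)
tokenizer.save_model("bertovski-tokenizer")
```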

# Benchmark performance

We tested the performance of BERTovski on XPOS, UPOS and NER benchmarks. For Bulgarian, we used the data from the [Universal Dependencies](http://nl.ijs.si/nikola/macocu/bertovski.tgz) project. For Macedonian, we used the data sets created in the [babushka-bench](https://github.com/clarinsi/babushka-bench/) project. We compare performance to the strong multilingual models XLM-R-base and XLM-R-large. For details regarding the fine-tuning procedure you can check out our [GitHub repo](https://github.com/macocu/LanguageModels).

Scores are averages of three runs. We use the same hyperparameter settings for all models.
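
The exact fine-tuning scripts and hyperparameters are in the GitHub repo linked above; purely as an illustration (placeholder label set and illustrative hyperparameters, not the settings we actually used), token-classification fine-tuning of this checkpoint with the `transformers` Trainer looks roughly like this:

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Placeholder label set; the real UPOS/XPOS/NER labels come from the benchmark data
label_list = ["O", "B-PER", "I-PER", "B-LOC", "I-LOC", "B-ORG", "I-ORG"]

tokenizer = AutoTokenizer.from_pretrained("RVN/BERTovski")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/BERTovski", num_labels=len(label_list)
)

# Illustrative hyperparameters only
args = TrainingArguments(
    output_dir="bertovski-ner",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=10,
)

# With tokenized train/dev splits of the benchmark data:
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_data, eval_dataset=dev_data)
# trainer.train()
```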

## Bulgarian

|                 | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **NER** | **NER**  |
|-----------------|:--------:|:--------:|:--------:|:--------:|:-------:|:--------:|
|                 | **Dev**  | **Test** | **Dev**  | **Test** | **Dev** | **Test** |
| **XLM-R-base**  | 99.2     | 99.4     | 98.0     | 98.3     | 93.2    | 92.9     |
| **XLM-R-large** | 99.3     | 99.4     | 97.4     | 97.7     | 93.7    | 93.5     |
| **BERTovski**   | 98.8     | 99.1     | 97.6     | 97.8     | 93.5    | 93.3     |

## Macedonian

|                 | **UPOS** | **UPOS** | **XPOS** | **XPOS** | **NER** | **NER**  |
|-----------------|:--------:|:--------:|:--------:|:--------:|:-------:|:--------:|
|                 | **Dev**  | **Test** | **Dev**  | **Test** | **Dev** | **Test** |
| **XLM-R-base**  | 98.3     | 98.6     | 97.3     | 97.1     | 92.8    | 94.8     |
| **XLM-R-large** | 98.3     | 98.7     | 97.7     | 97.5     | 93.3    | 95.1     |
| **BERTovski**   | 97.8     | 98.1     | 96.4     | 96.0     | 92.8    | 94.6     |

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```