---
license: cc0-1.0
language:
- mt
tags:
- MaltBERTa
- MaCoCu
---

# Model description

**MaltBERTa** is a large pre-trained language model trained on Maltese texts. It was trained from scratch using the RoBERTa architecture. It was developed as part of the [MaCoCu](https://macocu.eu/) project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

MaltBERTa was trained on 3.2GB of text, which amounts to 439M tokens. It was trained for 100,000 steps with a batch size of 1,024.

The training and fine-tuning procedures are described in detail on our [GitHub repo](https://github.com/macocu/LanguageModels).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModel.from_pretrained("RVN/MaltBERTa")  # PyTorch
model = TFAutoModel.from_pretrained("RVN/MaltBERTa")  # TensorFlow
```
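
Since MaltBERTa was pre-trained with a masked-language-modelling objective, it can also be queried directly through the `fill-mask` pipeline. The snippet below is a minimal sketch rather than an official example: it assumes the released checkpoint includes the masked-LM head, and the Maltese prompt is only illustrative.

```python
from transformers import pipeline

# RoBERTa-style models use "<mask>" as the mask token.
unmasker = pipeline("fill-mask", model="RVN/MaltBERTa")

# Illustrative prompt, roughly: "The capital city of Malta is <mask>."
for prediction in unmasker("Il-belt kapitali ta' Malta hija <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```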

# Data

For training, we used all Maltese data present in the [MaCoCu](https://macocu.eu/), OSCAR and mC4 corpora. After de-duplicating the data, we were left with a total of 3.2GB of text. We also ran experiments in which we trained only on the data coming from the .mt domain in OSCAR and mC4, but obtained better performance by incorporating all data.
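
This card does not specify the de-duplication method that was used; purely as an illustration, the sketch below performs a simple exact, document-level de-duplication with a hash set. The file layout and the choice of hashing whole documents are assumptions, not the actual MaCoCu pipeline.

```python
import hashlib
from pathlib import Path


def deduplicate(input_dir: str, output_file: str) -> None:
    """Keep only the first occurrence of each exact document (illustrative only)."""
    seen: set[str] = set()
    kept = dropped = 0
    with open(output_file, "w", encoding="utf-8") as out:
        # Assumption: one document per *.txt file under input_dir.
        for path in sorted(Path(input_dir).glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
            if digest in seen:
                dropped += 1
                continue
            seen.add(digest)
            out.write(text + "\n")
            kept += 1
    print(f"kept {kept} documents, dropped {dropped} exact duplicates")


# Hypothetical paths:
# deduplicate("maltese_raw", "maltese_dedup.txt")
```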

# Benchmark performance

We tested the performance of MaltBERTa on the UPOS and XPOS benchmarks of the [Universal Dependencies](https://universaldependencies.org/) project. We compare performance to the strong multilingual models XLM-R-base and XLM-R-large, though note that Maltese was not one of their training languages. We also compare to the recently introduced Maltese language models [BERTu](https://huggingface.co/MLRS/BERTu) and [mBERTu](https://huggingface.co/MLRS/mBERTu). For details regarding the fine-tuning procedure you can check out our [GitHub](https://github.com/macocu/LanguageModels).

Scores are averages of three runs. We use the same hyperparameter settings for all models.

|                 | **UPOS** | **UPOS** | **XPOS** | **XPOS** |
|-----------------|:--------:|:--------:|:--------:|:--------:|
|                 | **Dev**  | **Test** | **Dev**  | **Test** |
| **XLM-R-base**  | 93.6     | 93.2     | 93.4     | 93.2     |
| **XLM-R-large** | 94.9     | 94.4     | 95.1     | 94.7     |
| **BERTu**       | 97.5     | 97.6     | 95.7     | 95.8     |
| **mBERTu**      | 97.7     | 97.8     | 97.9     | 98.1     |
| **MaltBERTa**   | 95.7     | 95.8     | 96.1     | 96.0     |

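
The exact fine-tuning recipe is documented in the GitHub repo linked above. The snippet below is only a minimal sketch of how a token-classification (POS-tagging) head could be placed on top of MaltBERTa with the Transformers API; the label list and hyperparameters are placeholders, not the settings behind the scores above.

```python
from transformers import (
    AutoModelForTokenClassification,
    AutoTokenizer,
    TrainingArguments,
)

# Placeholder label set; the real UPOS tag inventory comes from the
# Universal Dependencies treebank used for fine-tuning.
upos_labels = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "ADP", "DET", "PUNCT"]

tokenizer = AutoTokenizer.from_pretrained("RVN/MaltBERTa")
model = AutoModelForTokenClassification.from_pretrained(
    "RVN/MaltBERTa",
    num_labels=len(upos_labels),
    id2label=dict(enumerate(upos_labels)),
    label2id={label: i for i, label in enumerate(upos_labels)},
)

# Illustrative hyperparameters only; not the ones used for the table above.
training_args = TrainingArguments(
    output_dir="maltberta-upos",
    learning_rate=5e-5,
    per_device_train_batch_size=32,
    num_train_epochs=3,
)
```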

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```