---
license: cc0-1.0
language:
- tr
tags:
- MaCoCu
---

# Model description

**XLMR-MaCoCu-tr** is a large pre-trained language model trained on **Turkish** texts. It was created by continuing training from the [XLM-RoBERTa-large](https://huggingface.co/xlm-roberta-large) model. It was developed as part of the [MaCoCu](https://macocu.eu/) project and only uses data that was crawled during the project. The main developer is [Rik van Noord](https://www.rikvannoord.nl/) from the University of Groningen.

XLMR-MaCoCu-tr was trained on 35 GB of Turkish text, which equals 4.4B tokens. It was trained for 70,000 steps with a batch size of 1,024. It uses the same vocabulary as the original XLMR-large model.
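As a rough sanity check on these numbers, the short sketch below works out how many training sequences the model saw. The step count, batch size and corpus size come from this card; the sequence length of 512 tokens is an assumption (it is not stated here), so the resulting number of passes over the corpus is only an upper-bound style estimate.

```python
# Figures stated in this model card.
STEPS = 70_000
BATCH_SIZE = 1_024
CORPUS_TOKENS = 4.4e9

# ASSUMPTION: sequences of 512 tokens. The card does not state the
# sequence length, so treat everything derived from it as an estimate.
ASSUMED_SEQ_LEN = 512


def sequences_seen(steps: int, batch_size: int) -> int:
    """Number of training sequences processed over the whole run."""
    return steps * batch_size


def approx_epochs(steps: int, batch_size: int, seq_len: int, corpus_tokens: float) -> float:
    """Approximate passes over the corpus if every sequence were full length."""
    return sequences_seen(steps, batch_size) * seq_len / corpus_tokens


total_sequences = sequences_seen(STEPS, BATCH_SIZE)
epochs = approx_epochs(STEPS, BATCH_SIZE, ASSUMED_SEQ_LEN, CORPUS_TOKENS)

print(f"sequences seen: {total_sequences:,}")
print(f"approx. passes over the corpus (assuming {ASSUMED_SEQ_LEN} tokens/sequence): {epochs:.1f}")
```

The sequence count (70,000 × 1,024) follows directly from the card; only the epoch estimate depends on the assumed sequence length.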

The training and fine-tuning procedures are described in detail in our [GitHub repo](https://github.com/macocu/LanguageModels).

# How to use

```python
from transformers import AutoTokenizer, AutoModel, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("RVN/XLMR-MaCoCu-tr")

# Pick one of the two lines below, depending on your framework.
model = AutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")  # PyTorch
model = TFAutoModel.from_pretrained("RVN/XLMR-MaCoCu-tr")  # TensorFlow
```
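Once the model is loaded, one possible way to use it is as a feature extractor. The sketch below is a minimal PyTorch example, not part of the original instructions: the helper name `embed_sentences` and the choice of mean pooling over non-padding tokens are illustrative assumptions. The imports sit inside the function so that defining it does not require `torch` or `transformers` to be installed, and nothing is downloaded until it is called.

```python
MODEL_NAME = "RVN/XLMR-MaCoCu-tr"


def embed_sentences(sentences, model_name=MODEL_NAME):
    """Return one mean-pooled vector per sentence (hypothetical helper).

    Mean pooling is an illustrative choice; the model card itself only
    describes fine-tuning on downstream tasks.
    """
    # Imported lazily so the heavy dependencies are only needed on call.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()

    # Pad to the longest sentence so the batch forms a single tensor.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

    # Average only over real tokens; padding positions have mask 0.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)
```

Calling `embed_sentences(["Bugün hava çok güzel.", "Kitap okumayı severim."])` downloads the checkpoint on first use and should return a tensor with one row per input sentence. Embeddings from a model that has not been fine-tuned are not tuned for sentence similarity, so treat them as a starting point.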

# Data

For training, we used all Turkish data that was present in the monolingual Turkish [MaCoCu](https://macocu.eu/) corpus. After de-duplicating the data, we were left with a total of 35 GB of text, which equals 4.4 billion tokens.

# Benchmark performance

We tested the performance of **XLMR-MaCoCu-tr** on the XPOS, UPOS and NER benchmarks from the [Universal Dependencies](https://universaldependencies.org/) project. For COPA, we train on a machine-translated (MT) version of the data (for details, see our [GitHub repo](https://github.com/RikVN/COPA)) and evaluate on a similar MT set, as well as on the human-translated (HT) test set from the [XCOPA](https://github.com/cambridgeltl/xcopa) project. We compare performance to the strong multilingual models XLMR-base and XLMR-large, and to the monolingual [BERTurk](https://huggingface.co/dbmdz/bert-base-turkish-cased) model. For details on the fine-tuning procedure, see our [GitHub repo](https://github.com/macocu/LanguageModels).

Scores are averages of three runs, except for COPA, for which we use 10 runs. For POS/NER, we use the same hyperparameter settings for all models; for COPA, we optimized the learning rate of each model on the dev set.

|                    | **UPOS Dev** | **UPOS Test** | **XPOS Dev** | **XPOS Test** | **NER Dev** | **NER Test** | **COPA Test (MT)** | **COPA Test (HT)** |
|--------------------|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|:--------:|
| **XLM-R-base**     |   89.0   |   89.0   |   90.4   |   90.6   |   92.8  |   92.6   | 56.0 |   56.4  |
| **XLM-R-large**    |   89.4   |   89.3   |   90.8   |   90.7   |   94.1  |   94.1   | 52.1 |   53.2  |
| **BERTurk**        |   88.2   |   88.4   |   89.7   |   89.6   |   92.6  |   92.6   | 57.0 |     |
| **XLMR-MaCoCu-tr** |   89.1   |   89.4   |   90.7   |   90.5   |   94.4  |   94.4   | 60.7 |     |
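
To make the table easier to compare column by column, the sketch below copies the reported scores into a small Python structure and lists the best model per column. The numbers are taken from the table as-is; cells that are empty in the table are stored as `None` and skipped, not filled in. The helper name `best_per_column` is illustrative.

```python
COLUMNS = (
    "UPOS dev", "UPOS test", "XPOS dev", "XPOS test",
    "NER dev", "NER test", "COPA test (MT)", "COPA test (HT)",
)

# Scores copied from the table; None marks a cell the table leaves empty.
SCORES = {
    "XLM-R-base":     (89.0, 89.0, 90.4, 90.6, 92.8, 92.6, 56.0, 56.4),
    "XLM-R-large":    (89.4, 89.3, 90.8, 90.7, 94.1, 94.1, 52.1, 53.2),
    "BERTurk":        (88.2, 88.4, 89.7, 89.6, 92.6, 92.6, 57.0, None),
    "XLMR-MaCoCu-tr": (89.1, 89.4, 90.7, 90.5, 94.4, 94.4, 60.7, None),
}


def best_per_column(scores, columns):
    """Map each column to the model(s) with the highest reported score."""
    best = {}
    for i, column in enumerate(columns):
        # Ignore models with no reported value in this column.
        reported = {m: row[i] for m, row in scores.items() if row[i] is not None}
        top = max(reported.values())
        best[column] = sorted(m for m, v in reported.items() if v == top)
    return best


best = best_per_column(SCORES, COLUMNS)
for column, models in best.items():
    print(f"{column}: {', '.join(models)}")
```

For the COPA (HT) column, the result only covers the models with a reported score.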

# Acknowledgements

Research supported with Cloud TPUs from Google's TPU Research Cloud (TRC). The authors received funding from the European Union’s Connecting Europe Facility 2014-2020 - CEF Telecom, under Grant Agreement No. INEA/CEF/ICT/A2020/2278341 (MaCoCu).

# Citation

If you use this model, please cite the following paper:

```bibtex
@inproceedings{non-etal-2022-macocu,
    title = "{M}a{C}o{C}u: Massive collection and curation of monolingual and bilingual data: focus on under-resourced languages",
    author = "Ba{\~n}{\'o}n, Marta  and
      Espl{\`a}-Gomis, Miquel  and
      Forcada, Mikel L.  and
      Garc{\'\i}a-Romero, Cristian  and
      Kuzman, Taja  and
      Ljube{\v{s}}i{\'c}, Nikola  and
      van Noord, Rik  and
      Sempere, Leopoldo Pla  and
      Ram{\'\i}rez-S{\'a}nchez, Gema  and
      Rupnik, Peter  and
      Suchomel, V{\'\i}t  and
      Toral, Antonio  and
      van der Werff, Tobias  and
      Zaragoza, Jaume",
    booktitle = "Proceedings of the 23rd Annual Conference of the European Association for Machine Translation",
    month = jun,
    year = "2022",
    address = "Ghent, Belgium",
    publisher = "European Association for Machine Translation",
    url = "https://aclanthology.org/2022.eamt-1.41",
    pages = "303--304"
}
```