---
language: 
- ro
---

# RoBERT-base


## Pretrained BERT model for Romanian 

Model pretrained on Romanian text with the masked language modeling (MLM) and next sentence prediction (NSP) objectives.
It was introduced in this [paper](https://www.aclweb.org/anthology/2020.coling-main.581/). Three BERT models were released: RoBERT-small, **RoBERT-base** and RoBERT-large, all uncased.

| Model          | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy |
|----------------|:-------:|:----------:|:---------------:|:-------------------:|:------------:|:------------:|
| RoBERT-small   | 19M     | 12         | 256             | 8                   | 0.5363       | 0.9687       |
| *RoBERT-base*  | *114M*  | *12*       | *768*           | *12*                | *0.6511*     | *0.9802*     |
| RoBERT-large   | 341M    | 24         | 1024            | 24                  | 0.6929       | 0.9843       |
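
These sizes can be read directly from the checkpoint configuration; a minimal sketch, assuming the standard `transformers` `BertConfig` attribute names:

```python
from transformers import AutoConfig

# Load only the configuration (no weights) of the base checkpoint
config = AutoConfig.from_pretrained("readerbench/RoBERT-base")

# Standard BertConfig fields corresponding to the L / H / A columns above
print("Layers (L):         ", config.num_hidden_layers)
print("Hidden size (H):    ", config.hidden_size)
print("Attention heads (A):", config.num_attention_heads)
```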




All models are available:

* [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
* [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
* [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)



#### How to use

```python
# TensorFlow
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-base")
# "exemplu de propoziție" = "example of a sentence"
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)

# PyTorch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
model = AutoModel.from_pretrained("readerbench/RoBERT-base")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
```
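
Since pretraining used an MLM head, the checkpoint can also be queried for masked-token predictions. A minimal sketch with the standard `fill-mask` pipeline (the example sentence is illustrative, and the mask token is taken from the tokenizer rather than assumed):

```python
from transformers import pipeline

# Loads the checkpoint together with its masked-language-modeling head
fill_mask = pipeline("fill-mask", model="readerbench/RoBERT-base")

# "un exemplu de ..." = "an example of ..."
mask = fill_mask.tokenizer.mask_token
for prediction in fill_mask(f"un exemplu de {mask}"):
    print(prediction["token_str"], prediction["score"])
```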


## Training data

The model was trained on the following compilation of corpora; the statistics below are reported after the cleaning process.

| Corpus    | Words     | Sentences | Size (GB)|
|-----------|:---------:|:---------:|:--------:|
| Oscar     | 1.78B     | 87M       | 10.8     |
| RoTex     | 240M      | 14M       | 1.5      |
| RoWiki    | 50M       | 2M        | 0.3      |
| **Total** | **2.07B** | **103M**  | **12.6** |


## Downstream performance

### Sentiment analysis

We report the macro-averaged F1 score (in %).

| Model            | Dev      | Test     |
|------------------|:--------:|:--------:|
| multilingual-BERT| 68.96    | 69.57    |
| XLM-R-base       | 71.26    | 71.71    |
| BERT-base-ro     | 70.49    | 71.02    |
| RoBERT-small     | 66.32    | 66.37    |
| *RoBERT-base*    | *70.89*  | *71.61*  |
| RoBERT-large     | **72.48**| **72.11**|
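
The scores above come from fine-tuning on the sentiment corpus described in the paper. As an illustration of the metric only, here is a self-contained sketch that puts a (randomly initialised) classification head on top of RoBERT-base and computes macro-averaged F1; the binary label set and toy sentences are placeholders, not the paper's setup:

```python
import numpy as np
import torch
from sklearn.metrics import f1_score
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-base")
# The classification head is newly initialised and would need fine-tuning on a
# labelled sentiment corpus before its predictions are meaningful.
model = AutoModelForSequenceClassification.from_pretrained(
    "readerbench/RoBERT-base", num_labels=2)  # binary labels are an assumption

texts = ["un film excelent", "o experiență dezamăgitoare"]  # toy examples
labels = np.array([1, 0])

with torch.no_grad():
    logits = model(**tokenizer(texts, return_tensors="pt", padding=True)).logits
preds = logits.argmax(dim=-1).numpy()

# Macro-averaged F1, the metric reported in the table above
print("macro F1:", f1_score(labels, preds, average="macro"))
```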

### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification

We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian dialect classification and cross-dialect topic identification tasks, as macro-averaged F1 scores (in %).

| Model             | Dialect Classification | MD to RO | RO to MD |
|-------------------|:----------------------:|:--------:|:--------:|
| 2-CNN + SVM       | 93.40                  | 65.09    | 75.21    |
| Char+Word SVM     | 96.20                  | 69.08    | 81.93    |
| BiGRU             | 93.30                  | **70.10**| 80.30    |
| multilingual-BERT | 95.34                  | 68.76    | 78.24    |
| XLM-R-base        | 96.28                  | 69.93    | 82.28    |
| BERT-base-ro      | 96.20                  | 69.93    | 78.79    |
| RoBERT-small      | 95.67                  | 69.01    | 80.40    |
| *RoBERT-base*     | *97.39*                | *68.30*  | *81.09*  |
| RoBERT-large      | **97.78**              | 69.91    | **83.65**|

### Diacritics Restoration

The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/). We report results on the official test set, as accuracy (in %).

| Model                       | word level | char level |
|-----------------------------|:----------:|:----------:|
| BiLSTM                      | 99.42      | -          |
| CharCNN                     | 98.40      | 99.65      |
| CharCNN + multilingual-BERT | 99.72      | 99.94      |
| CharCNN + XLM-R-base        | 99.76      | **99.95**  |
| CharCNN + BERT-base-ro      | **99.79**  | **99.95**  |
| CharCNN + RoBERT-small      | 99.73      | 99.94      |
| *CharCNN + RoBERT-base*     | *99.78*    | **99.95**  |
| CharCNN + RoBERT-large      | 99.76      | **99.95**  |


### BibTeX entry and citation info

```bibtex
@inproceedings{masala2020robert,
  title={RoBERT--A Romanian BERT Model},
  author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
  booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
  pages={6626--6637},
  year={2020}
}
```