---
language:
- ro
---
# RoBERT-small
## Pretrained BERT model for Romanian
A BERT model pretrained on Romanian text with masked language modeling (MLM) and next sentence prediction (NSP) objectives.
It was introduced in this [paper](https://www.aclweb.org/anthology/2020.coling-main.581/). Three BERT models were released: **RoBERT-small**, RoBERT-base and RoBERT-large, all uncased.
| Model | Weights | Layers (L) | Hidden size (H) | Attention heads (A) | MLM accuracy | NSP accuracy |
|----------------|:---------:|:------:|:------:|:------:|:--------------:|:--------------:|
| *RoBERT-small* | *19M* | *12* | *256* | *8* | *0.5363* | *0.9687* |
| RoBERT-base | 114M | 12 | 768 | 12 | 0.6511 | 0.9802 |
| RoBERT-large | 341M | 24 | 1024 | 24 | 0.6929 | 0.9843 |
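
As a rough sanity check on the Weights column, the sketch below estimates the parameter count from L and H. It assumes the standard BERT layout (feed-forward inner size of 4H) and a vocabulary of about 38,000 tokens; neither value is stated in this card, and `estimate_bert_params` is a hypothetical helper, not part of any library. Biases, LayerNorm, position and segment embeddings and the pooler are ignored, and A does not change the count. Under these assumptions the estimates come out at roughly 19M, 114M and 341M.

```python
def estimate_bert_params(layers: int, hidden: int, vocab_size: int = 38_000) -> int:
    """Rough BERT parameter count (weights only).

    vocab_size=38_000 and the 4*hidden feed-forward size are assumptions,
    not values taken from the model card.
    """
    attention = 4 * hidden * hidden             # Q, K, V and output projections
    feed_forward = 2 * hidden * (4 * hidden)    # two dense layers, inner size 4H
    token_embeddings = vocab_size * hidden      # word-piece embedding matrix
    return layers * (attention + feed_forward) + token_embeddings


configs = [
    ("RoBERT-small", 12, 256),
    ("RoBERT-base", 12, 768),
    ("RoBERT-large", 24, 1024),
]
for name, layers, hidden in configs:
    millions = estimate_bert_params(layers, hidden) / 1e6
    print(f"{name}: ~{millions:.1f}M parameters")
```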
All models are available on the Hugging Face Hub:
* [RoBERT-small](https://huggingface.co/readerbench/RoBERT-small)
* [RoBERT-base](https://huggingface.co/readerbench/RoBERT-base)
* [RoBERT-large](https://huggingface.co/readerbench/RoBERT-large)
#### How to use
```python
# TensorFlow
from transformers import AutoTokenizer, TFAutoModel

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = TFAutoModel.from_pretrained("readerbench/RoBERT-small")
# "exemplu de propoziție" means "example sentence" in Romanian
inputs = tokenizer("exemplu de propoziție", return_tensors="tf")
outputs = model(inputs)

# PyTorch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("readerbench/RoBERT-small")
model = AutoModel.from_pretrained("readerbench/RoBERT-small")
inputs = tokenizer("exemplu de propoziție", return_tensors="pt")
outputs = model(**inputs)
```
## Training data
The model was trained on the corpora listed below. All statistics are reported after cleaning.
| Corpus | Words | Sentences | Size (GB)|
|-----------|:---------:|:---------:|:--------:|
| Oscar | 1.78B | 87M | 10.8 |
| RoTex | 240M | 14M | 1.5 |
| RoWiki | 50M | 2M | 0.3 |
| **Total** | **2.07B** | **103M** | **12.6** |
## Downstream performance
### Sentiment analysis
We report the macro-averaged F1 score (in %).
| Model | Dev | Test |
|------------------|:--------:|:--------:|
| multilingual-BERT| 68.96 | 69.57 |
| XLM-R-base | 71.26 | 71.71 |
| BERT-base-ro | 70.49 | 71.02 |
| *RoBERT-small* | *66.32* | *66.37* |
| RoBERT-base | 70.89 | 71.61 |
| RoBERT-large | **72.48**| **72.11**|
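
For readers unfamiliar with the metric, here is a minimal sketch of macro-averaged F1: the per-class F1 scores are averaged with equal weight, regardless of class frequency. The helper `macro_f1` and the toy labels are illustrative assumptions; this is not the evaluation script behind the numbers above.

```python
from collections import Counter


def macro_f1(gold, predicted):
    """Macro-averaged F1 in percent over all labels seen in gold or predicted."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in set(gold) | set(predicted):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return 100.0 * sum(scores) / len(scores)


# Toy example with hypothetical sentiment labels
gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(f"{macro_f1(gold, pred):.2f}")
```

In the toy example, the F1 is 2/3 for `pos` and 4/5 for `neg`, so the macro average is about 73.33.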
### Moldavian vs. Romanian Dialect and Cross-dialect Topic identification
We report results on the [VarDial 2019](https://sites.google.com/view/vardial2019/campaign) Moldavian vs. Romanian Cross-dialect Topic identification challenge, as macro-averaged F1 scores (in %).
| Model | Dialect Classification | MD to RO | RO to MD |
|-------------------|:----------------------:|:--------:|:--------:|
| 2-CNN + SVM | 93.40 | 65.09 | 75.21 |
| Char+Word SVM | 96.20 | 69.08 | 81.93 |
| BiGRU | 93.30 | **70.10**| 80.30 |
| multilingual-BERT | 95.34 | 68.76 | 78.24 |
| XLM-R-base | 96.28 | 69.93 | 82.28 |
| BERT-base-ro | 96.20 | 69.93 | 78.79 |
| *RoBERT-small* | *95.67* | *69.01* | *80.40* |
| RoBERT-base | 97.39 | 68.30 | 81.09 |
| RoBERT-large | **97.78** | 69.91 | **83.65**|
### Diacritics Restoration
The challenge can be found [here](https://diacritics-challenge.speed.pub.ro/). We report accuracy (in %) on the official test set.
| Model | word level | char level |
|-----------------------------|:----------:|:----------:|
| BiLSTM | 99.42 | - |
| CharCNN | 98.40 | 99.65 |
| CharCNN + multilingual-BERT | 99.72 | 99.94 |
| CharCNN + XLM-R-base | 99.76 | **99.95** |
| CharCNN + BERT-base-ro | **99.79** | **99.95** |
| *CharCNN + RoBERT-small* | *99.73* | *99.94* |
| CharCNN + RoBERT-base | 99.78 | **99.95** |
| CharCNN + RoBERT-large | 99.76 | **99.95** |
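
To illustrate the difference between the two columns, the sketch below computes word-level and character-level accuracy for a restored sentence. It assumes that restoration only substitutes characters (so gold and predicted texts align one-to-one) and that every word and character is scored; the official challenge may score only a subset, so treat this as one plausible reading rather than the official metric. The function names are hypothetical.

```python
import unicodedata


def _nfc(text: str) -> str:
    # Normalize so that composed and decomposed diacritics compare equal
    return unicodedata.normalize("NFC", text)


def word_accuracy(gold: str, predicted: str) -> float:
    """Percentage of whitespace-separated words restored exactly."""
    gold_words, pred_words = _nfc(gold).split(), _nfc(predicted).split()
    if not gold_words or len(gold_words) != len(pred_words):
        raise ValueError("texts must be non-empty and have the same word count")
    correct = sum(g == p for g, p in zip(gold_words, pred_words))
    return 100.0 * correct / len(gold_words)


def char_accuracy(gold: str, predicted: str) -> float:
    """Percentage of characters (spaces included) restored exactly."""
    gold_chars, pred_chars = _nfc(gold), _nfc(predicted)
    if not gold_chars or len(gold_chars) != len(pred_chars):
        raise ValueError("texts must be non-empty and have the same length")
    correct = sum(g == p for g, p in zip(gold_chars, pred_chars))
    return 100.0 * correct / len(gold_chars)


gold = "și în țară"        # reference text with diacritics
predicted = "si în țară"   # system output that missed one diacritic
print(f"word: {word_accuracy(gold, predicted):.2f}")
print(f"char: {char_accuracy(gold, predicted):.2f}")
```

A single missed diacritic costs one of three words but only one of ten characters, which is why character-level figures sit consistently above word-level ones.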
### BibTeX entry and citation info
```bibtex
@inproceedings{masala2020robert,
title={RoBERT--A Romanian BERT Model},
author={Masala, Mihai and Ruseti, Stefan and Dascalu, Mihai},
booktitle={Proceedings of the 28th International Conference on Computational Linguistics},
pages={6626--6637},
year={2020}
}
```