---
language: ro
tags:
- bert
- fill-mask
license: mit
---
# bert-base-romanian-uncased-v1
The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
```
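The snippet above returns token-level hidden states. The card does not prescribe a pooling strategy, but a common way to turn these into a single sentence vector is mean pooling over the non-padding tokens. The `mean_pool` helper below is an illustrative sketch, not part of the released model:

```python
import torch

# Illustrative helper (not part of this repo): mean-pool token embeddings
# into one fixed-size sentence vector, ignoring padding positions.
def mean_pool(last_hidden_states, attention_mask):
    mask = attention_mask.unsqueeze(-1).type_as(last_hidden_states)  # (batch, seq, 1)
    summed = (last_hidden_states * mask).sum(dim=1)                  # sum over real tokens
    counts = mask.sum(dim=1).clamp(min=1e-9)                         # number of real tokens
    return summed / counts                                           # (batch, hidden_size)

encoded = tokenizer("Acesta este un test.", return_tensors="pt")
with torch.no_grad():
    out = model(**encoded)
sentence_embedding = mean_pool(out[0], encoded["attention_mask"])    # shape: (1, 768) for BERT base
```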
Remember to always sanitize your text! Replace the ``s`` and ``t`` cedilla letters with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **NOT** trained on cedilla ``s`` and ``t`` letters. If you skip this step, performance will degrade due to ``<UNK>`` tokens and an increased number of tokens per word.
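A quick, illustrative way to see the effect is to tokenize the same phrase before and after sanitization (using the tokenizer loaded above; the example words are placeholders):

```python
# Compare tokenization of cedilla letters vs. the comma-below letters the model was trained on.
raw = "paşte şi ţine"                                                  # cedilla s/t variants
clean = raw.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

print(tokenizer.tokenize(raw))    # typically more subword pieces and/or [UNK] tokens
print(tokenizer.tokenize(clean))  # typically fewer, cleaner subword pieces
```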
### Parameters:
| Parameter | Value |
|------------------|-------|
| Batch size | 16 |
| Training steps | 256k |
| Warmup steps | 500 |
| Uncased | True |
| Max. Seq. Length | 512 |
### Evaluation
Evaluation is performed on the Romanian STS-B dataset.
| Model | Spearman | Pearson |
|--------------------------------|:-----:|:------:|
| bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 |
| sentence-bert-base-romanian-uncased-v1 | **0.84** | **0.84** |
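The evaluation script is not included in this card. A minimal sketch of how such Spearman/Pearson scores can be computed, assuming a list of sentence pairs with gold similarity scores and reusing the `mean_pool` helper sketched earlier (the example pairs and scores below are placeholders, not real STS-B data):

```python
import torch
from scipy.stats import spearmanr, pearsonr

def embed(sentences):
    # Encode a batch of sentences and mean-pool into fixed-size vectors.
    enc = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        out = model(**enc)
    return mean_pool(out[0], enc["attention_mask"])

# Placeholder pairs and gold similarity scores standing in for Romanian STS-B.
pairs = [("O pisică doarme.", "Pisica se odihnește."), ("El cântă.", "Ea gătește.")]
gold = [4.5, 0.5]

emb1 = embed([a for a, _ in pairs])
emb2 = embed([b for _, b in pairs])
sims = torch.nn.functional.cosine_similarity(emb1, emb2).tolist()

print("Spearman:", spearmanr(sims, gold).correlation)
print("Pearson:", pearsonr(sims, gold)[0])
```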
### Corpus
#### Pretraining
The model is trained on the following corpora (stats in the table below are after cleaning):
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|-----------|:--------:|:--------:|:--------:|:--------:|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
#### Finetuning
The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian), as sketched below.
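The card does not specify the fine-tuning setup. One common recipe for training a sentence-BERT model on NLI data uses the `sentence-transformers` library with a multiple-negatives ranking loss; the snippet below is only an illustrative sketch with made-up example rows and assumed hyperparameters, not the actual training script:

```python
from sentence_transformers import SentenceTransformer, InputExample, losses, models
from torch.utils.data import DataLoader

# Build a SentenceTransformer on top of the Romanian BERT checkpoint, with mean pooling.
word_emb = models.Transformer("dumitrescustefan/bert-base-romanian-uncased-v1", max_seq_length=512)
pooling = models.Pooling(word_emb.get_word_embedding_dimension(), pooling_mode="mean")
sbert = SentenceTransformer(modules=[word_emb, pooling])

# Placeholder rows standing in for RO_MNLI (premise, entailed hypothesis) pairs.
train_examples = [
    InputExample(texts=["Un bărbat citește o carte.", "Cineva citește."]),
    InputExample(texts=["Copiii se joacă în parc.", "Copiii sunt afară."]),
]
train_loader = DataLoader(train_examples, shuffle=True, batch_size=16)
loss = losses.MultipleNegativesRankingLoss(sbert)

sbert.fit(train_objectives=[(train_loader, loss)], epochs=1, warmup_steps=500)
```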
### Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
```
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
```
or, in bibtex:
```
@inproceedings{dumitrescu-etal-2020-birth,
title = "The birth of {R}omanian {BERT}",
author = "Dumitrescu, Stefan and
Avram, Andrei-Marius and
Pyysalo, Sampo",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.387",
doi = "10.18653/v1/2020.findings-emnlp.387",
pages = "4324--4328",
}
```
#### Acknowledgements
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!