metadata
language: ro
tags:
- bert
- fill-mask
license: mit
bert-base-romanian-uncased-v1
The BERT base, uncased model for Romanian, trained on a 15GB corpus, version 1.0.
How to use
from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")
# tokenize a sentence and run it through the model
inputs = tokenizer("Acesta este un test.", return_tensors="pt")  # batch size 1, special tokens are added by default
with torch.no_grad():  # inference only, so gradients are not needed
    outputs = model(**inputs)
# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output, shape (batch, tokens, hidden size)
Remember to always sanitize your text! Replace the s and t cedilla-letters with comma-letters using:
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
because the model was NOT trained on cedilla s and t. If you don't, performance will drop because of <UNK> tokens and a higher number of tokens per word.
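The same replacement can be wrapped in a small reusable helper. This is only a sketch: the name `sanitize_ro` and the sample sentence are made up for illustration, and the mapping covers exactly the four characters mentioned above, written as Unicode escapes so the code does not depend on the file encoding.

```python
# Cedilla forms (left) mapped to the comma-below forms (right) that the model was trained on.
_CEDILLA_TO_COMMA = str.maketrans({
    "\u0163": "\u021b",  # t with cedilla -> t with comma below
    "\u015f": "\u0219",  # s with cedilla -> s with comma below
    "\u0162": "\u021a",  # T with cedilla -> T with comma below
    "\u015e": "\u0218",  # S with cedilla -> S with comma below
})


def sanitize_ro(text: str) -> str:
    """Replace cedilla s/t with comma-below s/t (hypothetical helper name)."""
    return text.translate(_CEDILLA_TO_COMMA)


# Illustrative sample containing cedilla letters.
sample = "\u015ecoala este \u00een Timi\u015foara, l\u00e2ng\u0103 pia\u0163\u0103."
cleaned = sanitize_ro(sample)
print(cleaned)
```

Calling `sanitize_ro` on every input before tokenization keeps the preprocessing in one place; `str.translate` does all four substitutions in a single pass.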
Parameters:
Parameter | Value |
---|---|
Batch size | 16 |
Training steps | 256k |
Warmup steps | 500 |
Uncased | True |
Max. Seq. Length | 512 |
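For a rough sense of scale, the table values can be combined with some quick arithmetic. The sketch below assumes that "256k" means 256,000 steps and that each step processes exactly one batch of 16 sequences (no gradient accumulation, one device); neither assumption is stated in this card, so treat the results as estimates.

```python
# Values copied from the parameter table.
batch_size = 16
training_steps = 256_000  # assumption: "256k" is read as 256,000 steps
warmup_steps = 500
max_seq_length = 512

# Assumption: one batch per step, no gradient accumulation.
sequences_seen = batch_size * training_steps

# Upper bound only: it treats every sequence as using the full 512 tokens.
max_tokens_seen = sequences_seen * max_seq_length

# Share of the training run spent in warmup.
warmup_fraction = warmup_steps / training_steps

print(f"sequences processed: {sequences_seen:,}")
print(f"upper bound on tokens processed: {max_tokens_seen:,}")
print(f"warmup share of training: {warmup_fraction:.4%}")
```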
Evaluation
Evaluation is performed on the Romanian STSb dataset.
Model | Spearman | Pearson |
---|---|---|
bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 |
sentence-bert-base-romanian-uncased-v1 | 0.84 | 0.84 |
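For readers unfamiliar with the two metrics, here is a minimal, dependency-free sketch of how Spearman and Pearson correlations can be computed between predicted similarities and gold scores. It assumes that each sentence pair is scored by the cosine similarity of its two embeddings, which this card does not state explicitly. The vectors and gold scores below are toy values made up for illustration, not data from the benchmark.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy sentence-pair embeddings and gold similarity scores (illustrative only).
embedding_pairs = [
    ([1.0, 0.0], [1.0, 0.1]),
    ([1.0, 0.0], [0.5, 0.5]),
    ([1.0, 0.0], [0.0, 1.0]),
    ([1.0, 1.0], [1.0, 0.9]),
]
gold_scores = [4.8, 3.0, 0.5, 4.5]

predicted = [cosine(u, v) for u, v in embedding_pairs]
print("Pearson: ", pearson(predicted, gold_scores))
print("Spearman:", spearman(predicted, gold_scores))
```

In practice, `scipy.stats.pearsonr` and `scipy.stats.spearmanr` compute the same quantities; the hand-written versions are shown only to make the definitions explicit.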
Corpus
Pretraining
The model is trained on the following corpora (stats in the table below are after cleaning):
Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
---|---|---|---|---|
OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
Total | 90.15 | 2421.33 | 15.867 | 15.2 |
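As a quick sanity check, the sketch below recomputes the totals from the per-corpus rows and derives two extra figures: each corpus's share of the total size and its average number of words per line. The numbers are copied from the table; the variable names are illustrative.

```python
import math

# Columns: lines (millions), words (millions), chars (billions), size (GB).
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Sum each column across the three corpora.
computed_total = tuple(
    sum(row[i] for row in corpora.values()) for i in range(len(reported_total))
)

# Compare with the reported totals, allowing for floating-point rounding.
totals_match = all(
    math.isclose(c, r, abs_tol=1e-6) for c, r in zip(computed_total, reported_total)
)

# Share of the total size, and average words per line, for each corpus.
size_share = {name: row[3] / computed_total[3] for name, row in corpora.items()}
words_per_line = {name: row[1] / row[0] for name, row in corpora.items()}

print("totals match the table:", totals_match)
for name in corpora:
    print(f"{name}: {size_share[name]:.1%} of size, {words_per_line[name]:.1f} words per line")
```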
Finetuning
The model is finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian).
Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
or, in bibtex:
@inproceedings{dumitrescu-etal-2020-birth,
title = "The birth of {R}omanian {BERT}",
author = "Dumitrescu, Stefan and
Avram, Andrei-Marius and
Pyysalo, Sampo",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.387",
doi = "10.18653/v1/2020.findings-emnlp.387",
pages = "4324--4328",
}
Acknowledgements
- We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!