---
language: ro
tags:
- bert
- fill-mask
license: mit
---

# bert-base-romanian-uncased-v1

The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-uncased-v1")

# tokenize a sentence and run it through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # batch size 1
outputs = model(input_ids)

# get the encoding
last_hidden_states = outputs[0]  # the last hidden state is the first element of the output tuple
```

Remember to always sanitize your text! Replace the cedilla letters `ş` and `ţ` with their comma-below counterparts:

```
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```

because the model was **NOT** trained on cedilla `s` and `t`s. If you don't, performance will drop due to unknown (`<UNK>`) tokens and an increased number of tokens per word.

### Parameters:

| Parameter        | Value |
|------------------|-------|
| Batch size       | 16    |
| Training steps   | 256k  |
| Warmup steps     | 500   |
| Uncased          | True  |
| Max. Seq. Length | 512   |

### Evaluation

Evaluation is performed on the Romanian STSb dataset.

| Model                                   | Spearman | Pearson  |
|-----------------------------------------|:--------:|:--------:|
| bert-base-romanian-uncased-v1           | 0.8086   | 0.8159   |
| sentence-bert-base-romanian-uncased-v1  | **0.84** | **0.84** |

### Corpus

#### Pretraining

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
|-----------|:---------:|:-----------:|:----------:|:---------:|
| OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
| OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |

#### Finetuning

The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian).

### Citation

If you use this model in a research paper, I'd kindly ask you to cite the following paper:

```
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
```

or, in BibTeX:

```
@inproceedings{dumitrescu-etal-2020-birth,
    title = "The birth of {R}omanian {BERT}",
    author = "Dumitrescu, Stefan and Avram, Andrei-Marius and Pyysalo, Sampo",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
    month = nov,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.findings-emnlp.387",
    doi = "10.18653/v1/2020.findings-emnlp.387",
    pages = "4324--4328",
}
```

#### Acknowledgements

- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
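
### Example: sentence embeddings

Since the evaluation above is on sentence similarity (STSb), it may help to see how sentence vectors can be obtained from the model. Below is a minimal sketch that mean-pools the last hidden states over non-padding tokens and compares two sentences with cosine similarity. The mean-pooling strategy is an assumption (this card does not state which pooling the fine-tuned model uses), and the example sentences are placeholders.

```python
from transformers import AutoTokenizer, AutoModel
import torch
import torch.nn.functional as F

# NOTE: mean pooling is assumed here; check the model repo for the exact pooling used
model_name = "iliemihai/sentence-bert-base-romanian-uncased-v1"
tokenizer = AutoTokenizer.from_pretrained(model_name, do_lower_case=True)
model = AutoModel.from_pretrained(model_name)

sentences = ["Acesta este un test.", "Acesta este un alt test."]
# sanitize cedilla letters, as recommended above
sentences = [s.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș") for s in sentences]

encoded = tokenizer(sentences, padding=True, truncation=True, max_length=512, return_tensors="pt")
with torch.no_grad():
    output = model(**encoded)

# mean pooling over token embeddings, ignoring padding positions
token_embeddings = output.last_hidden_state                 # (batch, seq_len, hidden)
mask = encoded["attention_mask"].unsqueeze(-1).float()      # (batch, seq_len, 1)
sentence_embeddings = (token_embeddings * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

# cosine similarity between the two sentence vectors
similarity = F.cosine_similarity(sentence_embeddings[0:1], sentence_embeddings[1:2])
print(similarity.item())
```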