---
language: ro
tags:
- bert
- fill-mask
license: mit
---
# sentence-bert-base-romanian-uncased-v1
The sentence-BERT **base**, **uncased** model for Romanian, built on a BERT model trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
### How to use
```python
from transformers import AutoTokenizer, AutoModel
import torch
# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1")
# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0) # Batch size 1
outputs = model(input_ids)
# get encoding
last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
```
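Since this is a sentence-embedding model, you usually want a single fixed-size vector per sentence rather than per-token states. A minimal sketch, assuming mean pooling over the last hidden states (both the pooling strategy and the `mean_pooling` helper are assumptions, not a documented part of the model):
```python
import torch
from transformers import AutoTokenizer, AutoModel

# hypothetical helper: average token embeddings, ignoring padding positions
def mean_pooling(last_hidden_state, attention_mask):
    mask = attention_mask.unsqueeze(-1).float()
    return (last_hidden_state * mask).sum(1) / mask.sum(1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1")

inputs = tokenizer(["Acesta este un test."], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
sentence_embedding = mean_pooling(outputs.last_hidden_state, inputs["attention_mask"])  # shape: (1, hidden_size)
```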
Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **NOT** trained on cedilla ``ş`` and ``ţ``. Otherwise, performance will degrade due to ``<UNK>`` tokens and an increased number of tokens per word.
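To see the effect yourself, compare how the tokenizer splits cedilla versus comma-below text (the example words are illustrative; the exact subword splits depend on the vocabulary):
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)

cedilla_text = "paşi şi ţări"  # cedilla diacritics, absent from the training data
comma_text = cedilla_text.replace("ţ", "ț").replace("ş", "ș")

print(tokenizer.tokenize(cedilla_text))  # expect <UNK> pieces and/or more tokens
print(tokenizer.tokenize(comma_text))    # expect a cleaner subword split
```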
### Parameters:
| Parameter | Value |
|------------------|-------|
| Batch size | 16 |
| Training steps | 256k |
| Warmup steps | 500 |
| Uncased | True |
| Max. Seq. Length | 512 |
### Evaluation
Evaluation is performed on the Romanian STS-B dataset.
| Model                                  | Spearman | Pearson  |
|----------------------------------------|:--------:|:--------:|
| bert-base-romanian-uncased-v1          | 0.8086   | 0.8159   |
| sentence-bert-base-romanian-uncased-v1 | **0.84** | **0.84** |
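For reference, scores like these are typically computed by correlating cosine similarities of the paired sentence embeddings against the gold STS scores. A minimal sketch (the arrays below are random placeholders, not the actual evaluation data):
```python
import torch
from scipy.stats import pearsonr, spearmanr

# placeholder data: substitute real embeddings and gold STS-B scores
emb_a = torch.randn(8, 768)  # embeddings of the first sentence in each pair
emb_b = torch.randn(8, 768)  # embeddings of the second sentence in each pair
gold = torch.rand(8) * 5     # gold similarity scores in [0, 5]

cosine = torch.nn.functional.cosine_similarity(emb_a, emb_b)
print("Spearman:", spearmanr(cosine.numpy(), gold.numpy()).correlation)
print("Pearson: ", pearsonr(cosine.numpy(), gold.numpy())[0])
```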
### Corpus
#### Pretraining
The model is trained on the following corpora (stats in the table below are after cleaning):
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|-----------|:--------:|:--------:|:--------:|:--------:|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |
#### Finetuning
The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian), as sketched below.
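A minimal sketch of how such NLI fine-tuning is commonly done with the `sentence-transformers` library (the training examples, label mapping, and hyperparameters here are placeholders and assumptions, not the actual recipe):
```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# start from the base Romanian BERT; mean pooling is added automatically
model = SentenceTransformer("dumitrescustefan/bert-base-romanian-uncased-v1")

# placeholder NLI pairs; in practice, load RO_MNLI premise/hypothesis pairs here
train_examples = [
    InputExample(texts=["O femeie citește o carte.", "O persoană citește."], label=1),  # assumed: 0=contradiction, 1=entailment, 2=neutral
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.SoftmaxLoss(
    model=model,
    sentence_embedding_dimension=model.get_sentence_embedding_dimension(),
    num_labels=3,
)
model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=500)
```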
### Citation
If you use this model in a research paper, I'd kindly ask you to cite the following paper:
```
Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
```
or, in bibtex:
```
@inproceedings{dumitrescu-etal-2020-birth,
title = "The birth of {R}omanian {BERT}",
author = "Dumitrescu, Stefan and
Avram, Andrei-Marius and
Pyysalo, Sampo",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
month = nov,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.findings-emnlp.387",
doi = "10.18653/v1/2020.findings-emnlp.387",
pages = "4324--4328",
}
```
#### Acknowledgements
- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!