	---
	language: ro
	---

	# bert-base-romanian-cased-v1

	A cased BERT base model for Romanian, trained on a 15 GB corpus. Version: ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

	### How to use

	```python
	from transformers import AutoTokenizer, AutoModel
	import torch

	# load tokenizer and model
	tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
	model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

	# tokenize a sentence (special tokens are added by default) and run it through the model
	inputs = tokenizer("Acesta este un test.", return_tensors="pt")  # batch size 1
	with torch.no_grad():  # inference only, no gradients needed
	    outputs = model(**inputs)

	# get the encoding, shape (batch_size, sequence_length, hidden_size)
	last_hidden_states = outputs[0]  # the last hidden state is the first element of the output
	```
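The snippet above yields one vector per token. The model card does not say how to collapse these into a single sentence vector; one common option is masked mean pooling. The sketch below is an illustration under that assumption, not the authors' method. The helper name `mean_pool` is hypothetical, and synthetic tensors stand in for the model output so the block runs without downloading the model. With the real model you would pass `outputs[0]` and the tokenizer's attention mask instead.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors over the sequence axis, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over hidden_size
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # clamp guards against division by zero if a row contains only padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Synthetic stand-ins for the model output (hidden_size is 768 for BERT base)
hidden = torch.ones(2, 4, 768)   # (batch_size, sequence_length, hidden_size)
hidden[0, 3] = 100.0             # value at a padded position; must not affect the result
mask = torch.tensor([[1, 1, 1, 0], [1, 1, 1, 1]])

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)    # torch.Size([2, 768])
```

Whether mean pooling, the `[CLS]` vector, or some other reduction works best depends on the downstream task, so treat this as a starting point.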

	### Evaluation

	The model is evaluated on the Universal Dependencies [Romanian RRT](https://universaldependencies.org/treebanks/ro_rrt/index.html) treebank (UPOS, XPOS and LAS) and on a NER task based on [RONEC](https://github.com/dumitrescustefan/ronec). Details, along with more in-depth tests not shown here, are given on the dedicated [evaluation page](https://github.com/dumitrescustefan/Romanian-Transformers/tree/master/evaluation/README.md).

	The baseline is the [Multilingual BERT](https://github.com/google-research/bert/blob/master/multilingual.md) model ``bert-base-multilingual-(un)cased``, which, at the time of writing, was the only available BERT model that worked on Romanian.

	| Model | UPOS | XPOS | NER | LAS |
	|--------------------------------|:-----:|:------:|:-----:|:-----:|
	| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
	| bert-base-romanian-cased-v1 | 98.00 | 96.46 | 85.88 | 89.69 |
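To make the gap to the baseline easier to read, the sketch below recomputes it from the table. The scores are copied verbatim from the two rows above. The relative error reduction assumes that every metric is on a 0 to 100 scale and that `100 - score` can be treated as the remaining error, which is a simplification for the NER score in particular.

```python
# Scores copied from the evaluation table above
baseline = {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04}  # bert-base-multilingual-cased
romanian = {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69}  # bert-base-romanian-cased-v1

# Absolute gain in points for each task
gains = {task: round(romanian[task] - baseline[task], 2) for task in baseline}

# Relative error reduction in percent, treating (100 - score) as the error (assumption)
error_reduction = {
    task: 100 * gains[task] / (100 - baseline[task]) for task in baseline
}

for task in baseline:
    print(f"{task}: +{gains[task]:.2f} points, {error_reduction[task]:.1f}% less error")
```

The absolute gains are largest on NER and LAS, the two tasks where the multilingual baseline leaves the most room for improvement.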

	### Corpus

	The model is trained on the following corpora (stats in the table below are after cleaning):

	| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
	|-----------|:--------:|:--------:|:--------:|:--------:|
	| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
	| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
	| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
	| Total | 90.15 | 2421.33 | 15.867 | 15.2 |
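As a quick sanity check, the sketch below sums each column and compares it with the reported Total row, then derives two rough averages per corpus. The figures are copied from the table above. The derived averages are only indicative, since the card does not say how words or characters were counted (for example, whether whitespace is included).

```python
# Per-corpus statistics copied from the table above:
# (lines in millions, words in millions, characters in billions, size in GB)
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Sum each column and compare it with the reported "Total" row
computed_total = tuple(round(sum(column), 3) for column in zip(*corpora.values()))
assert computed_total == reported_total

# Rough averages per corpus: words per line and characters per word
for name, (lines_m, words_m, chars_b, _size_gb) in corpora.items():
    words_per_line = words_m / lines_m
    chars_per_word = chars_b * 1000 / words_m  # billions -> millions of characters
    print(f"{name}: {words_per_line:.1f} words/line, {chars_per_word:.2f} chars/word")
```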

	#### Acknowledgements

	- We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!