language: ro

bert-base-romanian-cased-v1

The cased BERT base model for Romanian, trained on a 15GB corpus, version v1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

# tokenize a sentence (special tokens are added by default) and run it through the model
inputs = tokenizer("Acesta este un test.", return_tensors="pt")  # batch size 1
with torch.no_grad():  # inference only, no gradients needed
    outputs = model(**inputs)

# get the encoding
last_hidden_states = outputs.last_hidden_state  # shape: (batch_size, sequence_length, hidden_size)
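
The snippet above yields one vector per token. If a single vector per sentence is needed, one common option (an assumption here, not something this model card prescribes) is to average the token vectors while ignoring padding. The sketch below uses random stand-in tensors so it runs without downloading the model; the helper name `mean_pool` is hypothetical, and 768 is the hidden size of a BERT base model.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, skipping padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1) so the mask broadcasts over the hidden dimension
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)
    # clamp avoids division by zero for a row that is entirely padding
    counts = mask.sum(dim=1).clamp(min=1.0)
    return summed / counts


# Stand-in tensors: a batch of 2 sentences, 5 token positions, hidden size 768.
hidden = torch.randn(2, 5, 768)
mask = torch.tensor([[1, 1, 1, 1, 1],
                     [1, 1, 1, 0, 0]])  # second sentence has 2 padded positions

sentence_vectors = mean_pool(hidden, mask)
print(sentence_vectors.shape)
```

With real inputs, `hidden` would be the model's last hidden states and `mask` the `attention_mask` returned by the tokenizer.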

Evaluation

The model is evaluated on the Universal Dependencies Romanian RRT treebank (UPOS, XPOS and LAS) and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given on the dedicated evaluation page.

The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased, which at the time of writing was the only available BERT model that works on Romanian.

| Model | UPOS | XPOS | NER | LAS |
|----------- |:--------: |:--------: |:--------: |:--------: |
| bert-base-multilingual-cased | 97.87 | 96.16 | 84.13 | 88.04 |
| bert-base-romanian-cased-v1 | 98.00 | 96.46 | 85.88 | 89.69 |
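
To make the comparison in the table above easier to read, the following sketch computes the absolute gain over the multilingual baseline for each task, together with the relative error reduction. It assumes all four scores are percentages on a 0 to 100 scale where higher is better; the function name `compare` is hypothetical, and the numbers are copied from the table.

```python
# Scores copied from the evaluation table (assumed to be percentages, higher is better).
baseline = {"UPOS": 97.87, "XPOS": 96.16, "NER": 84.13, "LAS": 88.04}
romanian = {"UPOS": 98.00, "XPOS": 96.46, "NER": 85.88, "LAS": 89.69}


def compare(base, new):
    """Return {task: (absolute gain in points, relative error reduction in %)}."""
    rows = {}
    for task, base_score in base.items():
        gain = new[task] - base_score
        # Share of the baseline's remaining error (100 - score) that the new model removes.
        error_reduction = 100.0 * gain / (100.0 - base_score)
        rows[task] = (round(gain, 2), round(error_reduction, 1))
    return rows


results = compare(baseline, romanian)
for task, (gain, error_reduction) in results.items():
    print(f"{task}: +{gain:.2f} points, {error_reduction:.1f}% relative error reduction")
```

The absolute gains are small on the tagging tasks, where the baseline is already close to the ceiling, and larger on NER and LAS.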

Corpus

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) |
|----------- |:--------: |:--------: |:--------: |:--------: |
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |
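
As a quick sanity check on the corpus statistics, the sketch below re-adds the per-corpus figures, compares them with the reported Total row, and derives each corpus's share of the total word count. The values are copied from the table; the variable names are illustrative only.

```python
import math

# Per corpus: (lines in millions, words in millions, characters in billions, size in GB)
corpora = {
    "OPUS": (55.05, 635.04, 4.045, 3.8),
    "OSCAR": (33.56, 1725.82, 11.411, 11.0),
    "Wikipedia": (1.54, 60.47, 0.411, 0.4),
}
reported_total = (90.15, 2421.33, 15.867, 15.2)

# Column-wise sums, rounded to the precision used in the table.
computed_total = tuple(round(sum(column), 3) for column in zip(*corpora.values()))
totals_match = all(
    math.isclose(computed, reported, abs_tol=1e-6)
    for computed, reported in zip(computed_total, reported_total)
)

# Share of the total word count contributed by each corpus, in percent.
word_share = {
    name: round(100.0 * stats[1] / reported_total[1], 1)
    for name, stats in corpora.items()
}

print("Totals match the table:", totals_match)
for name, share in word_share.items():
    print(f"{name}: {share}% of all words")
```

The per-corpus rows add up to the Total row in every column, and OSCAR contributes the majority of the words.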

Acknowledgements

  • We'd like to thank Sampo Pyysalo from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!