The BERT base, cased model for Romanian, trained on a 15GB corpus, version
    from transformers import AutoTokenizer, AutoModel
    import torch

    # load tokenizer and model
    tokenizer = AutoTokenizer.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")
    model = AutoModel.from_pretrained("dumitrescustefan/bert-base-romanian-cased-v1")

    # tokenize a sentence and run it through the model
    input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
    outputs = model(input_ids)

    # get encoding
    last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
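The snippet above encodes a single sentence and returns one vector per token. When a single vector per sentence is needed, one common option is to average the token vectors while ignoring padding. The sketch below is an illustration under that assumption, not something prescribed by the model authors; `mean_pool` and `embed_sentences` are hypothetical helper names introduced here.

```python
import torch


def mean_pool(last_hidden_state: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
    """Average token vectors per sentence, ignoring padded positions."""
    # (batch, seq_len) -> (batch, seq_len, 1), cast so it can multiply the hidden states
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)
    summed = (last_hidden_state * mask).sum(dim=1)   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)          # guard against division by zero
    return summed / counts


def embed_sentences(sentences, model_name="dumitrescustefan/bert-base-romanian-cased-v1"):
    """Hypothetical helper: return one mean-pooled vector per input sentence."""
    # Imported lazily so that mean_pool stays usable without transformers installed.
    from transformers import AutoTokenizer, AutoModel

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)
    model.eval()  # disable dropout for deterministic outputs

    # Pad to the longest sentence in the batch and return PyTorch tensors.
    batch = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():  # inference only, no gradients needed
        outputs = model(**batch)

    # The last hidden state is the first element of the model output.
    return mean_pool(outputs[0], batch["attention_mask"])
```

Calling `embed_sentences(["Acesta este un test."])` downloads the model on first use and returns a tensor with one row per sentence and one column per hidden dimension. Mean pooling is only one choice; using the vector of the first (`[CLS]`) token is another, and which works better depends on the downstream task.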
Evaluation is performed on Universal Dependencies Romanian RRT UPOS, XPOS and LAS, and on a NER task based on RONEC. Details, as well as more in-depth tests not shown here, are given in the dedicated evaluation page.
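For readers unfamiliar with these metrics: UPOS and XPOS scores are per-token tagging accuracies, and LAS (labeled attachment score) is the fraction of tokens whose predicted head and dependency label both match the gold annotation. The sketch below only illustrates these definitions on already-aligned tokens; it is not the evaluation script used for the reported results, and the function names are hypothetical.

```python
def tag_accuracy(gold_tags, predicted_tags):
    """Fraction of tokens whose predicted tag equals the gold tag (UPOS or XPOS)."""
    if len(gold_tags) != len(predicted_tags):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_tags:
        raise ValueError("cannot score an empty sequence")
    hits = sum(g == p for g, p in zip(gold_tags, predicted_tags))
    return hits / len(gold_tags)


def labeled_attachment_score(gold_arcs, predicted_arcs):
    """Fraction of tokens with both the correct head and the correct relation label.

    Each arc is a (head_index, relation_label) pair, one per token.
    """
    if len(gold_arcs) != len(predicted_arcs):
        raise ValueError("gold and predicted sequences must have the same length")
    if not gold_arcs:
        raise ValueError("cannot score an empty sequence")
    # Tuple equality checks head and label together.
    hits = sum(g == p for g, p in zip(gold_arcs, predicted_arcs))
    return hits / len(gold_arcs)


# Toy example: a three-token sentence where the last relation label is wrong.
gold = [(2, "nsubj"), (0, "root"), (2, "obj")]
pred = [(2, "nsubj"), (0, "root"), (2, "nmod")]
las = labeled_attachment_score(gold, pred)  # 2 of 3 arcs are fully correct
```

Real treebank evaluation also has to handle tokenization mismatches and multiword tokens, which this sketch deliberately leaves out.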
The baseline is the Multilingual BERT model bert-base-multilingual-(un)cased since, at the time of writing, it was the only available BERT model that worked on Romanian.
The model is trained on the following corpora (stats in the table below are after cleaning):