language: ro
tags:
  - bert
  - fill-mask
license: mit

sentence-bert-base-romanian-uncased-v1

The BERT base, uncased model for Romanian, fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian), v1.0.

How to use

from transformers import AutoTokenizer, AutoModel
import torch

# load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1", do_lower_case=True)
model = AutoModel.from_pretrained("iliemihai/sentence-bert-base-romanian-uncased-v1")  # fine-tuned weights, same repository as the tokenizer

# tokenize a sentence and run through the model
input_ids = torch.tensor(tokenizer.encode("Acesta este un test.", add_special_tokens=True)).unsqueeze(0)  # Batch size 1
outputs = model(input_ids)

# get encoding
last_hidden_states = outputs[0]  # The last hidden-state is the first element of the output tuple
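
The hidden states above contain one vector per token. A pooling step is needed to turn them into a single sentence vector. The sketch below assumes mean pooling over non-padding tokens, a common choice for sentence-BERT models; the pooling actually configured for this model is not stated here. `mean_pool` and the toy values are illustrative, and NumPy is used so the example runs without downloading the model.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token vectors per sentence, ignoring padded positions.

    last_hidden_state: array of shape (batch, seq_len, hidden)
    attention_mask:    array of shape (batch, seq_len); 1 = real token, 0 = padding
    """
    hidden = np.asarray(last_hidden_state, dtype=np.float64)
    mask = np.asarray(attention_mask, dtype=np.float64)[:, :, np.newaxis]
    summed = (hidden * mask).sum(axis=1)            # sum over real tokens only
    counts = np.clip(mask.sum(axis=1), 1e-9, None)  # avoid division by zero
    return summed / counts

# Toy input (hypothetical): one sentence, three positions, hidden size 2.
# The last position is padding and must not influence the result.
toy_hidden = [[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]]
toy_mask = [[1, 1, 0]]
sentence_vector = mean_pool(toy_hidden, toy_mask)
print(sentence_vector)  # [[2. 3.]]
```

For the single unpadded sentence in the snippet above, the inputs would be `last_hidden_states.detach().numpy()` and a mask of ones with shape (1, seq_len).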

Alternative use

from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the model
model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

# Define the sentences
sentences = [
    "Un tren își începe călătoria către destinație.",
    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
    "Un muzician cântă la un saxofon impresionant.",
    "Un saxofonist evocă melodii suave sub lumina lunii.",
    "O bucătăreasă presară condimente pe un platou cu legume.",
    "Un chef adaugă un strop de mirodenii peste o salată colorată.",
    "Un jongler își aruncă mingile colorate în aer.",
    "Un artist de circ jonglează cu măiestrie sub reflectoare.",
    "Un artist pictează un peisaj minunat pe o pânză albă.",
    "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
]

# Compute an embedding for each sentence
embeddings = model.encode(sentences)

# Compute semantic similarity as the cosine similarity between all pairs
norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = (embeddings @ embeddings.T) / (norms * norms.T)

# Print the similarity between each pair of sentences
for i in range(len(sentences)):
    for j in range(len(sentences)):
        print(f"Similarity between '{sentences[i]}' and '{sentences[j]}': {similarities[i, j]:.4f}")
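
Printing every pair produces a lot of output, including each sentence compared with itself. As a sketch, a small helper can report only the closest other sentence for each one. `most_similar_pairs` is a hypothetical helper, and the toy matrix holds made-up values, not model output.

```python
import numpy as np

def most_similar_pairs(similarities):
    """For each row i, return (i, j, score) where j != i has the highest similarity."""
    sims = np.array(similarities, dtype=np.float64)  # copy, so the input is untouched
    np.fill_diagonal(sims, -np.inf)                  # exclude self-matches
    best = sims.argmax(axis=1)
    return [(i, int(j), float(sims[i, j])) for i, j in enumerate(best)]

# Toy 3x3 cosine-similarity matrix (hypothetical values).
toy = [[1.0, 0.8, 0.1],
       [0.8, 1.0, 0.3],
       [0.1, 0.3, 1.0]]
pairs = most_similar_pairs(toy)
for i, j, score in pairs:
    print(f"{i} -> {j} ({score:.2f})")
```

Applied to the `similarities` matrix from the example above, the indices `i` and `j` refer to positions in `sentences`.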

Remember to always sanitize your text! Replace the cedilla forms of s and t with their comma-below forms:

text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

This step matters because the model was NOT trained on cedilla s and t. If you skip it, performance decreases due to <UNK>s and an increased number of tokens per word.
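
The same replacement can be wrapped in a reusable function. The sketch below uses a translation table and, as an extra assumption beyond the one-liner above, first applies Unicode NFC normalization so that an "s" or "t" followed by a combining cedilla is also caught. `sanitize` is an illustrative name.

```python
import unicodedata

# Cedilla forms (keys) mapped to comma-below forms (values).
_CEDILLA_TO_COMMA = str.maketrans({
    "\u015f": "\u0219",  # ş -> ș
    "\u0163": "\u021b",  # ţ -> ț
    "\u015e": "\u0218",  # Ş -> Ș
    "\u0162": "\u021a",  # Ţ -> Ț
})

def sanitize(text: str) -> str:
    """Compose combining marks (NFC), then replace cedilla s/t with comma-below s/t."""
    return unicodedata.normalize("NFC", text).translate(_CEDILLA_TO_COMMA)

sample = "\u015ei \u0163ar\u0103"  # "Şi ţară" written with cedilla letters
cleaned = sanitize(sample)
print(cleaned)
```

Running `sanitize` on text that already uses comma-below letters leaves it unchanged, so it is safe to apply to every input before encoding.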

Parameters:

| Parameter | Value |
|---|---|
| Batch size | 16 |
| Training steps | 256k |
| Warmup steps | 500 |
| Uncased | True |
| Max. Seq. Length | 512 |
| Loss function | Contrastive Loss |
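
The table names the loss but not its exact variant, distance, or margin. As a rough illustration only, the sketch below assumes the common pairwise formulation with cosine distance d = 1 - cos(u, v): a similar pair (label 1) is penalized by d², and a dissimilar pair (label 0) is penalized only when it falls inside the margin. The margin of 0.5 and the function names are assumptions, not values taken from the training setup.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity of two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def contrastive_loss(u, v, label, margin=0.5):
    """Pairwise contrastive loss; margin=0.5 is an assumed value."""
    d = cosine_distance(u, v)
    if label == 1:
        return 0.5 * d ** 2                    # pull similar pairs together
    return 0.5 * max(0.0, margin - d) ** 2     # push dissimilar pairs past the margin

a, b = [1.0, 0.0], [0.0, 1.0]  # orthogonal toy vectors, cosine distance 1
print(contrastive_loss(a, a, label=1))  # identical, labeled similar
print(contrastive_loss(a, b, label=1))  # far apart, labeled similar
print(contrastive_loss(a, b, label=0))  # far apart, labeled dissimilar
print(contrastive_loss(a, a, label=0))  # identical, labeled dissimilar
```

Under these assumptions, only the second and fourth calls produce a non-zero loss: the pair's distance disagrees with its label in both cases.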

Evaluation

Evaluation is performed on the Romanian STSb dataset.

| Model | Spearman | Pearson |
|---|---|---|
| bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 |
| sentence-bert-base-romanian-uncased-v1 | 0.84 | 0.84 |
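
The evaluation script is not included here. As a sketch of what the two columns measure, the code below computes the Pearson correlation between gold similarity scores and predicted similarities, and the Spearman correlation as the Pearson correlation of their ranks. The function names and the toy scores are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical gold STS scores (0-5) and predicted cosine similarities.
gold = [0.0, 1.5, 3.0, 4.2, 5.0]
predicted = [0.10, 0.35, 0.30, 0.80, 0.95]
print(f"Spearman: {spearman(gold, predicted):.4f}")
print(f"Pearson:  {pearson(gold, predicted):.4f}")
```

Spearman depends only on the ordering of the predictions, so it is unaffected by any monotonic rescaling of the cosine similarities, while Pearson is sensitive to how linear the relationship is.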

Corpus

Pretraining

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus | Lines (M) | Words (M) | Chars (B) | Size (GB) |
|---|---|---|---|---|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 |
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 |
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 |
| Total | 90.15 | 2421.33 | 15.867 | 15.2 |

Finetuning

The model is fine-tuned on the RO_MNLI dataset: the entire MNLI dataset was translated from English to Romanian, and only the contradiction and entailment pairs were kept (~256k sentence pairs).
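
The selection step described above can be sketched as a simple filter. The field names (`premise`, `hypothesis`, `label`), the mapping of entailment to 1 and contradiction to 0, and the example sentences are assumptions for illustration; the actual RO_MNLI preprocessing is not shown in this card.

```python
# Labels kept for fine-tuning, mapped to a binary target (assumed mapping).
KEPT_LABELS = {"entailment": 1, "contradiction": 0}

def select_pairs(examples):
    """Keep entailment/contradiction pairs and drop everything else (e.g. neutral)."""
    selected = []
    for ex in examples:
        target = KEPT_LABELS.get(ex["label"])
        if target is None:
            continue
        selected.append((ex["premise"], ex["hypothesis"], target))
    return selected

# Hypothetical examples in the style of translated MNLI.
examples = [
    {"premise": "Un bărbat cântă la chitară.", "hypothesis": "Un om cântă la un instrument.", "label": "entailment"},
    {"premise": "Un bărbat cântă la chitară.", "hypothesis": "Bărbatul este profesor.", "label": "neutral"},
    {"premise": "Un bărbat cântă la chitară.", "hypothesis": "Nimeni nu cântă.", "label": "contradiction"},
]
pairs = select_pairs(examples)
for premise, hypothesis, target in pairs:
    print(target, "|", premise, "|", hypothesis)
```

Dropping the neutral pairs leaves a two-class signal (similar or dissimilar), which fits a pairwise loss such as the contrastive loss listed in the parameters.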

Citation

Paper coming soon

Acknowledgements