|
--- |
|
language: ro |
|
tags: |
|
- bert |
|
- fill-mask |
|
license: mit |
|
--- |
|
|
|
# sentence-bert-base-romanian-uncased-v1 |
|
|
|
The BERT **base**, **uncased** model for Romanian, finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian). ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
|
|
|
### How to use |
|
|
|
```python |
|
from sentence_transformers import SentenceTransformer |
|
import numpy as np |
|
|
|
# Initialize the model
|
model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1") |
|
|
|
# Define the input sentences
|
sentences = [ |
|
"Un tren își începe călătoria către destinație.", |
|
"O locomotivă pornește zgomotos spre o stație îndepărtată.", |
|
"Un muzician cântă la un saxofon impresionant.", |
|
"Un saxofonist evocă melodii suave sub lumina lunii.", |
|
"O bucătăreasă presară condimente pe un platou cu legume.", |
|
"Un chef adaugă un strop de mirodenii peste o salată colorată.", |
|
"Un jongler își aruncă mingile colorate în aer.", |
|
"Un artist de circ jonglează cu măiestrie sub reflectoare.", |
|
"Un artist pictează un peisaj minunat pe o pânză albă.", |
|
"Un pictor redă frumusețea naturii pe pânza sa strălucitoare." |
|
] |
|
|
|
# Compute an embedding for each sentence
|
embeddings = model.encode(sentences) |
|
|
|
# Compute the pairwise cosine-similarity matrix between all embeddings

norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
similarities = (embeddings @ embeddings.T) / (norms * norms.T)
|
|
|
# For each sentence, find the most similar other sentence; subtracting the
# identity matrix zeroes out the diagonal so a sentence never matches itself

most_similar_indices = np.argmax(similarities - np.eye(len(sentences)), axis=1)
|
|
|
most_similar_sentences = [
    (sentences[i], sentences[j], similarities[i, j])
    for i, j in enumerate(most_similar_indices)
]
|
|
|
print(most_similar_sentences) |
|
``` |
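
Equivalently, `sentence_transformers.util.cos_sim` computes the same pairwise cosine-similarity matrix in one call (it returns a torch tensor, hence the `.numpy()`):

```python
from sentence_transformers import util

# Same pairwise cosine-similarity matrix as the NumPy version above
similarities = util.cos_sim(embeddings, embeddings).numpy()
```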
|
|
|
Remember to always sanitize your text! Replace the ``s`` and ``t`` cedilla letters with their comma-below counterparts:
|
```python
|
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș") |
|
``` |
|
because the model was **NOT** trained on the cedilla variants of ``s`` and ``t``. If you skip this step, performance drops because of ``<UNK>`` tokens and an increased number of tokens per word.
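
As a convenience, the replacement can be wrapped in a small helper applied before encoding. This is a minimal sketch; the `sanitize` name is ours, not part of the model:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

def sanitize(text: str) -> str:
    """Replace cedilla s/t with the comma-below letters the model was trained on."""
    return text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")

# Cedilla-encoded input, as often produced by legacy Romanian text
raw_sentences = ["Un tren îşi începe călătoria către destinaţie."]

embeddings = model.encode([sanitize(s) for s in raw_sentences])
```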
|
|
|
### Parameters
|
|
|
|
|
| Parameter | Value | |
|
|------------------|-------| |
|
| Batch size | 16 | |
|
| Training steps | 256k | |
|
| Warmup steps | 500 | |
|
| Uncased | True | |
|
| Max. Seq. Length | 512 | |
|
| Loss function | Contrastive Loss | |
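
For orientation, these hyperparameters map onto the classic `sentence-transformers` training API roughly as below. This is a sketch, not the actual training script: the base checkpoint name is an assumption, and the two `InputExample` pairs are placeholders for the RO_MNLI data described under Finetuning.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, models, losses

# Assumed base checkpoint: the pretrained Romanian uncased BERT
word_embedding = models.Transformer(
    "dumitrescustefan/bert-base-romanian-uncased-v1", max_seq_length=512
)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension())
model = SentenceTransformer(modules=[word_embedding, pooling])

# Placeholder pairs; the real data is RO_MNLI (see Finetuning below)
train_examples = [
    InputExample(texts=["Un tren pleacă.", "O locomotivă pornește."], label=1),
    InputExample(texts=["Un tren pleacă.", "Trenul stă pe loc."], label=0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,  # adjust to reach the ~256k training steps reported above
    warmup_steps=500,
)
```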
|
|
|
### Evaluation |
|
|
|
Evaluation is performed on the Romanian STSb dataset.
|
|
|
|
|
| Model | Spearman | Pearson | |
|
|--------------------------------|:-----:|:------:| |
|
| bert-base-romanian-uncased-v1 | 0.8086 | 0.8159 | |
|
| sentence-bert-base-romanian-uncased-v1 | **0.8393** | **0.8387** | |
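
The correlation scores above can be reproduced in outline as follows; `ro_sts_pairs` and `gold_scores` are placeholders for the Romanian STSb sentence pairs and their gold similarity ratings, not real data:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

# Placeholders for the Romanian STSb pairs and their gold scores (0-5 scale)
ro_sts_pairs = [
    ("O pisică doarme pe canapea.", "Un motan se odihnește pe sofa."),
    ("Un bărbat cântă la chitară.", "Cineva gătește în bucătărie."),
]
gold_scores = [4.6, 0.8]

emb_a = model.encode([a for a, _ in ro_sts_pairs])
emb_b = model.encode([b for _, b in ro_sts_pairs])

# Cosine similarity for each pair
cos = np.sum(emb_a * emb_b, axis=1) / (
    np.linalg.norm(emb_a, axis=1) * np.linalg.norm(emb_b, axis=1)
)

print("Spearman:", spearmanr(cos, gold_scores).correlation)
print("Pearson:", pearsonr(cos, gold_scores)[0])
```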
|
|
|
### Corpus |
|
|
|
#### Pretraining |
|
|
|
The model is trained on the following corpora (stats in the table below are after cleaning): |
|
|
|
| Corpus | Lines(M) | Words(M) | Chars(B) | Size(GB) | |
|
|-----------|:--------:|:--------:|:--------:|:--------:| |
|
| OPUS | 55.05 | 635.04 | 4.045 | 3.8 | |
|
| OSCAR | 33.56 | 1725.82 | 11.411 | 11 | |
|
| Wikipedia | 1.54 | 60.47 | 0.411 | 0.4 | |
|
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** | |
|
|
|
#### Finetuning |
|
|
|
The model is finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian, keeping only the contradiction and entailment pairs, ~256k sentence pairs). The NLI labels are mapped to binary contrastive labels as sketched below.
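
Concretely, the NLI labels can be mapped to the binary labels that `ContrastiveLoss` expects (see the training sketch under Parameters). The rows below are illustrative, not actual RO_MNLI data:

```python
from sentence_transformers import InputExample

# Illustrative rows in (premise, hypothesis, nli_label) form
ro_mnli_rows = [
    ("Un tren pleacă din gară.", "O locomotivă își începe drumul.", "entailment"),
    ("Un tren pleacă din gară.", "Trenul rămâne pe loc.", "contradiction"),
    ("Un tren pleacă din gară.", "Afară plouă.", "neutral"),
]

# Keep only entailment/contradiction: 1 = similar pair, 0 = dissimilar pair
label_map = {"entailment": 1, "contradiction": 0}

train_examples = [
    InputExample(texts=[premise, hypothesis], label=label_map[nli_label])
    for premise, hypothesis, nli_label in ro_mnli_rows
    if nli_label in label_map
]
```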
|
|
|
### Citation |
|
|
|
Paper coming soon |
|
|
|
|
|
#### Acknowledgements |
|
|
|
- We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models! |
|
|