---
language: ro
tags:
- bert
- fill-mask
license: mit
---

# sentence-bert-base-romanian-uncased-v1

The BERT **base**, **uncased** model for Romanian, fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian) ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Initialize the model
model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

# Define the sentences
sentences = [
    "Un tren își începe călătoria către destinație.",
    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
    "Un muzician cântă la un saxofon impresionant.",
    "Un saxofonist evocă melodii suave sub lumina lunii.",
    "O bucătăreasă presară condimente pe un platou cu legume.",
    "Un chef adaugă un strop de mirodenii peste o salată colorată.",
    "Un jongler își aruncă mingile colorate în aer.",
    "Un artist de circ jonglează cu măiestrie sub reflectoare.",
    "Un artist pictează un peisaj minunat pe o pânză albă.",
    "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
]

# Compute an embedding for each sentence
embeddings = model.encode(sentences)

# Compute pairwise semantic similarity using cosine similarity
norms = np.linalg.norm(embeddings, axis=1)
similarities = np.dot(embeddings, embeddings.T) / (norms[:, np.newaxis] * norms[np.newaxis, :])

# Find the most similar sentence for each one, excluding self-similarity
most_similar_indices = np.argmax(similarities - np.eye(len(sentences)), axis=1)

most_similar_sentences = [
    (sentences[i], sentences[most_similar_indices[i]], similarities[i, most_similar_indices[i]])
    for i in range(len(sentences))
]

print(most_similar_sentences)
```
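
The same pairwise scores can also be obtained with the library's built-in helper. A minimal sketch, reusing `model` and the `sentences` list from the snippet above (`util.cos_sim` is part of `sentence-transformers`):

```python
from sentence_transformers import util

# Encode to a tensor and compute the full NxN cosine-similarity matrix
embeddings = model.encode(sentences, convert_to_tensor=True)
cosine_scores = util.cos_sim(embeddings, embeddings)
print(cosine_scores)
```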

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **NOT** trained on cedilla ``ş`` and ``ţ``. If you don't, performance will degrade due to ``<UNK>`` tokens and an increased number of tokens per word.
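
If you process raw text, it may be convenient to fold this replacement into a small helper applied before encoding. A minimal sketch (the `sanitize` name is ours, not part of the library):

```python
def sanitize(text: str) -> str:
    """Map cedilla ş/ţ to their comma-below forms before tokenization."""
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

embeddings = model.encode([sanitize(s) for s in sentences])
```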

### Parameters

| Parameter        | Value            |
|------------------|------------------|
| Batch size       | 16               |
| Training steps   | 256k             |
| Warmup steps     | 500              |
| Uncased          | True             |
| Max. seq. length | 512              |
| Loss function    | Contrastive Loss |
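
For reference, fine-tuning with these hyperparameters might look like the sketch below. This is a hedged reconstruction using the `sentence-transformers` training API, not the authors' exact script: the base checkpoint id and the placeholder pairs are assumptions.

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Assumed base checkpoint (the pretrained Romanian BERT referenced in this card)
model = SentenceTransformer("dumitrescustefan/bert-base-romanian-uncased-v1")

# Placeholder RO_MNLI-style pairs: label 1.0 for entailment, 0.0 for contradiction
train_examples = [
    InputExample(texts=["O pisică doarme.", "Un animal se odihnește."], label=1.0),
    InputExample(texts=["O pisică doarme.", "Pisica aleargă prin curte."], label=0.0),
]

train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,
)
```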

### Evaluation

Evaluation is performed on the Romanian STSb dataset.


| Model                                  |  Spearman  |  Pearson   |
|----------------------------------------|:----------:|:----------:|
| bert-base-romanian-uncased-v1          | 0.8086     | 0.8159     |
| sentence-bert-base-romanian-uncased-v1 | **0.8393** | **0.8387** |
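
A minimal sketch of how such correlation scores can be computed, assuming STSb-style sentence pairs with gold similarity scores (the two placeholder pairs below are illustrative, not from the dataset):

```python
from scipy.stats import pearsonr, spearmanr
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

# Placeholder STSb-style data: sentence pairs with gold scores in [0, 5]
pairs = [
    ("O pisică doarme.", "Un motan se odihnește."),
    ("Un copil aleargă.", "Ploaia cade peste oraș."),
]
gold = [4.2, 0.5]

emb1 = model.encode([a for a, _ in pairs], convert_to_tensor=True)
emb2 = model.encode([b for _, b in pairs], convert_to_tensor=True)
predicted = util.cos_sim(emb1, emb2).diagonal().tolist()

print("Spearman:", spearmanr(gold, predicted).correlation)
print("Pearson: ", pearsonr(gold, predicted)[0])
```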

### Corpus 

#### Pretraining

The model is trained on the following corpora (stats in the table below are after cleaning):

| Corpus    | Lines(M)  | Words(M)    | Chars(B)   | Size(GB) |
|-----------|:---------:|:-----------:|:----------:|:--------:|
| OPUS      | 55.05     | 635.04      | 4.045      | 3.8      |
| OSCAR     | 33.56     | 1725.82     | 11.411     | 11       |
| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4      |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2** |

#### Finetuning

The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian, keeping only contradiction and entailment pairs, ~256k sentence pairs).
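
A hedged sketch of how this pair selection could be reproduced from the original English MNLI via the `datasets` library (label mapping: 0 = entailment, 1 = neutral, 2 = contradiction; the translation step itself is omitted):

```python
from datasets import load_dataset

# Load the original English MNLI training split
mnli = load_dataset("multi_nli", split="train")

# Keep only entailment (0) and contradiction (2) pairs, dropping neutral (1)
pairs = mnli.filter(lambda ex: ex["label"] != 1)
print(len(pairs), "pairs kept")
```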

### Citation

Paper coming soon


#### Acknowledgements

- We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models!