---
language: ro
tags:
- bert
- fill-mask
license: mit
---

# romanian-sentence-e5-large

The **base**, **uncased** BERT sentence-embedding model for Romanian, fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian). ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

### How to use

```python
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model
model = SentenceTransformer("iliemihai/romanian-sentence-e5-large")

# Define the sentences
sentences = [
    "Un tren își începe călătoria către destinație.",
    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
    "Un muzician cântă la un saxofon impresionant.",
    "Un saxofonist evocă melodii suave sub lumina lunii.",
    "O bucătăreasă presară condimente pe un platou cu legume.",
    "Un chef adaugă un strop de mirodenii peste o salată colorată.",
    "Un jongler aruncă si prinde mingi colorate.",
    "Un artist de circ jonglează cu măiestrie sub reflectoare.",
    "Un artist pictează un peisaj minunat pe o pânză albă.",
    "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
]

# Compute an embedding for each sentence
embeddings = model.encode(sentences)

# Compute pairwise semantic similarity (cosine similarity)
norms = np.linalg.norm(embeddings, axis=1)
similarities = np.dot(embeddings, embeddings.T) / (norms[:, np.newaxis] * norms[np.newaxis, :])

# For each sentence, find the most similar other sentence (exclude self-matches)
masked = similarities.copy()
np.fill_diagonal(masked, -np.inf)
most_similar_indices = np.argmax(masked, axis=1)

most_similar_sentences = [
    (sentences[i], sentences[most_similar_indices[i]], similarities[i, most_similar_indices[i]])
    for i in range(len(sentences))
]

print(most_similar_sentences)
```

Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
```python
text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
```
because the model was **NOT** trained on cedilla ``ş`` and ``ţ``. If you skip this step, performance drops due to ``<UNK>`` tokens and an increased number of tokens per word.
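
For convenience, the replacement can be wrapped in a small helper and applied right before encoding. This is only an illustrative sketch reusing `model` and `sentences` from the example above; the `sanitize` helper is not part of the model or the library, just the one-liner above wrapped in a function.

```python
def sanitize(text: str) -> str:
    """Convert cedilla ş/ţ to the comma-below forms the model was trained on."""
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

embeddings = model.encode([sanitize(s) for s in sentences])
```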
55
+
56
+ ### Parameters:
57
+
58
+
59
+ | Parameter | Value |
60
+ |------------------|-------|
61
+ | Batch size | 16 |
62
+ | Training steps | 256k |
63
+ | Warmup steps | 500 |
64
+ | Uncased | True |
65
+ | Max. Seq. Length | 512 |
66
+ | Loss function | Contrastive Loss |
67
+
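For reference, these hyperparameters map onto a standard `sentence-transformers` fine-tuning loop. The sketch below is an assumption of how such a run could look, not the exact training script: the base checkpoint name and the two in-line example pairs are placeholders.

```python
from sentence_transformers import SentenceTransformer, InputExample, losses
from torch.utils.data import DataLoader

# Placeholder pairs; the real run used ~256k RO_MNLI pairs
# (entailment -> label 1, contradiction -> label 0).
train_examples = [
    InputExample(texts=["Un tren pornește spre gară.", "O locomotivă își începe drumul."], label=1),
    InputExample(texts=["Un tren pornește spre gară.", "Un pictor lucrează la o pânză."], label=0),
]

model = SentenceTransformer("dumitrescustefan/bert-base-romanian-uncased-v1")  # assumed base checkpoint
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)
train_loss = losses.ContrastiveLoss(model=model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=500,
)
```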

### Evaluation

Evaluation is performed on the Romanian STSb dataset.

| Model                                   | Spearman   | Pearson    |
|-----------------------------------------|:----------:|:----------:|
| bert-base-romanian-uncased-v1           | 0.8086     | 0.8159     |
| sentence-bert-base-romanian-uncased-v1  | **0.8393** | **0.8387** |
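
The scores above follow the usual STS protocol: encode both sides of each pair, compute cosine similarities, and correlate them with the gold scores. The snippet below is only a sketch of that protocol; the three pairs and their gold scores are made-up placeholders, not items from the Romanian STSb split.

```python
from sentence_transformers import SentenceTransformer
from scipy.stats import pearsonr, spearmanr
import numpy as np

# Placeholder (sentence1, sentence2, gold score in [0, 5]) triples,
# standing in for the Romanian STSb test split.
pairs = [
    ("O femeie gătește.", "O femeie prepară mâncare.", 4.6),
    ("Un bărbat cântă la chitară.", "Un bărbat cântă la un instrument.", 3.8),
    ("Un copil aleargă în parc.", "O mașină este parcată în garaj.", 0.4),
]

model = SentenceTransformer("iliemihai/romanian-sentence-e5-large")
emb1 = model.encode([s1 for s1, _, _ in pairs])
emb2 = model.encode([s2 for _, s2, _ in pairs])
gold = [score for _, _, score in pairs]

# Cosine similarity between the two sides of each pair
cos = np.sum(emb1 * emb2, axis=1) / (np.linalg.norm(emb1, axis=1) * np.linalg.norm(emb2, axis=1))

print("Spearman:", spearmanr(cos, gold).correlation)
print("Pearson:", pearsonr(cos, gold)[0])
```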

### Corpus

#### Pretraining

The model was pretrained on the following corpora (the statistics in the table below are computed after cleaning):

| Corpus    | Lines (M) | Words (M)   | Chars (B)  | Size (GB) |
|-----------|:---------:|:-----------:|:----------:|:---------:|
| OPUS      | 55.05     | 635.04      | 4.045      | 3.8       |
| OSCAR     | 33.56     | 1725.82     | 11.411     | 11        |
| Wikipedia | 1.54      | 60.47       | 0.411      | 0.4       |
| **Total** | **90.15** | **2421.33** | **15.867** | **15.2**  |

#### Finetuning

The model is fine-tuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian, keeping only the contradiction and entailment pairs, ~256k sentence pairs).
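
As an illustration of how such pairs can be derived from an MNLI-style dataset, the helper below maps entailment to a positive label and contradiction to a negative one, dropping neutral examples. This is an assumption about the preprocessing, not the authors' actual script; `to_contrastive_pairs` and the field names `premise`/`hypothesis`/`label` are hypothetical.

```python
from sentence_transformers import InputExample

def to_contrastive_pairs(examples):
    """Turn MNLI-style examples into binary contrastive pairs, dropping neutral ones."""
    label_map = {"entailment": 1, "contradiction": 0}
    pairs = []
    for ex in examples:
        if ex["label"] in label_map:
            pairs.append(InputExample(
                texts=[ex["premise"], ex["hypothesis"]],
                label=label_map[ex["label"]],
            ))
    return pairs
```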

### Citation

Paper coming soon.

#### Acknowledgements

- We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models!