iliemihai committed
Commit ee5fb9c
Parent(s): 4166f33

Update README.md

Files changed (1): README.md +41 -28
README.md CHANGED
@@ -6,9 +6,9 @@ tags:
  license: mit
  ---

- # bert-base-romanian-uncased-v1
+ # sentence-bert-base-romanian-uncased-v1

- The BERT **base**, **uncased** model for Romanian, trained on a 15GB corpus, version ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)
+ The BERT **base**, **uncased** model for Romanian, finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian) ![v1.0](https://img.shields.io/badge/v1.0-21%20Apr%202020-ff6666)

  ### How to use

@@ -28,6 +28,41 @@ outputs = model(input_ids)
  last_hidden_states = outputs[0] # The last hidden-state is the first element of the output tuple
  ```

+ Alternative use:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+ import numpy as np
+
+ # Initialize the model
+ model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")
+
+ # Define the sentences
+ sentences = [
+     "Un tren își începe călătoria către destinație.",
+     "O locomotivă pornește zgomotos spre o stație îndepărtată.",
+     "Un muzician cântă la un saxofon impresionant.",
+     "Un saxofonist evocă melodii suave sub lumina lunii.",
+     "O bucătăreasă presară condimente pe un platou cu legume.",
+     "Un chef adaugă un strop de mirodenii peste o salată colorată.",
+     "Un jongler își aruncă mingile colorate în aer.",
+     "Un artist de circ jonglează cu măiestrie sub reflectoare.",
+     "Un artist pictează un peisaj minunat pe o pânză albă.",
+     "Un pictor redă frumusețea naturii pe pânza sa strălucitoare."
+ ]
+
+ # Compute an embedding for each sentence
+ embeddings = model.encode(sentences)
+
+ # Compute semantic similarity as the pairwise cosine similarity between embeddings
+ similarities = np.dot(embeddings, embeddings.T) / (np.linalg.norm(embeddings, axis=1)[:, np.newaxis] * np.linalg.norm(embeddings, axis=1)[np.newaxis, :])
+
+ # Print the similarity for every pair of sentences
+ for i in range(len(sentences)):
+     for j in range(len(sentences)):
+         print(f"Similarity between '{sentences[i]}' and '{sentences[j]}': {similarities[i, j]:.4f}")
+ ```
+
  Remember to always sanitize your text! Replace the cedilla letters ``ş`` and ``ţ`` with their comma-below counterparts:
  ```
  text = text.replace("ţ", "ț").replace("ş", "ș").replace("Ţ", "Ț").replace("Ş", "Ș")
@@ -44,7 +79,7 @@ because the model was **NOT** trained on cedilla ``s`` and ``t``s. If you don't,
  | Warmup steps | 500 |
  | Uncased | True |
  | Max. Seq. Length | 512 |
-
+ | Loss function | Contrastive Loss |

  ### Evaluation

@@ -71,35 +106,13 @@ The model is trained on the following corpora (stats in the table below are after cleaning):

  #### Finetuning

- The model is finetune on the RO_MNLI dataset (translated entire MNLI dataset from English to Romanian).
+ The model is finetuned on the RO_MNLI dataset (the entire MNLI dataset translated from English to Romanian, keeping only the contradiction and entailment pairs; ~256k sentence pairs).

  ### Citation

- If you use this model in a research paper, I'd kindly ask you to cite the following paper:
-
- ```
- Stefan Dumitrescu, Andrei-Marius Avram, and Sampo Pyysalo. 2020. The birth of Romanian BERT. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 4324–4328, Online. Association for Computational Linguistics.
- ```
-
- or, in bibtex:
-
- ```
- @inproceedings{dumitrescu-etal-2020-birth,
-     title = "The birth of {R}omanian {BERT}",
-     author = "Dumitrescu, Stefan and
-       Avram, Andrei-Marius and
-       Pyysalo, Sampo",
-     booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2020",
-     month = nov,
-     year = "2020",
-     address = "Online",
-     publisher = "Association for Computational Linguistics",
-     url = "https://aclanthology.org/2020.findings-emnlp.387",
-     doi = "10.18653/v1/2020.findings-emnlp.387",
-     pages = "4324--4328",
- }
- ```
+ Paper coming soon.

  #### Acknowledgements

- - We'd like to thank [Sampo Pyysalo](https://github.com/spyysalo) from TurkuNLP for helping us out with the compute needed to pretrain the v1.0 BERT models. He's awesome!
+ - We'd like to thank [Stefan Dumitrescu](https://github.com/dumitrescustefan) and [Andrei Marius Avram](https://github.com/avramandrei) for pretraining the v1.0 BERT models!
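
Taken together, the updated card asks users to sanitize cedilla diacritics and then encode with `sentence-transformers`. A minimal end-to-end sketch of that flow, assuming only the model id and the replacement table from the diff above (the `sanitize` helper itself is illustrative, not part of the commit):

```python
from sentence_transformers import SentenceTransformer

def sanitize(text: str) -> str:
    # The model was NOT trained on cedilla s/t, so map them to the comma-below letters.
    return (text.replace("ţ", "ț").replace("ş", "ș")
                .replace("Ţ", "Ț").replace("Ş", "Ș"))

model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")

# Cedilla forms on purpose; sanitize() normalizes them before encoding.
sentences = ["Un tren îşi începe călătoria către destinaţie."]
embeddings = model.encode([sanitize(s) for s in sentences])
print(embeddings.shape)  # (1, embedding_dim)
```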
 
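The NumPy expression in the added example is just the pairwise cosine-similarity matrix; `sentence_transformers.util.cos_sim` computes the same values and reads more compactly. An equivalent sketch (not part of the commit):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("iliemihai/sentence-bert-base-romanian-uncased-v1")
embeddings = model.encode([
    "Un tren își începe călătoria către destinație.",
    "O locomotivă pornește zgomotos spre o stație îndepărtată.",
])

# Full pairwise cosine-similarity matrix, matching the np.dot / np.linalg.norm version.
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)
```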
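
The commit does not include the finetuning script, but the new `Contrastive Loss` table row together with the entailment/contradiction pairs from RO_MNLI point to a standard `sentence-transformers` setup. A sketch under those assumptions; the base checkpoint and the example pairs are illustrative, and only the warmup steps and sequence length come from the training table:

```python
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses

# Illustrative RO_MNLI-style pairs: label 1.0 = entailment (similar), 0.0 = contradiction (dissimilar).
train_examples = [
    InputExample(texts=["Un tren pleacă din gară.", "O locomotivă își începe călătoria."], label=1.0),
    InputExample(texts=["Un tren pleacă din gară.", "Trenul staționează toată ziua."], label=0.0),
]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=16)

# Assumed starting checkpoint: the pretrained Romanian BERT this card derives from.
model = SentenceTransformer("dumitrescustefan/bert-base-romanian-uncased-v1")
model.max_seq_length = 512  # per the training table
train_loss = losses.ContrastiveLoss(model=model)

model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=500)
```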