---
base_model: FacebookAI/xlm-roberta-large
library_name: sentence-transformers
language:
- ar
- en
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mteb
model-index:
- name: omarelshehy/Arabic-English-Matryoshka-STS
  results:
  - dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 79.79480510851795
    - type: cosine_spearman
      value: 79.67609346073252
    - type: euclidean_pearson
      value: 81.64087935350051
    - type: euclidean_spearman
      value: 80.52588414802709
    - type: main_score
      value: 79.67609346073252
    - type: manhattan_pearson
      value: 81.57042957417305
    - type: manhattan_spearman
      value: 80.44331526051143
    - type: pearson
      value: 79.79480418294698
    - type: spearman
      value: 79.67609346073252
    task:
      type: STS
  - dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 82.22889478671283
    - type: cosine_spearman
      value: 83.0533648934447
    - type: euclidean_pearson
      value: 81.15891941165452
    - type: euclidean_spearman
      value: 82.14034597386936
    - type: main_score
      value: 83.0533648934447
    - type: manhattan_pearson
      value: 81.17463976232014
    - type: manhattan_spearman
      value: 82.09804987736345
    - type: pearson
      value: 82.22889389569819
    - type: spearman
      value: 83.0529662284269
    task:
      type: STS
  - dataset:
      config: en-en
      name: MTEB STS17 (en-en)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 87.17053120821998
    - type: cosine_spearman
      value: 87.05959159411456
    - type: euclidean_pearson
      value: 87.63706739480517
    - type: euclidean_spearman
      value: 87.7675347222274
    - type: main_score
      value: 87.05959159411456
    - type: manhattan_pearson
      value: 87.7006832512623
    - type: manhattan_spearman
      value: 87.80128473941168
    - type: pearson
      value: 87.17053012311975
    - type: spearman
      value: 87.05959159411456
    task:
      type: STS
---

# SentenceTransformer based on FacebookAI/xlm-roberta-large

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large). It maps sentences and paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) <!-- at revision c23d21b0620b635a76227c604d44e43a9f0ee389 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
<!-- - **Training Dataset:** Unknown -->
<!-- - **Language:** Unknown -->
<!-- - **License:** Unknown -->

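Downstream tasks compare these embedding vectors, typically by cosine similarity (the model's configured similarity function). As a minimal, self-contained illustration of that comparison, the sketch below uses plain NumPy on hand-made stand-in vectors rather than real model outputs, so it runs without downloading the model:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Stand-in 1024-d "embeddings": one base vector, one slight perturbation
# (a paraphrase would land nearby), and one pointing in a very different direction.
base = np.ones(1024)
paraphrase = base + 0.1 * np.arange(1024) / 1024.0
unrelated = np.concatenate([np.ones(512), -np.ones(512)])

print(cosine_similarity(base, paraphrase))  # close to 1.0
print(cosine_similarity(base, unrelated))   # 0.0 (orthogonal by construction)
```

With the real model, `model.similarity(...)` (shown in the Usage section) performs this comparison in batch form.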
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("omarelshehy/Arabic-English-Matryoshka-STS")
# Run inference
sentences = [
    'حب سعيد الواضح للأدب والموسيقى الغربية يتصادم باستمرار مع غضبه الصالح لما فعله الغرب للبقية.',
    'Said loves Western literature and music but is angry about what the West has done to the rest.',
    'سعيد يعتقد أن الغرب لديه أفضل من كل شيء.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# (3, 1024)

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
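The "Matryoshka" in the model name refers to Matryoshka Representation Learning (cited at the bottom of this card), which trains embeddings so that their leading dimensions remain useful on their own. The sketch below shows the usual truncate-and-renormalize recipe; note this is an illustrative assumption on my part — the card does not state which truncation sizes were trained — and random vectors stand in for real model outputs so it runs without a download:

```python
import numpy as np

# Stand-ins for model outputs: three unit-normalized 1024-d embeddings.
# (With the real model, replace this with model.encode(sentences).)
rng = np.random.default_rng(42)
embeddings = rng.standard_normal((3, 1024))
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)

def truncate_and_normalize(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components and re-normalize each row to unit length."""
    truncated = emb[:, :dim]
    return truncated / np.linalg.norm(truncated, axis=1, keepdims=True)

# Cosine similarity on 256-d truncated embeddings (dot product of unit vectors).
small = truncate_and_normalize(embeddings, 256)
similarities = small @ small.T
print(small.shape)         # (3, 256)
print(similarities.shape)  # (3, 3)
```

Recent versions of Sentence Transformers also expose this via the `truncate_dim` argument to `SentenceTransformer(...)`, which applies the same truncation at encode time.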

<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```
|