omarelshehy committed
Commit ebeede7
1 Parent(s): 1c8ff0a

Create README.md

Files changed (1): README.md (+224, -0)

README.md ADDED:
---
base_model: FacebookAI/xlm-roberta-large
library_name: sentence-transformers
metrics:
- pearson_cosine
- spearman_cosine
- pearson_manhattan
- spearman_manhattan
- pearson_euclidean
- spearman_euclidean
- pearson_dot
- spearman_dot
- pearson_max
- spearman_max
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- sentence-similarity
- feature-extraction
- mteb
model-index:
- name: omarelshehy/Arabic-English-Matryoshka-STS
  results:
  - dataset:
      config: en-ar
      name: MTEB STS17 (en-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 79.79480510851795
    - type: cosine_spearman
      value: 79.67609346073252
    - type: euclidean_pearson
      value: 81.64087935350051
    - type: euclidean_spearman
      value: 80.52588414802709
    - type: main_score
      value: 79.67609346073252
    - type: manhattan_pearson
      value: 81.57042957417305
    - type: manhattan_spearman
      value: 80.44331526051143
    - type: pearson
      value: 79.79480418294698
    - type: spearman
      value: 79.67609346073252
    task:
      type: STS
  - dataset:
      config: ar-ar
      name: MTEB STS17 (ar-ar)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 82.22889478671283
    - type: cosine_spearman
      value: 83.0533648934447
    - type: euclidean_pearson
      value: 81.15891941165452
    - type: euclidean_spearman
      value: 82.14034597386936
    - type: main_score
      value: 83.0533648934447
    - type: manhattan_pearson
      value: 81.17463976232014
    - type: manhattan_spearman
      value: 82.09804987736345
    - type: pearson
      value: 82.22889389569819
    - type: spearman
      value: 83.0529662284269
    task:
      type: STS
  - dataset:
      config: en-en
      name: MTEB STS17 (en-en)
      revision: faeb762787bd10488a50c8b5be4a3b82e411949c
      split: test
      type: mteb/sts17-crosslingual-sts
    metrics:
    - type: cosine_pearson
      value: 87.17053120821998
    - type: cosine_spearman
      value: 87.05959159411456
    - type: euclidean_pearson
      value: 87.63706739480517
    - type: euclidean_spearman
      value: 87.7675347222274
    - type: main_score
      value: 87.05959159411456
    - type: manhattan_pearson
      value: 87.7006832512623
    - type: manhattan_spearman
      value: 87.80128473941168
    - type: pearson
      value: 87.17053012311975
    - type: spearman
      value: 87.05959159411456
    task:
      type: STS
language:
- ar
- en
---

# SentenceTransformer based on FacebookAI/xlm-roberta-large

This is a [sentence-transformers](https://www.SBERT.net) model finetuned from [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large). It maps sentences & paragraphs to a 1024-dimensional dense vector space and can be used for semantic textual similarity, semantic search, paraphrase mining, text classification, clustering, and more.

## Model Details

### Model Description
- **Model Type:** Sentence Transformer
- **Base model:** [FacebookAI/xlm-roberta-large](https://huggingface.co/FacebookAI/xlm-roberta-large) <!-- at revision c23d21b0620b635a76227c604d44e43a9f0ee389 -->
- **Maximum Sequence Length:** 512 tokens
- **Output Dimensionality:** 1024 dimensions
- **Similarity Function:** Cosine Similarity
- **Languages:** Arabic (ar), English (en)
<!-- - **Training Dataset:** Unknown -->
<!-- - **License:** Unknown -->

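As a quick sanity check, the figures above can be read off the loaded model itself. The snippet below is a minimal sketch, not part of the original card, and uses standard sentence-transformers attributes:

```python
from sentence_transformers import SentenceTransformer

# Load the model and confirm the properties stated in the Model Description.
model = SentenceTransformer("omarelshehy/Arabic-English-Matryoshka-STS")
print(model.max_seq_length)                      # expected: 512
print(model.get_sentence_embedding_dimension())  # expected: 1024
```
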
## Usage

### Direct Usage (Sentence Transformers)

First install the Sentence Transformers library:

```bash
pip install -U sentence-transformers
```

Then you can load this model and run inference:
```python
from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("omarelshehy/Arabic-English-Matryoshka-STS")
# Run inference on a mixed Arabic/English triplet
sentences = [
    # "Said's evident love for Western literature and music constantly collides
    # with his righteous anger at what the West has done to the rest."
    'حب سعيد الواضح للأدب والموسيقى الغربية يتصادم باستمرار مع غضبه الصالح لما فعله الغرب للبقية.',
    'Said loves Western literature and music but is angry about what the West has done to the rest.',
    # "Said believes the West has the best of everything."
    'سعيد يعتقد أن الغرب لديه أفضل من كل شيء.',
]
embeddings = model.encode(sentences)
print(embeddings.shape)
# [3, 1024]

# Get the similarity scores for the embeddings
similarities = model.similarity(embeddings, embeddings)
print(similarities.shape)
# [3, 3]
```
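
Given the model name and the MatryoshkaLoss citation below, the embeddings should also remain usable when truncated to smaller sizes. The following is an illustrative sketch, not part of the original card: it assumes a recent sentence-transformers release (`truncate_dim` was added in 2.7, `model.similarity` in 3.0), and the 256-dimension size is an arbitrary example; quality at reduced sizes is not reported in this card.

```python
from sentence_transformers import SentenceTransformer

# Illustrative example: load the model with Matryoshka-style truncation to 256 dimensions.
model = SentenceTransformer(
    "omarelshehy/Arabic-English-Matryoshka-STS",
    truncate_dim=256,  # requires sentence-transformers >= 2.7; any size up to 1024 can be chosen
)

sentences = [
    "Said loves Western literature and music but is angry about what the West has done to the rest.",
    "سعيد يعتقد أن الغرب لديه أفضل من كل شيء.",  # "Said believes the West has the best of everything."
]
embeddings = model.encode(sentences)
print(embeddings.shape)                          # (2, 256): smaller vectors, cheaper to store and compare
print(model.similarity(embeddings, embeddings))  # cosine similarity on the truncated embeddings
```
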
<!--
### Direct Usage (Transformers)

<details><summary>Click to see the direct usage in Transformers</summary>

</details>
-->

<!--
### Downstream Usage (Sentence Transformers)

You can finetune this model on your own dataset.

<details><summary>Click to expand</summary>

</details>
-->

<!--
### Out-of-Scope Use

*List how the model may foreseeably be misused and address what users ought not to do with the model.*
-->

## Citation

### BibTeX

#### Sentence Transformers
```bibtex
@inproceedings{reimers-2019-sentence-bert,
    title = "Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks",
    author = "Reimers, Nils and Gurevych, Iryna",
    booktitle = "Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing",
    month = "11",
    year = "2019",
    publisher = "Association for Computational Linguistics",
    url = "https://arxiv.org/abs/1908.10084",
}
```

#### MatryoshkaLoss
```bibtex
@misc{kusupati2024matryoshka,
    title={Matryoshka Representation Learning},
    author={Aditya Kusupati and Gantavya Bhatt and Aniket Rege and Matthew Wallingford and Aditya Sinha and Vivek Ramanujan and William Howard-Snyder and Kaifeng Chen and Sham Kakade and Prateek Jain and Ali Farhadi},
    year={2024},
    eprint={2205.13147},
    archivePrefix={arXiv},
    primaryClass={cs.LG}
}
```

#### MultipleNegativesRankingLoss
```bibtex
@misc{henderson2017efficient,
    title={Efficient Natural Language Response Suggestion for Smart Reply},
    author={Matthew Henderson and Rami Al-Rfou and Brian Strope and Yun-hsuan Sung and Laszlo Lukacs and Ruiqi Guo and Sanjiv Kumar and Balint Miklos and Ray Kurzweil},
    year={2017},
    eprint={1705.00652},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```