---
license: mit
language:
- ru
- en
tags:
- mteb
- Sentence Transformers
- sentence-similarity
- feature-extraction
- sentence-transformers
---

# e5-large-ru

A modification of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large).

The tokenizer was shrunk to 32K tokens (Russian + English) following David Dale's [manual](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90) and with his invaluable assistance!

Thank you, David! 🥰

## Support for Sentence Transformers

Below is a usage example with `sentence_transformers`.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Nehc/e5-large-ru')

# Every input needs a "query: " or "passage: " prefix (see the FAQ).
input_texts = [
    "passage: This is an example sentence",
    "passage: Каждый охотник желает знать.",
    "query: Где сидит фазан?",
]

# normalize_embeddings=True returns unit-length vectors.
embeddings = model.encode(input_texts, normalize_embeddings=True)
```
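
Because `normalize_embeddings=True` yields unit-length vectors, the cosine similarity between two embeddings reduces to a plain dot product. The sketch below illustrates ranking passages against a query this way. It uses small made-up vectors in place of real model outputs, and `normalize` and `rank_passages` are hypothetical helper names, not part of `sentence_transformers`.

```python
import math

def normalize(vec):
    """Scale a vector to unit length, mirroring normalize_embeddings=True."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def rank_passages(query_vec, passage_vecs):
    """Return (index, score) pairs sorted by descending dot product."""
    scores = [sum(q * p for q, p in zip(query_vec, pv)) for pv in passage_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)

# Toy 3-dimensional stand-ins for real embeddings (illustrative values only).
query = normalize([1.0, 2.0, 0.0])
passages = [
    normalize([0.0, 0.0, 1.0]),  # orthogonal to the query
    normalize([1.0, 2.0, 0.1]),  # almost parallel to the query
]

ranking = rank_passages(query, passages)
best_index, best_score = ranking[0]
print(best_index, round(best_score, 4))
```

With real embeddings, the rows of `embeddings` that come from `passage: ` inputs would take the place of `passages`, and the row from the `query: ` input would take the place of `query`.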

Package requirements

`pip install sentence_transformers~=2.2.2`

Contributors: [michaelfeil](https://huggingface.co/michaelfeil)

## FAQ

**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes. The model was trained with these prefixes, so omitting them will degrade performance.

Here are some rules of thumb:

- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.

- Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.

- Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
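
The rules of thumb above can be sketched as a small lookup. This is only an illustration: `PREFIXES`, `add_prefixes`, and the task labels `"asymmetric"`, `"symmetric"`, and `"features"` are hypothetical names introduced here, not part of the model or of any library.

```python
# Hypothetical mapping from task type to the (left, right) prefixes listed above.
PREFIXES = {
    "asymmetric": ("query: ", "passage: "),  # retrieval: questions vs. documents
    "symmetric": ("query: ", "query: "),     # similarity, bitext mining, paraphrases
    "features": ("query: ", "query: "),      # classification, clustering
}

def add_prefixes(left_texts, right_texts, task):
    """Prepend the prefixes for the given task to both text lists."""
    left_prefix, right_prefix = PREFIXES[task]
    return (
        [left_prefix + text for text in left_texts],
        [right_prefix + text for text in right_texts],
    )

queries, passages = add_prefixes(
    ["Где сидит фазан?"],
    ["Каждый охотник желает знать."],
    task="asymmetric",
)
print(queries, passages)
```

The prefixed strings can then be passed to `model.encode` as in the usage example.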

**2. Why are my reproduced results slightly different from those reported in the model card?**

Different versions of `transformers` and `pytorch` can cause negligible but non-zero performance differences.

**3. Why are the cosine similarity scores distributed around 0.7 to 1.0?**

This is known and expected behavior, because a low temperature of 0.01 is used for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
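
To see why a narrow score range is still discriminative, note that InfoNCE divides similarities by the temperature before applying a softmax, so a gap of 0.1 in cosine similarity becomes a gap of 10 in the logits at temperature 0.01. The sketch below is a minimal illustration with made-up scores, not the training code; `softmax_with_temperature` is a hypothetical helper.

```python
import math

def softmax_with_temperature(scores, temperature):
    """Softmax over scores / temperature, computed in a numerically stable way."""
    logits = [s / temperature for s in scores]
    max_logit = max(logits)
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up cosine similarities, all inside the typical 0.7 to 1.0 band.
scores = [0.90, 0.80, 0.75]

sharp = softmax_with_temperature(scores, temperature=0.01)  # temperature from the FAQ
flat = softmax_with_temperature(scores, temperature=1.0)    # for comparison

print([round(p, 4) for p in sharp])
print([round(p, 4) for p in flat])
```

At temperature 0.01 almost all probability mass goes to the top-scoring candidate, while at temperature 1.0 the three candidates look nearly alike. The ranking is the same in both cases, which is what matters for retrieval.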

## Citation

If you find our paper or models helpful, please consider citing as follows:

```
@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}
```

## Limitations

Long texts will be truncated to at most 512 tokens.
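
One common workaround, not prescribed by this model card, is to split a long text into windows that fit the limit and embed each window separately. The sketch below works on a plain list of token ids. `chunk_token_ids` is a hypothetical helper, and the default `reserved=2` assumes the tokenizer adds two special tokens per sequence; the tokens taken up by the "passage: " prefix would also need to be counted in `reserved`.

```python
def chunk_token_ids(token_ids, max_len=512, reserved=2, overlap=0):
    """Split token ids into windows of at most max_len - reserved tokens.

    reserved: tokens set aside for special tokens and prefix (assumed value).
    overlap:  number of tokens shared between consecutive windows.
    """
    window = max_len - reserved
    if window <= 0 or not 0 <= overlap < window:
        raise ValueError("need max_len > reserved and 0 <= overlap < window")
    step = window - overlap
    chunks = []
    for start in range(0, len(token_ids), step):
        chunks.append(token_ids[start:start + window])
        if start + window >= len(token_ids):
            break  # the last window already reaches the end of the text
    return chunks

# Stand-in for real token ids: a 1200-token "document".
token_ids = list(range(1200))
chunks = chunk_token_ids(token_ids)
print([len(chunk) for chunk in chunks])
```

How the per-window embeddings are combined afterwards (for example by averaging, or by indexing each window separately) is a design choice that depends on the task.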