e5-large-ru

A modification of https://huggingface.co/intfloat/multilingual-e5-large. The tokenizer was shrunk to 32K tokens (ru+en) following David Dale's guide and with his invaluable assistance. Thank you, David! 🥰

Support for Sentence Transformers

Below is a usage example with sentence_transformers.

from sentence_transformers import SentenceTransformer

model = SentenceTransformer('Nehc/e5-large-ru')
# Every input carries a "query: " or "passage: " prefix (see the FAQ).
input_texts = ["passage: This is an example sentence", "passage: Каждый охотник желает знать.", "query: Где сидит фазан?"]
# With normalized embeddings, a dot product equals cosine similarity.
embeddings = model.encode(input_texts, normalize_embeddings=True)

Package requirements

pip install sentence_transformers~=2.2.2

Contributors: michaelfeil

FAQ

1. Do I need to add the prefixes "query: " and "passage: " to input texts?

Yes. This is how the model was trained; without the prefixes you will see a performance degradation.

Here are some rules of thumb:

  • Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.

  • Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.

  • Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
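The rules of thumb above can be sketched as a small helper. The names `as_query`, `as_passage`, and `build_retrieval_inputs` are hypothetical; they are not part of sentence_transformers and only make the prefixing step explicit.

```python
# Hypothetical helpers (not part of any library): they only prepend the
# prefixes described in the FAQ above.
QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def as_query(text: str) -> str:
    """Prefix for queries, symmetric tasks, and feature extraction."""
    return QUERY_PREFIX + text


def as_passage(text: str) -> str:
    """Prefix for the documents in asymmetric retrieval."""
    return PASSAGE_PREFIX + text


def build_retrieval_inputs(queries, passages):
    """Return one flat list: prefixed queries followed by prefixed passages."""
    return [as_query(q) for q in queries] + [as_passage(p) for p in passages]


inputs = build_retrieval_inputs(
    ["Где сидит фазан?"],
    ["Каждый охотник желает знать.", "This is an example sentence"],
)
print(inputs)
```

The resulting list can be passed to model.encode(inputs, normalize_embeddings=True) exactly like input_texts in the usage example.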

2. Why are my reproduced results slightly different from those reported in the model card?

Different versions of transformers and pytorch could cause negligible but non-zero performance differences.
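When comparing results, it can help to record which library versions were installed. The following is a minimal sketch using only the standard library; the helper name `installed_versions` and the default package list are assumptions made for illustration.

```python
from importlib import metadata


def installed_versions(packages=("sentence-transformers", "transformers", "torch")):
    """Map each package name to its installed version, or None if it is missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```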

3. Why are the cosine similarity scores distributed between 0.7 and 1.0?

This is known and expected behavior, because we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores instead of the absolute values, so this should not be an issue.
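To illustrate why only the relative order matters, here is a minimal sketch that ranks passages by cosine similarity. The vectors are made-up toy values, not real model output, and `cosine` and `rank_by_similarity` are hypothetical helper names. All scores land in a narrow high band, yet the ranking is still well defined.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(query_vec, passage_vecs):
    """Return (index, score) pairs sorted from most to least similar."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    return sorted(enumerate(scores), key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors, made up for illustration only.
query = [1.0, 0.2, 0.1]
passages = [
    [0.9, 0.4, 0.3],
    [1.0, 0.25, 0.1],
    [0.7, 0.7, 0.2],
]
ranking = rank_by_similarity(query, passages)
for index, score in ranking:
    print(index, round(score, 3))
```

With real embeddings produced with normalize_embeddings=True, the norms are already 1, so the dot product alone gives the same ranking.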

Citation

If you find our paper or models helpful, please consider citing it as follows:

@article{wang2024multilingual,
  title={Multilingual E5 Text Embeddings: A Technical Report},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2402.05672},
  year={2024}
}

Limitations

Long texts will be truncated to at most 512 tokens.
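One common workaround is to split a long text into overlapping windows and embed each window separately. The sketch below works on any token sequence; `chunk_tokens` is a hypothetical helper, and the budget of 500 tokens with an overlap of 50 is an assumption chosen to leave room for special tokens and the "query: " or "passage: " prefix, not a value taken from the model card.

```python
def chunk_tokens(tokens, max_tokens=500, overlap=50):
    """Split a token sequence into overlapping windows of at most max_tokens."""
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    step = max_tokens - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + max_tokens])
        # Stop once this window already reaches the end of the sequence.
        if start + max_tokens >= len(tokens):
            break
    return chunks


tokens = list(range(1200))  # stand-in for real token ids
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

How the per-window embeddings are combined afterwards (for example by averaging, or by indexing each window as its own passage) depends on the application.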

Model size: 337M params (Safetensors, tensor type F32)