	---
	license: mit
	language:
	- ru
	- en
	tags:
	- mteb
	- Sentence Transformers
	- sentence-similarity
	- feature-extraction
	- sentence-transformers
	---
	# e5-large-ru

	A modification of https://huggingface.co/intfloat/multilingual-e5-large.
	The tokenizer was shrunk to 32K tokens (ru+en) following David Dale's [manual](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90) and with his invaluable assistance!
	Thank you, David! 🥰

	## Support for Sentence Transformers

	Below is a usage example with `sentence_transformers`.
	```python
	from sentence_transformers import SentenceTransformer
	model = SentenceTransformer('Nehc/e5-large-ru')
	input_texts = ["passage: This is an example sentence", "passage: Каждый охотник желает знать.","query: Где сидит фазан?"]
	embeddings = model.encode(input_texts, normalize_embeddings=True)
	```

	Package requirements

	`pip install sentence_transformers~=2.2.2`

	Contributors: [michaelfeil](https://huggingface.co/michaelfeil)

	## FAQ

	1. Do I need to add the prefix "query: " and "passage: " to input texts?

	Yes. This is how the model was trained; without the prefixes you will see a performance degradation.

	Here are some rules of thumb:
	- Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.

	- Use "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, paraphrase retrieval.

	- Use "query: " prefix if you want to use embeddings as features, such as linear probing classification, clustering.
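
The rules of thumb above can be sketched as two small helpers. The names `as_query` and `as_passage` are hypothetical and are not part of `sentence_transformers`; they only prepend the prefix strings the model expects.

```python
def as_query(text: str) -> str:
    """Prefix for search queries, and for symmetric or feature-extraction inputs."""
    return "query: " + text.strip()


def as_passage(text: str) -> str:
    """Prefix for the documents being searched in asymmetric retrieval."""
    return "passage: " + text.strip()


# Asymmetric retrieval: queries and passages get different prefixes.
queries = [as_query("Где сидит фазан?")]
passages = [as_passage("Каждый охотник желает знать.")]

# Symmetric tasks (similarity, paraphrase, clustering): "query: " on every input.
pair = [
    as_query("This is an example sentence"),
    as_query("Here is a sample sentence"),
]
```

The resulting strings can be passed to `model.encode(...)` exactly like `input_texts` in the example earlier in this card.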

	2. Why are my reproduced results slightly different from those reported in the model card?

	Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.

	3. Why do the cosine similarity scores fall between 0.7 and 1.0?

	This is known and expected behavior, because we use a low temperature of 0.01 for the InfoNCE contrastive loss.

	For text embedding tasks like text retrieval or semantic similarity,
	what matters is the relative order of the scores instead of the absolute values,
	so this should not be an issue.
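
To illustrate why only the relative order matters, here is a minimal sketch in pure Python. The three-dimensional vectors are made-up toy values, not outputs of this model, and min-max rescaling is just one arbitrary monotonic transform chosen for the demonstration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy vectors standing in for embeddings (assumed values, for illustration only).
query = [1.0, 0.2, 0.1]
docs = {
    "doc_a": [0.9, 0.3, 0.1],
    "doc_b": [0.5, 0.8, 0.2],
    "doc_c": [0.7, 0.5, 0.4],
}

scores = {name: cosine(query, vec) for name, vec in docs.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

# Stretch the scores onto [0, 1]; the ordering of documents is unchanged.
lo, hi = min(scores.values()), max(scores.values())
rescaled = {name: (s - lo) / (hi - lo) for name, s in scores.items()}
ranking_rescaled = sorted(rescaled, key=rescaled.get, reverse=True)
```

Both `ranking` and `ranking_rescaled` list the documents in the same order, so a retrieval system built on the ranking is unaffected by how compressed the raw score range is.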

	## Citation

	If you find our paper or models helpful, please consider citing them as follows:

	```
	@article{wang2024multilingual,
	title={Multilingual E5 Text Embeddings: A Technical Report},
	author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
	journal={arXiv preprint arXiv:2402.05672},
	year={2024}
	}
	```

	## Limitations

	Long texts will be truncated to at most 512 tokens.
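
One common workaround, shown here only as a sketch, is to split long documents into chunks before encoding. The helper `split_into_chunks` and the `count_tokens` callable are hypothetical; the default budget of 500 is an assumed margin that leaves room for the "passage: " prefix and special tokens. The demo counts one token per word as a stand-in; in practice you would count with the model's own tokenizer, and summing per-word counts only approximates the token count of the full string.

```python
from typing import Callable, List


def split_into_chunks(
    text: str,
    count_tokens: Callable[[str], int],
    max_tokens: int = 500,
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks within a token budget."""
    chunks: List[str] = []
    current: List[str] = []
    used = 0
    for word in text.split():
        cost = count_tokens(word)
        # Start a new chunk when adding this word would exceed the budget.
        if current and used + cost > max_tokens:
            chunks.append(" ".join(current))
            current, used = [], 0
        current.append(word)
        used += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


def one_token_per_word(word: str) -> int:
    """Stand-in counter (assumption): real subword tokenizers usually yield more."""
    return 1


long_text = " ".join(f"word{i}" for i in range(1200))
chunks = split_into_chunks(long_text, one_token_per_word, max_tokens=500)
passages = ["passage: " + chunk for chunk in chunks]
```

Each entry of `passages` can then be encoded separately. Note that a single word costing more than `max_tokens` still ends up alone in an oversized chunk and would be truncated by the model.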