Nehc committed on
Commit
8e6542b
·
verified ·
1 Parent(s): ea1659f

Update README.md

  1. README.md +73 -0
README.md CHANGED
@@ -1,3 +1,76 @@
1
  ---
2
  license: mit
3
+ language:
4
+ - ru
5
+ - en
6
+ pipeline_tag: sentence-similarity
7
+ tags:
8
+ - mteb
9
+ - Sentence Transformers
10
+ - sentence-similarity
11
+ - feature-extraction
12
+ - sentence-transformers
13
  ---
14
+ # E5-large-ru
15
+
16
+ A modification of [intfloat/multilingual-e5-large](https://huggingface.co/intfloat/multilingual-e5-large).
17
+ The tokenizer was shrunk to 32K tokens (Russian and English) following David Dale's [manual](https://towardsdatascience.com/how-to-adapt-a-multilingual-t5-model-for-a-single-language-b9f94f3d9c90), with his invaluable assistance!
18
+ Thank you, David! 🥰
19
+
20
+ ## Support for Sentence Transformers
21
+
22
+ Below is a usage example with `sentence_transformers`.
23
+ ```python
24
+ from sentence_transformers import SentenceTransformer
25
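+ # Loads the base checkpoint, as in the original example; pass this repository's model id instead to use the shrunk-tokenizer model.
+ model = SentenceTransformer('intfloat/multilingual-e5-large')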
26
+ input_texts = ["passage: This is an example sentence", "passage: Каждый охотник желает знать.","query: Где сидит фазан?"]
27
+ embeddings = model.encode(input_texts, normalize_embeddings=True)
28
+ ```
29
+
30
+ Package requirements
31
+
32
+ `pip install sentence_transformers~=2.2.2`
33
+
34
+ Contributors: [michaelfeil](https://huggingface.co/michaelfeil)
35
+
36
+ ## FAQ
37
+
38
+ **1. Do I need to add the prefixes "query: " and "passage: " to input texts?**
39
+
40
+ Yes, this is how the model was trained; without the prefixes you will see a performance degradation.
41
+
42
+ Here are some rules of thumb:
43
+ - Use "query: " and "passage: " respectively for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
44
+
45
+ - Use the "query: " prefix for symmetric tasks such as semantic similarity, bitext mining, and paraphrase retrieval.
46
+
47
+ - Use the "query: " prefix if you want to use embeddings as features, for example in linear probing classification or clustering.
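
The rules of thumb above can be sketched as a few small helpers. This is only an illustration: the function names `with_prefix`, `prepare_asymmetric`, and `prepare_symmetric` are hypothetical and not part of `sentence_transformers` or of this model's code.

```python
from typing import List, Tuple

QUERY_PREFIX = "query: "
PASSAGE_PREFIX = "passage: "


def with_prefix(texts: List[str], prefix: str) -> List[str]:
    """Prepend `prefix` to every text that does not already start with it."""
    return [t if t.startswith(prefix) else prefix + t for t in texts]


def prepare_asymmetric(queries: List[str], passages: List[str]) -> Tuple[List[str], List[str]]:
    """Asymmetric tasks (e.g. passage retrieval): queries and passages get different prefixes."""
    return with_prefix(queries, QUERY_PREFIX), with_prefix(passages, PASSAGE_PREFIX)


def prepare_symmetric(texts: List[str]) -> List[str]:
    """Symmetric tasks and feature extraction: every text gets the "query: " prefix."""
    return with_prefix(texts, QUERY_PREFIX)


queries, passages = prepare_asymmetric(
    ["Где сидит фазан?"],
    ["Каждый охотник желает знать."],
)
```

The prepared lists can then be passed to `model.encode(...)` as in the usage example.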
48
+
49
+ **2. Why are my reproduced results slightly different from those reported in the model card?**
50
+
51
+ Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.
52
+
53
+ **3. Why are the cosine similarity scores distributed between 0.7 and 1.0?**
54
+
55
+ This is known and expected behavior, because a low temperature of 0.01 is used for the InfoNCE contrastive loss.
56
+
57
+ For text embedding tasks like text retrieval or semantic similarity,
58
+ what matters is the relative order of the scores instead of the absolute values,
59
+ so this should not be an issue.
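
A minimal sketch of why the ordering is what counts: the vectors below are made-up toy values, not real model outputs, and `cosine` and `rank_by_similarity` are hypothetical helper names. All three scores fall in a narrow high band, yet the ranking still separates the candidates.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_by_similarity(
    query_vec: Sequence[float], passage_vecs: Sequence[Sequence[float]]
) -> Tuple[List[int], List[float]]:
    """Return passage indices sorted from most to least similar, plus the raw scores."""
    scores = [cosine(query_vec, p) for p in passage_vecs]
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order, scores


# Toy 3-dimensional "embeddings" (illustrative assumption only).
query = [1.0, 0.2, 0.1]
passages = [
    [0.9, 0.3, 0.1],
    [0.6, 0.6, 0.5],
    [1.0, 0.1, 0.4],
]

order, scores = rank_by_similarity(query, passages)
for idx in order:
    print(f"passage {idx}: {scores[idx]:.3f}")
```

With `normalize_embeddings=True`, as in the usage example, the embeddings have unit length, so the cosine similarity reduces to a plain dot product.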
60
+
61
+ ## Citation
62
+
63
+ If you find our paper or models helpful, please consider citing as follows:
64
+
65
+ ```
66
+ @article{wang2024multilingual,
67
+ title={Multilingual E5 Text Embeddings: A Technical Report},
68
+ author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Yang, Linjun and Majumder, Rangan and Wei, Furu},
69
+ journal={arXiv preprint arXiv:2402.05672},
70
+ year={2024}
71
+ }
72
+ ```
73
+
74
+ ## Limitations
75
+
76
+ Long texts will be truncated to at most 512 tokens.
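
One common workaround, which the model card itself does not describe, is to split a long token sequence into windows and embed each window separately. The sketch below assumes the text has already been tokenized into a list of ids; `split_into_windows` and its `stride` parameter are hypothetical names. In practice the window size should be somewhat below 512 to leave room for special tokens and the "query: " or "passage: " prefix.

```python
from typing import List, Sequence

MAX_TOKENS = 512  # truncation limit stated above


def split_into_windows(
    token_ids: Sequence[int], max_len: int = MAX_TOKENS, stride: int = 0
) -> List[List[int]]:
    """Split token ids into windows of at most `max_len`, overlapping by `stride` tokens."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    if not 0 <= stride < max_len:
        raise ValueError("stride must satisfy 0 <= stride < max_len")

    step = max_len - stride
    windows: List[List[int]] = []
    for start in range(0, len(token_ids), step):
        windows.append(list(token_ids[start:start + max_len]))
        # Stop once the current window reaches the end of the sequence.
        if start + max_len >= len(token_ids):
            break
    return windows


# Stand-in token ids; a real tokenizer would produce these.
tokens = list(range(1200))
windows = split_into_windows(tokens, stride=64)
print([len(w) for w in windows])
```

How the per-window embeddings are combined afterwards (for example by averaging, or by scoring each window separately) is a design choice left open here.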