# Model details

This repository contains the files of [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) converted to **GGML** for use with the [bert.cpp backend](https://github.com/skeskinen/bert.cpp).

> - [Text Embeddings by Weakly-Supervised Contrastive Pre-training](https://arxiv.org/pdf/2212.03533.pdf)
> - Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
> - This model has 12 layers and the embedding size is 384.
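
The layer count and embedding size quoted above come from the original checkpoint's configuration; if you want to double-check them yourself, here is a minimal sketch (it queries the upstream PyTorch repository, not the GGML files in this repo):

```python
# Minimal sketch: confirm the architecture facts quoted above (12 layers, 384-dim embeddings)
# by reading the config of the original intfloat/e5-small-v2 checkpoint.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("intfloat/e5-small-v2")
print(cfg.num_hidden_layers)  # expected: 12
print(cfg.hidden_size)        # expected: 384
```
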
---

| f32 | 0.8493 | 58.69 | 0.4742 | 148.92 |
| f16 | 0.8493 | 58.78 | 0.4743 | 73.68 |
| q4_0 | 0.8482 | 134.13 | 0.4738 | 167.89 |
| q4_1 | 0.8445 | 94.88 | 0.4616 | 166.26 |

---

## FAQ

**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes, this is how the model is trained; otherwise you will see a performance degradation.

Here are some rules of thumb (a usage sketch follows the list):
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.
- Use the "query: " prefix if you want to use the embeddings as features, such as linear-probing classification or clustering.
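
As an illustration of how the prefixes are attached in practice, here is a minimal sketch against the upstream `intfloat/e5-small-v2` checkpoint with `transformers` (the example texts are made up; the same "query: "/"passage: " convention applies when you embed texts through bert.cpp):

```python
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2")

# Asymmetric retrieval: prefix the question with "query: " and the documents with "passage: ".
texts = [
    "query: how much protein should a female eat",
    "passage: The CDC recommends around 46 grams of protein per day for women.",
    "passage: Definition of summit: the highest point of a hill or mountain.",
]

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)

# Mean-pool over non-padding tokens, then L2-normalize, following the upstream e5 usage.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of the query against each passage; higher means more relevant.
scores = embeddings[:1] @ embeddings[1:].T
print(scores)
```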

**2. Why are my reproduced results slightly different from those reported in the model card?**

Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.

**3. Why do the cosine similarity scores distribute around 0.7 to 1.0?**

This is a known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
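
A tiny illustration of that point, with random unit vectors standing in for real embeddings: rank candidates by their cosine score instead of reading the absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one query embedding and four passage embeddings (L2-normalized,
# like the model's output); real e5 scores would cluster roughly in 0.7-1.0.
query = rng.normal(size=384)
query /= np.linalg.norm(query)
passages = rng.normal(size=(4, 384))
passages /= np.linalg.norm(passages, axis=1, keepdims=True)

scores = passages @ query          # cosine similarities
ranking = np.argsort(-scores)      # best match first; only the order matters
print(scores.round(3), ranking)
```
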
## Citation

If you find our paper or models helpful, please consider citing them as follows:

```
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
```

## Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.