# Model details

This repository contains the files of [intfloat/e5-small-v2](https://huggingface.co/intfloat/e5-small-v2) converted to **GGML** for use with the [bert.cpp backend](https://github.com/skeskinen/bert.cpp).

> - [Text Embeddings by Weakly-Supervised Contrastive Pre-training](https://arxiv.org/pdf/2212.03533.pdf)
> - Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei, arXiv 2022
> - This model has 12 layers and the embedding size is 384.
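
The layer count and embedding size quoted above come from the original checkpoint's configuration; if you want to double-check them yourself, here is a minimal sketch (it queries the upstream PyTorch repository, not the GGML files in this repo):

```python
# Minimal sketch: confirm the architecture facts quoted above (12 layers, 384-dim embeddings)
# by reading the config of the original intfloat/e5-small-v2 checkpoint.
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("intfloat/e5-small-v2")
print(cfg.num_hidden_layers)  # expected: 12
print(cfg.hidden_size)        # expected: 384
```
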
---

| f32 | 0.8493 | 58.69 | 0.4742 | 148.92 |
| f16 | 0.8493 | 58.78 | 0.4743 | 73.68 |
| q4_0 | 0.8482 | 134.13 | 0.4738 | 167.89 |
| q4_1 | 0.8445 | 94.88 | 0.4616 | 166.26 |

---

## FAQ

**1. Do I need to add the prefix "query: " and "passage: " to input texts?**

Yes, this is how the model is trained; otherwise you will see a performance degradation.

Here are some rules of thumb (a usage sketch follows the list):
- Use "query: " and "passage: " correspondingly for asymmetric tasks such as passage retrieval in open QA and ad-hoc information retrieval.
- Use the "query: " prefix for symmetric tasks such as semantic similarity and paraphrase retrieval.
- Use the "query: " prefix if you want to use the embeddings as features, such as linear-probing classification or clustering.
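
As an illustration of how the prefixes are attached in practice, here is a minimal sketch against the upstream `intfloat/e5-small-v2` checkpoint with `transformers` (the example texts are made up; the same "query: "/"passage: " convention applies when you embed texts through bert.cpp):

```python
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("intfloat/e5-small-v2")
model = AutoModel.from_pretrained("intfloat/e5-small-v2")

# Asymmetric retrieval: prefix the question with "query: " and the documents with "passage: ".
texts = [
    "query: how much protein should a female eat",
    "passage: The CDC recommends around 46 grams of protein per day for women.",
    "passage: Definition of summit: the highest point of a hill or mountain.",
]

batch = tokenizer(texts, max_length=512, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch)

# Mean-pool over non-padding tokens, then L2-normalize, following the upstream e5 usage.
mask = batch["attention_mask"].unsqueeze(-1)
embeddings = (outputs.last_hidden_state * mask).sum(dim=1) / mask.sum(dim=1)
embeddings = F.normalize(embeddings, p=2, dim=1)

# Cosine similarity of the query against each passage; higher means more relevant.
scores = embeddings[:1] @ embeddings[1:].T
print(scores)
```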

**2. Why are my reproduced results slightly different from those reported in the model card?**

Different versions of `transformers` and `pytorch` could cause negligible but non-zero performance differences.

**3. Why do the cosine similarity scores distribute around 0.7 to 1.0?**

This is a known and expected behavior, as we use a low temperature of 0.01 for the InfoNCE contrastive loss.

For text embedding tasks like text retrieval or semantic similarity, what matters is the relative order of the scores rather than their absolute values, so this should not be an issue.
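
A tiny illustration of that point, with random unit vectors standing in for real embeddings: rank candidates by their cosine score instead of reading the absolute values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for one query embedding and four passage embeddings (L2-normalized,
# like the model's output); real e5 scores would cluster roughly in 0.7-1.0.
query = rng.normal(size=384)
query /= np.linalg.norm(query)
passages = rng.normal(size=(4, 384))
passages /= np.linalg.norm(passages, axis=1, keepdims=True)

scores = passages @ query          # cosine similarities
ranking = np.argsort(-scores)      # best match first; only the order matters
print(scores.round(3), ranking)
```
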
## Citation

If you find our paper or models helpful, please consider citing them as follows:

```
@article{wang2022text,
  title={Text Embeddings by Weakly-Supervised Contrastive Pre-training},
  author={Wang, Liang and Yang, Nan and Huang, Xiaolong and Jiao, Binxing and Yang, Linjun and Jiang, Daxin and Majumder, Rangan and Wei, Furu},
  journal={arXiv preprint arXiv:2212.03533},
  year={2022}
}
```

## Limitations

This model only works for English texts. Long texts will be truncated to at most 512 tokens.