Sentence Similarity · sentence-transformers · English · feature-extraction · Inference Endpoints

phi0112358 committed commit 463d63a (1 parent: b68288c): Update README.md

Files changed (1): README.md (+195 −1)
---
language:
- en
pipeline_tag: sentence-similarity
tags:
- sentence-transformers
- feature-extraction
- sentence-similarity
datasets:
- flax-sentence-embeddings/stackexchange_xml
- ms_marco
- gooaq
- yahoo_answers_topics
- search_qa
- eli5
- natural_questions
- trivia_qa
- embedding-data/QQP
- embedding-data/PAQ_pairs
- embedding-data/Amazon-QA
- embedding-data/WikiAnswers
---

# multi-qa-MiniLM-L6-cos-v1-GGML
This is a [sentence-transformers](https://www.SBERT.net) model intended to be used with **bert.cpp**, which is built on Gerganov's GGML library. It maps sentences & paragraphs to a 384-dimensional dense vector space and was designed for **semantic search**. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at [SBERT.net - Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html).

## Usage (Start Server)
Using this model is easy once you have bert.cpp installed:

```sh
./build/bin/server -m models/all-MiniLM-L6-v2/ggml-model-q4_0.bin --port 8085

# bert_model_load: loading model from 'models/all-MiniLM-L6-v2/ggml-model-q4_0.bin' - please wait ...
# bert_model_load: n_vocab = 30522
# bert_model_load: n_ctx = 512
# bert_model_load: n_embd = 384
# bert_model_load: n_intermediate = 1536
# bert_model_load: n_head = 12
# bert_model_load: n_layer = 6
# bert_model_load: f16 = 2
# bert_model_load: ggml ctx size = 13.57 MB
# bert_model_load: ............ done
# bert_model_load: model size = 13.55 MB / num tensors = 101
# Server running on port 8085 with 4 threads
# Waiting for a client
```

## Usage (Start Client)
Then you can use the model like this:

```sh
python3 examples/sample_client.py 8085
# Loading texts from sample_client_texts.txt...
# Loaded 1738 lines.
# Starting with a test query "Should I get health insurance?"
# Closest texts:
# 1. Will my Medicare premiums be higher because of my higher income?
#    (similarity score: 0.4844)
# 2. Can I sign up for Medicare Part B if I am working and have health insurance through an employer?
#    (similarity score: 0.4575)
# 3. Should I sign up for Medicare Part B if I have Veterans' Benefits?
#    (similarity score: 0.4052)
# Enter a text to find similar texts (enter 'q' to quit): expensive
# Closest texts:
# 1. It is priced at $ 5,995 for an unlimited number of users tapping into the single processor , or $ 195 per user with a minimum of five users .
#    (similarity score: 0.4597)
# 2. The new system costs between $ 1.1 million and $ 22 million , depending on configuration .
#    (similarity score: 0.4547)
# 3. Each hull will cost about $ 1.4 billion , with each fully outfitted submarine costing about $ 2.2 billion , Young said .
#    (similarity score: 0.4078)
```
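
What the client does with the embeddings can be sketched in a few lines: encode the query and the candidate texts, then rank the texts by similarity score. The toy vectors below are made up for illustration and stand in for actual model output:

```python
# Toy semantic search: rank documents by dot-product similarity to a query.
# The vectors are invented; in practice they come from the embedding model.
query = [0.1, 0.9, 0.2]
docs = {
    "Will my Medicare premiums be higher?": [0.2, 0.8, 0.3],
    "The new system costs $1.1 million.": [1.0, 0.0, 0.0],
    "Should I sign up for Medicare Part B?": [0.0, 1.0, 0.1],
}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Highest dot product first, i.e. most similar text first
ranked = sorted(docs, key=lambda text: dot(query, docs[text]), reverse=True)
```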

### Converting models to ggml format
Converting models is similar to llama.cpp. Use `models/convert-to-ggml.py` to convert hf models into either f32 or f16 ggml models. Then use `./build/bin/quantize` to turn those into Q4_0 (4 bits per weight) models.

There is also `models/run_conversions.sh`, which creates all 4 versions (f32, f16, Q4_0, Q4_1) at once.
```sh
cd models
# Clone a model from hf
git clone https://huggingface.co/sentence-transformers/multi-qa-MiniLM-L6-cos-v1
# Run conversions to 4 ggml formats (f32, f16, Q4_0, Q4_1)
sh run_conversions.sh multi-qa-MiniLM-L6-cos-v1
```
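
The idea behind 4-bit quantization can be illustrated with a simplified sketch: weights are grouped into blocks, and each block is stored as small signed integers plus one scale factor. This is a toy illustration of the principle only, not the actual GGML Q4_0 bit layout:

```python
def quantize_4bit(block):
    # One scale per block; weights mapped to signed integers in [-8, 7]
    amax = max(abs(x) for x in block)
    scale = amax / 7 if amax else 1.0
    q = [max(-8, min(7, round(x / scale))) for x in block]
    return scale, q

def dequantize(scale, q):
    # Recover approximate float weights from the 4-bit codes
    return [scale * v for v in q]

block = [0.12, -0.5, 0.33, 0.07]
scale, q = quantize_4bit(block)
approx = dequantize(scale, q)
```

Quantization trades a small, bounded per-weight error (at most about one scale step) for a roughly 4x-8x reduction in model size versus f16/f32, which is why the benchmark accuracy below barely moves.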
88
+
89
+ ## Technical Details
90
+
91
+ In the following some technical details how this model must be used:
92
+
93
+ | Setting | Value |
94
+ | --- | :---: |
95
+ | Dimensions | 384 |
96
+ | Produces normalized embeddings | Yes |
97
+ | Pooling-Method | Mean pooling |
98
+ | Suitable score functions | dot-product (`util.dot_score`), cosine-similarity (`util.cos_sim`), or euclidean distance |
99
+
100
+ Note: When loaded with `sentence-transformers`, this model produces normalized embeddings with length 1. In that case, dot-product and cosine-similarity are equivalent. dot-product is preferred as it is faster. Euclidean distance is proportional to dot-product and can also be used.
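
The equivalence of the three score functions for unit-length embeddings can be checked with a small pure-Python sketch (the vectors below are made up, not actual model embeddings):

```python
import math

def norm(v):
    return math.sqrt(sum(x * x for x in v))

def normalize(v):
    n = norm(v)
    return [x / n for x in v]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Two made-up vectors standing in for normalized embeddings
a = normalize([0.3, -1.2, 0.7, 0.5])
b = normalize([0.9, 0.1, -0.4, 1.1])

dot_score = dot(a, b)
cos_score = dot(a, b) / (norm(a) * norm(b))        # same value: norms are 1
dist_sq = sum((x - y) ** 2 for x, y in zip(a, b))  # equals 2 - 2 * dot_score
```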

## Benchmarks
Running MTEB (Massive Text Embedding Benchmark) with bert.cpp vs. [sbert](https://sbert.net/) (CPU mode) gives comparable results between the two: quantization has minimal effect on accuracy, and eval time is similar to or better than sbert with batch_size=1 (bert.cpp does not support batching).

See [benchmarks](benchmarks) for more info.
### all-MiniLM-L6-v2
| Data Type | STSBenchmark | eval time | EmotionClassification | eval time |
|-----------|--------------|-----------|-----------------------|-----------|
| f32 | 0.8201 | 6.83 | 0.4082 | 11.34 |
| f16 | 0.8201 | 6.17 | 0.4085 | 10.28 |
| q4_0 | 0.8175 | 5.45 | 0.3911 | 10.63 |
| q4_1 | 0.8223 | 6.79 | 0.4027 | 11.41 |
| sbert | 0.8203 | 2.74 | 0.4085 | 5.56 |
| sbert-batchless | 0.8203 | 13.10 | 0.4085 | 15.52 |

### all-MiniLM-L12-v2
| Data Type | STSBenchmark | eval time | EmotionClassification | eval time |
|-----------|--------------|-----------|-----------------------|-----------|
| f32 | 0.8306 | 13.36 | 0.4117 | 21.23 |
| f16 | 0.8306 | 11.51 | 0.4119 | 20.08 |
| q4_0 | 0.8310 | 11.27 | 0.4183 | 20.81 |
| q4_1 | 0.8325 | 12.37 | 0.4093 | 19.38 |
| sbert | 0.8309 | 5.11 | 0.4117 | 8.93 |
| sbert-batchless | 0.8309 | 22.81 | 0.4117 | 28.04 |

### bert-base-uncased
bert-base-uncased is not a very good sentence-embedding model, but it is included here to show that bert.cpp correctly runs models that are not from SentenceTransformers. Technically, any hf model with architecture `BertModel` or `BertForMaskedLM` should work.

| Data Type | STSBenchmark | eval time | EmotionClassification | eval time |
|-----------|--------------|-----------|-----------------------|-----------|
| f32 | 0.4738 | 52.38 | 0.3361 | 88.56 |
| f16 | 0.4739 | 33.24 | 0.3361 | 55.86 |
| q4_0 | 0.4940 | 33.93 | 0.3375 | 57.82 |
| q4_1 | 0.4612 | 36.86 | 0.3318 | 59.63 |
| sbert | 0.4729 | 16.97 | 0.3527 | 28.77 |
| sbert-batchless | 0.4729 | 69.97 | 0.3526 | 79.02 |

----

## Background

The project aims to train sentence embedding models on very large sentence-level datasets using a self-supervised contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.

We developed this model during the [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organized by Hugging Face, as part of the project [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.

## Intended uses

Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space and finds relevant passages for a given query.

Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text up to 250 word pieces; it might not work well for longer text.

## Training procedure

The full training script is accessible in this current repository: `train_script.py`.

### Pre-training

We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to its model card for more detailed information about the pre-training procedure.

#### Training

We use the concatenation of multiple datasets to fine-tune our model. In total we have about 215M (question, answer) pairs. We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.
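
The weighted sampling can be sketched as follows; the dataset names and weights below are invented for illustration (the real values live in `data_config.json`):

```python
import random

# Hypothetical weights -- the actual configuration is in data_config.json
weights = {
    "wikianswers": 0.35,
    "paq": 0.30,
    "stackexchange": 0.20,
    "msmarco": 0.15,
}

def sample_dataset(rng):
    """Pick the dataset to draw the next training batch from."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]

rng = random.Random(0)
picks = [sample_dataset(rng) for _ in range(1000)]
```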

The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using mean pooling, cosine similarity as the similarity function, and a scale of 20.
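
MultipleNegativesRankingLoss treats the other answers in a batch as negatives and applies a cross-entropy loss over scaled similarities. A minimal pure-Python sketch of the per-query loss, with toy similarity values rather than real training numbers:

```python
import math

def mnr_loss(sims, scale=20.0):
    """Cross-entropy loss for one query.

    sims[0] is the similarity to the paired (positive) answer;
    sims[1:] are similarities to the in-batch negatives.
    Returns -log softmax(scale * sims)[0].
    """
    logits = [scale * s for s in sims]
    m = max(logits)  # subtract max for numerical stability
    log_sum = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_sum - logits[0]

# A well-separated positive yields a much smaller loss than a close call
easy = mnr_loss([0.9, 0.1, 0.0])
hard = mnr_loss([0.5, 0.45, 0.4])
```

The scale of 20 sharpens the softmax, so small differences in cosine similarity translate into large differences in loss.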

| Dataset | Number of training tuples |
|--------------------------------------------------------|:--------------------------:|
| [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
| [PAQ](https://github.com/facebookresearch/PAQ) Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs from all StackExchanges | 25,316,456 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
| [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
| [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
| [SearchQA](https://huggingface.co/datasets/search_qa) (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
| [ELI5](https://huggingface.co/datasets/eli5) (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate question pairs (titles) | 304,525 |
| [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Question, Duplicate_Question, Hard_Negative) triplets for Quora Question Pairs dataset | 103,663 |
| [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
| [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
| [TriviaQA](https://huggingface.co/datasets/trivia_qa) (Question, Evidence) pairs | 73,346 |
| **Total** | **214,988,242** |