saumyaagupta committed
Commit 9909ece
1 Parent(s): 3b9521e

Add new SentenceTransformer model.

1_Pooling/config.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "word_embedding_dimension": 384,
+ "pooling_mode_cls_token": false,
+ "pooling_mode_mean_tokens": true,
+ "pooling_mode_max_tokens": false,
+ "pooling_mode_mean_sqrt_len_tokens": false,
+ "pooling_mode_weightedmean_tokens": false,
+ "pooling_mode_lasttoken": false,
+ "include_prompt": true
+ }
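This pooling configuration enables mean pooling over the token embeddings (all other modes are disabled). As a minimal sketch, assuming a local checkout of this repository, the module can be loaded directly from this folder:

```python
from sentence_transformers.models import Pooling

# Sketch: reads 1_Pooling/config.json and returns a mean-pooling module
pooling = Pooling.load("1_Pooling")
print(pooling.get_pooling_mode_str())  # "mean"
```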
README.md ADDED
@@ -0,0 +1,253 @@
+ ---
+ language:
+ - en
+ library_name: sentence-transformers
+ tags:
+ - sentence-transformers
+ - feature-extraction
+ - sentence-similarity
+ - transformers
+ datasets:
+ - flax-sentence-embeddings/stackexchange_xml
+ - ms_marco
+ - gooaq
+ - yahoo_answers_topics
+ - search_qa
+ - eli5
+ - natural_questions
+ - trivia_qa
+ - embedding-data/QQP
+ - embedding-data/PAQ_pairs
+ - embedding-data/Amazon-QA
+ - embedding-data/WikiAnswers
+ pipeline_tag: sentence-similarity
+ ---
+
+ # multi-qa-MiniLM-L6-cos-v1
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences & paragraphs to a 384-dimensional dense vector space and was designed for **semantic search**. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: [SBERT.net - Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
+
+
+ ## Usage (Sentence-Transformers)
+ Using this model is straightforward once you have [sentence-transformers](https://www.SBERT.net) installed:
+
+ ```
+ pip install -U sentence-transformers
+ ```
+
+ Then you can use the model like this:
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load the model
+ model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
+
+ # Encode query and documents
+ query_emb = model.encode(query)
+ doc_emb = model.encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = util.dot_score(query_emb, doc_emb)[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
+
+
+ ## PyTorch Usage (HuggingFace Transformers)
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: first, you pass your input through the transformer model, then you apply the correct pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, AutoModel
+ import torch
+ import torch.nn.functional as F
+
+ # Mean Pooling - take the average of all tokens
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output.last_hidden_state
+     input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
+     return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
+
+
+ # Encode text
+ def encode(texts):
+     # Tokenize sentences
+     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='pt')
+
+     # Compute token embeddings
+     with torch.no_grad():
+         model_output = model(**encoded_input, return_dict=True)
+
+     # Perform pooling
+     embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+     # Normalize embeddings
+     embeddings = F.normalize(embeddings, p=2, dim=1)
+
+     return embeddings
+
+
+ # Sentences we want sentence embeddings for
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
+ model = AutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
+
+ # Encode query and docs
+ query_emb = encode(query)
+ doc_emb = encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = torch.mm(query_emb, doc_emb.transpose(0, 1))[0].cpu().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
+
+ ## TensorFlow Usage (HuggingFace Transformers)
+ Similarly to the PyTorch example above, to use the model with TensorFlow you pass your input through the transformer model, then apply the correct pooling operation on top of the contextualized word embeddings.
+
+ ```python
+ from transformers import AutoTokenizer, TFAutoModel
+ import tensorflow as tf
+
+ # Mean Pooling - take the attention mask into account for correct averaging
+ def mean_pooling(model_output, attention_mask):
+     token_embeddings = model_output.last_hidden_state
+     input_mask_expanded = tf.cast(tf.tile(tf.expand_dims(attention_mask, -1), [1, 1, token_embeddings.shape[-1]]), tf.float32)
+     return tf.math.reduce_sum(token_embeddings * input_mask_expanded, 1) / tf.math.maximum(tf.math.reduce_sum(input_mask_expanded, 1), 1e-9)
+
+
+ # Encode text
+ def encode(texts):
+     # Tokenize sentences
+     encoded_input = tokenizer(texts, padding=True, truncation=True, return_tensors='tf')
+
+     # Compute token embeddings
+     model_output = model(**encoded_input, return_dict=True)
+
+     # Perform pooling
+     embeddings = mean_pooling(model_output, encoded_input['attention_mask'])
+
+     # Normalize embeddings
+     embeddings = tf.math.l2_normalize(embeddings, axis=1)
+
+     return embeddings
+
+
+ # Sentences we want sentence embeddings for
+ query = "How many people live in London?"
+ docs = ["Around 9 Million people live in London", "London is known for its financial district"]
+
+ # Load model from HuggingFace Hub
+ tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
+ model = TFAutoModel.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
+
+ # Encode query and docs
+ query_emb = encode(query)
+ doc_emb = encode(docs)
+
+ # Compute dot score between query and all document embeddings
+ scores = (query_emb @ tf.transpose(doc_emb))[0].numpy().tolist()
+
+ # Combine docs & scores
+ doc_score_pairs = list(zip(docs, scores))
+
+ # Sort by decreasing score
+ doc_score_pairs = sorted(doc_score_pairs, key=lambda x: x[1], reverse=True)
+
+ # Output passages & scores
+ for doc, score in doc_score_pairs:
+     print(score, doc)
+ ```
+
+ ## Technical Details
+
+ Some technical details on how this model should be used:
+
+ | Setting | Value |
+ | --- | :---: |
+ | Dimensions | 384 |
+ | Produces normalized embeddings | Yes |
+ | Pooling-Method | Mean pooling |
+ | Suitable score functions | dot-product (`util.dot_score`), cosine-similarity (`util.cos_sim`), or euclidean distance |
+
+ Note: When loaded with `sentence-transformers`, this model produces normalized embeddings with length 1. In that case, dot-product and cosine-similarity are equivalent; dot-product is preferred as it is faster. Euclidean distance yields an equivalent ranking (smaller distance corresponds to higher dot-product) and can also be used.
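+
+ For example, a minimal check of this equivalence (assuming the same model and `util` helpers as in the usage example above):
+
+ ```python
+ from sentence_transformers import SentenceTransformer, util
+
+ # Sketch: on normalized (length-1) embeddings, dot-product and cosine-similarity
+ # return identical scores, so either can be used for ranking.
+ model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
+ emb = model.encode(["How many people live in London?",
+                     "Around 9 Million people live in London"], convert_to_tensor=True)
+
+ print(util.dot_score(emb[0], emb[1]).item())  # dot-product score
+ print(util.cos_sim(emb[0], emb[1]).item())    # same value up to floating-point precision
+ ```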
+
+ ----
+
+
+ ## Background
+
+ The project aims to train sentence embedding models on very large, sentence-level datasets using a self-supervised
+ contrastive learning objective: given a sentence from a pair, the model should predict which sentence, out of a set of randomly sampled other sentences, was actually paired with it in our dataset.
+
+ We developed this model during the
+ [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
+ organized by Hugging Face, as part of the project
+ [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPU v3-8s, as well as guidance from Google's Flax, JAX, and Cloud team members on efficient deep learning frameworks.
+
+ ## Intended uses
+
+ Our model is intended to be used for semantic search: it encodes queries / questions and text paragraphs in a dense vector space, and finds relevant documents for a given query.
+
+ Note that there is a limit of 512 word pieces: text longer than that will be truncated. Further note that the model was only trained on input text of up to 250 word pieces; it might not work well for longer text.
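+
+ As a minimal sketch (using the `sentence-transformers` loading shown above), the sequence-length limit can be inspected and lowered if needed:
+
+ ```python
+ from sentence_transformers import SentenceTransformer
+
+ model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
+ print(model.max_seq_length)   # 512 word pieces; longer inputs are truncated
+ model.max_seq_length = 250    # e.g. match the maximum length seen during training
+ ```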
+
+
+
+ ## Training procedure
+
+ The full training script is accessible in this current repository: `train_script.py`.
+
+ ### Pre-training
+
+ We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to its model card for more detailed information about the pre-training procedure.
+
+ #### Training
+
+ We fine-tune the model on a concatenation of multiple datasets; in total we have about 215M (question, answer) pairs.
+ We sampled each dataset with a weighted probability; the configuration is detailed in the `data_config.json` file.
+
+ The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/docs/package_reference/losses.html#multiplenegativesrankingloss) using mean pooling, cosine similarity as the similarity function, and a scale of 20.
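+
+ A simplified sketch of this setup (not the full `train_script.py`; the real run samples from all datasets listed below):
+
+ ```python
+ from torch.utils.data import DataLoader
+ from sentence_transformers import SentenceTransformer, InputExample, losses
+
+ # Start from the pre-trained MiniLM checkpoint (mean pooling is added automatically)
+ model = SentenceTransformer('nreimers/MiniLM-L6-H384-uncased')
+
+ # Placeholder data: one InputExample per (question, answer) pair
+ train_examples = [
+     InputExample(texts=["How many people live in London?",
+                         "Around 9 Million people live in London"]),
+     # ... about 215M pairs in the real training run
+ ]
+ train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=64)
+
+ # MultipleNegativesRankingLoss with cosine similarity (the default) and scale 20
+ train_loss = losses.MultipleNegativesRankingLoss(model, scale=20)
+
+ model.fit(train_objectives=[(train_dataloader, train_loss)], epochs=1, warmup_steps=1000)
+ ```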
+
+
+
+
+ | Dataset | Number of training tuples |
+ |--------------------------------------------------------|:--------------------------:|
+ | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
+ | [PAQ](https://github.com/facebookresearch/PAQ) Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs from all StackExchanges | 25,316,456 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
+ | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
+ | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
+ | [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
+ | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
+ | [SearchQA](https://huggingface.co/datasets/search_qa) (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
+ | [ELI5](https://huggingface.co/datasets/eli5) (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
+ | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions pairs (titles) | 304,525 |
+ | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
+ | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
+ | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
+ | [TriviaQA](https://huggingface.co/datasets/trivia_qa) (Question, Evidence) pairs | 73,346 |
+ | **Total** | **214,988,242** |
config.json ADDED
@@ -0,0 +1,26 @@
+ {
+ "_name_or_path": "sentence-transformers/multi-qa-MiniLM-L6-cos-v1",
+ "architectures": [
+ "BertModel"
+ ],
+ "attention_probs_dropout_prob": 0.1,
+ "classifier_dropout": null,
+ "gradient_checkpointing": false,
+ "hidden_act": "gelu",
+ "hidden_dropout_prob": 0.1,
+ "hidden_size": 384,
+ "initializer_range": 0.02,
+ "intermediate_size": 1536,
+ "layer_norm_eps": 1e-12,
+ "max_position_embeddings": 512,
+ "model_type": "bert",
+ "num_attention_heads": 12,
+ "num_hidden_layers": 6,
+ "pad_token_id": 0,
+ "position_embedding_type": "absolute",
+ "torch_dtype": "float32",
+ "transformers_version": "4.41.2",
+ "type_vocab_size": 2,
+ "use_cache": true,
+ "vocab_size": 30522
+ }
config_sentence_transformers.json ADDED
@@ -0,0 +1,10 @@
+ {
+ "__version__": {
+ "sentence_transformers": "3.0.1",
+ "transformers": "4.41.2",
+ "pytorch": "2.3.0+cu121"
+ },
+ "prompts": {},
+ "default_prompt_name": null,
+ "similarity_fn_name": null
+ }
model.safetensors ADDED
@@ -0,0 +1,3 @@
+ version https://git-lfs.github.com/spec/v1
+ oid sha256:8d90079ab2ba1c51ec3aeae13b1991b3327fe50539131ea31c48e109047478d8
+ size 90864192
modules.json ADDED
@@ -0,0 +1,20 @@
+ [
+ {
+ "idx": 0,
+ "name": "0",
+ "path": "",
+ "type": "sentence_transformers.models.Transformer"
+ },
+ {
+ "idx": 1,
+ "name": "1",
+ "path": "1_Pooling",
+ "type": "sentence_transformers.models.Pooling"
+ },
+ {
+ "idx": 2,
+ "name": "2",
+ "path": "2_Normalize",
+ "type": "sentence_transformers.models.Normalize"
+ }
+ ]
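This modules.json defines the model as a three-stage pipeline: a Transformer encoder, mean pooling, and L2 normalization. As a rough sketch (not how the library loads it internally), an equivalent model could be assembled manually like this:

```python
from sentence_transformers import SentenceTransformer, models

# Sketch of the Transformer -> Pooling -> Normalize pipeline described above
word_embedding = models.Transformer("sentence-transformers/multi-qa-MiniLM-L6-cos-v1", max_seq_length=512)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
normalize = models.Normalize()

model = SentenceTransformer(modules=[word_embedding, pooling, normalize])
print(model.encode("How many people live in London?").shape)  # (384,)
```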
sentence_bert_config.json ADDED
@@ -0,0 +1,4 @@
+ {
+ "max_seq_length": 512,
+ "do_lower_case": false
+ }
special_tokens_map.json ADDED
@@ -0,0 +1,37 @@
+ {
+ "cls_token": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "mask_token": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "pad_token": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "sep_token": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ },
+ "unk_token": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false
+ }
+ }
tokenizer.json ADDED
The diff for this file is too large to render. See raw diff
 
tokenizer_config.json ADDED
@@ -0,0 +1,64 @@
+ {
+ "added_tokens_decoder": {
+ "0": {
+ "content": "[PAD]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "100": {
+ "content": "[UNK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "101": {
+ "content": "[CLS]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "102": {
+ "content": "[SEP]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ },
+ "103": {
+ "content": "[MASK]",
+ "lstrip": false,
+ "normalized": false,
+ "rstrip": false,
+ "single_word": false,
+ "special": true
+ }
+ },
+ "clean_up_tokenization_spaces": true,
+ "cls_token": "[CLS]",
+ "do_basic_tokenize": true,
+ "do_lower_case": true,
+ "mask_token": "[MASK]",
+ "max_length": 250,
+ "model_max_length": 512,
+ "never_split": null,
+ "pad_to_multiple_of": null,
+ "pad_token": "[PAD]",
+ "pad_token_type_id": 0,
+ "padding_side": "right",
+ "sep_token": "[SEP]",
+ "stride": 0,
+ "strip_accents": null,
+ "tokenize_chinese_chars": true,
+ "tokenizer_class": "BertTokenizer",
+ "truncation_side": "right",
+ "truncation_strategy": "longest_first",
+ "unk_token": "[UNK]"
+ }
vocab.txt ADDED
The diff for this file is too large to render. See raw diff