PBusienei committed on
Commit
a4cd648
1 Parent(s): 0b249e3

Updated Readme

Files changed (1)
  1. README.md +16 -37
README.md CHANGED
@@ -7,22 +7,22 @@ tags:
7
  ---
8
 
9
  # multi-qa-MiniLM-L6-cos-v1
10
- This is a [sentence-transformers](https://www.SBERT.net) model: It maps sentences & paragraphs to a 384 dimensional dense vector space and was designed for **semantic search**. It has been trained on 215M (question, answer) pairs from diverse sources. For an introduction to semantic search, have a look at: [SBERT.net - Semantic Search](https://www.sbert.net/examples/applications/semantic-search/README.html)
11
 
12
 
13
  ## Usage (Sentence-Transformers)
14
- Using this model becomes easy when you have [sentence-transformers](https://www.SBERT.net) installed:
15
 
16
  ```
17
  pip install -U sentence-transformers
18
  ```
19
 
20
- Then you can use the model like this:
21
  ```python
22
  from sentence_transformers import SentenceTransformer, util
23
 
24
- query = "How many people live in London?"
25
- docs = ["Around 9 Million people live in London", "London is known for its financial district"]
26
 
27
  #Load the model
28
  model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
@@ -47,16 +47,19 @@ for doc, score in doc_score_pairs:
47
 
48
 
49
  ## Usage (HuggingFace Transformers)
50
- Without [sentence-transformers](https://www.SBERT.net), you can use the model like this: First, you pass your input through the transformer model, then you have to apply the correct pooling-operation on-top of the contextualized word embeddings.
 
 
51
 
52
  ```python
 
53
  from transformers import AutoTokenizer, AutoModel
54
  import torch
55
  import torch.nn.functional as F
56
 
57
  #Mean Pooling - Take average of all tokens
58
  def mean_pooling(model_output, attention_mask):
59
-     token_embeddings = model_output.last_hidden_state #First element of model_output contains all token embeddings
60
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
61
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
62
 
@@ -80,8 +83,8 @@ def encode(texts):
80
 
81
 
82
  # Sentences we want sentence embeddings for
83
- query = "How many people live in London?"
84
- docs = ["Around 9 Million people live in London", "London is known for its financial district"]
85
 
86
  # Load model from HuggingFace Hub
87
  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
@@ -126,14 +129,11 @@ Note: When loaded with `sentence-transformers`, this model produces normalized e
126
  The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
127
  contrastive learning objective: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.
128
 
129
- We developped this model during the
130
- [Community week using JAX/Flax for NLP & CV](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104),
131
- organized by Hugging Face. We developped this model as part of the project:
132
- [Train the Best Sentence Embedding Model Ever with 1B Training Pairs](https://discuss.huggingface.co/t/train-the-best-sentence-embedding-model-ever-with-1b-training-pairs/7354). We benefited from efficient hardware infrastructure to run the project: 7 TPUs v3-8, as well as intervention from Googles Flax, JAX, and Cloud team member about efficient deep learning frameworks.
133
 
134
  ## Intended uses
135
 
136
- Our model is intented to be used for semantic search: It encodes queries / questions and text paragraphs in a dense vector space. It finds relevant documents for the given passages.
137
 
138
  Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
139
 
@@ -141,11 +141,11 @@ Note that there is a limit of 512 word pieces: Text longer than that will be tru
141
 
142
  ## Training procedure
143
 
144
- The full training script is accessible in this current repository: `train_script.py`.
145
 
146
  ### Pre-training
147
 
148
- We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model. Please refer to the model card for more detailed information about the pre-training procedure.
149
 
150
  #### Training
151
 
@@ -156,24 +156,3 @@ The model was trained with [MultipleNegativesRankingLoss](https://www.sbert.net/
156
 
157
 
158
 
159
-
160
- | Dataset | Number of training tuples |
161
- |--------------------------------------------------------|:--------------------------:|
162
- | [WikiAnswers](https://github.com/afader/oqa#wikianswers-corpus) Duplicate question pairs from WikiAnswers | 77,427,422 |
163
- | [PAQ](https://github.com/facebookresearch/PAQ) Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
164
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Body) pairs from all StackExchanges | 25,316,456 |
165
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) (Title, Answer) pairs from all StackExchanges | 21,396,559 |
166
- | [MS MARCO](https://microsoft.github.io/msmarco/) Triplets (query, answer, hard_negative) for 500k queries from Bing search engine | 17,579,773 |
167
- | [GOOAQ: Open Question Answering with Diverse Answer Types](https://github.com/allenai/gooaq) (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
168
- | [Amazon-QA](http://jmcauley.ucsd.edu/data/amazon/qa/) (Question, Answer) pairs from Amazon product pages | 2,448,839
169
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
170
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Question, Answer) pairs from Yahoo Answers | 681,164 |
171
- | [Yahoo Answers](https://www.kaggle.com/soumikrakshit/yahoo-answers-dataset) (Title, Question) pairs from Yahoo Answers | 659,896 |
172
- | [SearchQA](https://huggingface.co/datasets/search_qa) (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question | 582,261 |
173
- | [ELI5](https://huggingface.co/datasets/eli5) (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
174
- | [Stack Exchange](https://huggingface.co/datasets/flax-sentence-embeddings/stackexchange_xml) Duplicate questions pairs (titles) | 304,525 |
175
- | [Quora Question Triplets](https://quoradata.quora.com/First-Quora-Dataset-Release-Question-Pairs) (Question, Duplicate_Question, Hard_Negative) triplets for Quora Questions Pairs dataset | 103,663 |
176
- | [Natural Questions (NQ)](https://ai.google.com/research/NaturalQuestions) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph | 100,231 |
177
- | [SQuAD2.0](https://rajpurkar.github.io/SQuAD-explorer/) (Question, Paragraph) pairs from SQuAD2.0 dataset | 87,599 |
178
- | [TriviaQA](https://huggingface.co/datasets/trivia_qa) (Question, Evidence) pairs | 73,346 |
179
- | **Total** | **214,988,242** |
 
7
  ---
8
 
9
  # multi-qa-MiniLM-L6-cos-v1
10
+ This is a [sentence-transformers](https://www.SBERT.net) model: it maps sentences and paragraphs to a 384-dimensional dense vector space and was designed for **semantic search**. It was trained on 215M (question, answer) pairs from diverse sources.
11
 
12
 
13
  ## Usage (Sentence-Transformers)
14
+ Using this model is straightforward once [sentence-transformers](https://www.SBERT.net) is installed:
15
 
16
  ```
17
  pip install -U sentence-transformers
18
  ```
19
 
20
+ The model can then be used as follows:
21
  ```python
22
  from sentence_transformers import SentenceTransformer, util
23
 
24
+ query = "Health Analytics?"
25
+ docs = ["The output is 3 top most similar sessions from the summit"]
26
 
27
  #Load the model
28
  model = SentenceTransformer('sentence-transformers/multi-qa-MiniLM-L6-cos-v1')
 
47
 
48
 
49
  ## Usage (HuggingFace Transformers)
50
+ Without [sentence-transformers](https://www.SBERT.net), you can use the model in two steps:
51
+ 1. Pass the input through the transformer model.
52
+ 2. Apply the correct pooling operation on top of the contextualized word embeddings.
53
 
54
  ```python
55
+
56
  from transformers import AutoTokenizer, AutoModel
57
  import torch
58
  import torch.nn.functional as F
59
 
60
  #Mean Pooling - Take average of all tokens
61
  def mean_pooling(model_output, attention_mask):
62
+     token_embeddings = model_output.last_hidden_state  # The first element of model_output contains all token embeddings
63
      input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
64
      return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)
65
 
 
83
 
84
 
85
  # Sentences we want sentence embeddings for
86
+ query = "Health Analytics?"
87
+ docs = ["The output is 3 top most similar sessions from the summit"]
88
 
89
  # Load model from HuggingFace Hub
90
  tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/multi-qa-MiniLM-L6-cos-v1")
 
129
  The project aims to train sentence embedding models on very large sentence level datasets using a self-supervised
130
  contrastive learning objective: given a sentence from a pair, the model should predict which of a set of randomly sampled other sentences was actually paired with it in our dataset.
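
The contrastive objective described above can be sketched in a few lines. This is a minimal, dependency-free illustration and not the project's training code: the function names, the toy 2-d vectors, and the scale factor of 20 are all assumptions made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def in_batch_contrastive_loss(anchors, positives, scale=20.0):
    """Mean cross-entropy where anchors[i] must pick positives[i]
    out of all positives in the batch; the others act as negatives.
    `scale` is an assumed temperature-like factor."""
    total = 0.0
    for i, anchor in enumerate(anchors):
        logits = [scale * cosine(anchor, p) for p in positives]
        # Numerically stable log-sum-exp over all candidates
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / len(anchors)


# Toy 2-d "embeddings" (hypothetical values, not model output)
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]  # positives[i] matches anchors[i]
swapped = [[0.1, 0.9], [0.9, 0.1]]  # positives[i] matches the other anchor

loss_aligned = in_batch_contrastive_loss(anchors, aligned)
loss_swapped = in_batch_contrastive_loss(anchors, swapped)
```

When each anchor is closest to its own partner the loss is near zero, and it grows large when the pairing is swapped, which is the signal that pushes paired sentences together during training.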
131
 
132
+
 
 
 
133
 
134
  ## Intended uses
135
 
136
+ The model is intended to be used for semantic search at the Nashville Analytics Summit: it encodes queries (questions) and text paragraphs into a dense vector space, so that the documents most relevant to a given query can be found.
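
As a rough sketch of the retrieval step, assuming the query and the documents have already been encoded, ranking reduces to sorting documents by similarity to the query. The helper names, session titles, and 3-d vectors below are hypothetical placeholders; real embeddings from this model have 384 dimensions.

```python
import math


def normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def rank_documents(query_emb, doc_embs, docs):
    """Return (doc, score) pairs sorted by decreasing cosine similarity."""
    q = normalize(query_emb)
    scored = []
    for doc, emb in zip(docs, doc_embs):
        d = normalize(emb)
        # Dot product of unit vectors equals cosine similarity
        score = sum(a * b for a, b in zip(q, d))
        scored.append((doc, score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents and toy embeddings (not produced by the model)
docs = ["session A", "session B", "session C"]
doc_embs = [[0.2, 0.9, 0.1], [0.8, 0.1, 0.3], [0.5, 0.5, 0.5]]
query_emb = [0.9, 0.2, 0.2]

ranking = rank_documents(query_emb, doc_embs, docs)
for doc, score in ranking:
    print(f"{score:.3f}  {doc}")
```

Taking the first three entries of `ranking` would give the three most similar sessions for a query.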
137
 
138
  Note that there is a limit of 512 word pieces: Text longer than that will be truncated. Further note that the model was just trained on input text up to 250 word pieces. It might not work well for longer text.
139
 
 
141
 
142
  ## Training procedure
143
 
144
+ The full training script is available in `train_script.py`.
145
 
146
  ### Pre-training
147
 
148
+ We use the pretrained [`nreimers/MiniLM-L6-H384-uncased`](https://huggingface.co/nreimers/MiniLM-L6-H384-uncased) model.
149
 
150
  #### Training
151
 
 
156
 
157
 
158