Multi QA MPNet base model for Semantic Search

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources.

This model uses mpnet-base.

Training Data

We fine-tuned this model on the concatenation of multiple datasets, about 215M (question, answer) pairs in total. Training used MultipleNegativesRankingLoss with mean pooling, cosine similarity as the similarity function, and a scale of 20.
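
To make the training objective more concrete, here is a simplified pure-Python sketch of the idea behind MultipleNegativesRankingLoss with cosine similarity and a scale of 20. It is not the sentence-transformers implementation: the function names and the toy 2-D vectors are made up for illustration, and hard negatives (as in the MS MARCO and Quora triplets) are ignored. Each question treats its own answer as the positive and every other answer in the batch as a negative.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def multiple_negatives_ranking_loss(questions, answers, scale=20.0):
    """Mean cross-entropy where answers[i] is the positive for questions[i]
    and all other answers in the batch act as negatives."""
    losses = []
    for i, q in enumerate(questions):
        logits = [scale * cosine(q, a) for a in answers]
        # Numerically stable log-sum-exp over the batch
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)

# Toy 2-D embeddings (illustrative values, not model output)
questions = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # answer i matches question i
swapped = [[0.1, 0.9], [0.9, 0.1]]   # answer i matches the other question

loss_aligned = multiple_negatives_ranking_loss(questions, aligned)
loss_swapped = multiple_negatives_ranking_loss(questions, swapped)
print(loss_aligned < loss_swapped)
```

The loss is small when each question is closest to its own answer and large when the pairing is swapped, which is the behavior the fine-tuning pushes the embeddings toward.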

Dataset	Number of training tuples
WikiAnswers Duplicate question pairs from WikiAnswers	77,427,422
PAQ Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia	64,371,441
Stack Exchange (Title, Body) pairs from all StackExchanges	25,316,456
Stack Exchange (Title, Answer) pairs from all StackExchanges	21,396,559
MS MARCO Triplets (query, answer, hard_negative) for 500k queries from Bing search engine	17,579,773
GOOAQ: Open Question Answering with Diverse Answer Types (query, answer) pairs for 3M Google queries and Google featured snippet	3,012,496
Amazon-QA (Question, Answer) pairs from Amazon product pages	2,448,839
Yahoo Answers (Title, Answer) pairs from Yahoo Answers	1,198,260
Yahoo Answers (Question, Answer) pairs from Yahoo Answers	681,164
Yahoo Answers (Title, Question) pairs from Yahoo Answers	659,896
SearchQA (Question, Answer) pairs for 140k questions, each with Top5 Google snippets on that question	582,261
ELI5 (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive)	325,475
Stack Exchange Duplicate question pairs (titles)	304,525
Quora Question Triplets (Question, Duplicate_Question, Hard_Negative) triplets from the Quora Question Pairs dataset	103,663
Natural Questions (NQ) (Question, Paragraph) pairs for 100k real Google queries with relevant Wikipedia paragraph	100,231
SQuAD2.0 (Question, Paragraph) pairs from SQuAD2.0 dataset	87,599
TriviaQA (Question, Evidence) pairs	73,346
Total	214,988,242

Technical Details

The following technical details describe how this model must be used:

Setting	Value
Dimensions	768
Produces normalized embeddings	Yes
Pooling-Method	Mean pooling
Suitable score functions	dot-product, cosine-similarity, or euclidean distance
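
As an illustration of the "Mean pooling" and "Produces normalized embeddings" settings, the sketch below averages per-token vectors while skipping padding positions and then rescales the result to unit length. It is a pure-Python approximation with hypothetical helper names and toy 2-D vectors; the actual model performs these steps internally on 768-dimensional tensors.

```python
import math

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors, skipping padding positions (mask == 0)."""
    dim = len(token_embeddings[0])
    summed = [0.0] * dim
    count = 0
    for vec, keep in zip(token_embeddings, attention_mask):
        if keep:
            summed = [s + x for s, x in zip(summed, vec)]
            count += 1
    return [s / max(count, 1) for s in summed]

def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

# Three toy token vectors; the last one is padding and must be ignored
tokens = [[1.0, 2.0], [3.0, 6.0], [100.0, 100.0]]
mask = [1, 1, 0]

pooled = mean_pool(tokens, mask)       # [2.0, 4.0]
embedding = l2_normalize(pooled)
length = math.sqrt(sum(x * x for x in embedding))
print(pooled, round(length, 6))
```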

Note: This model produces normalized embeddings of length 1, so dot-product and cosine-similarity are equivalent. Dot-product is preferred because it is faster. Euclidean distance can also be used: for unit-length vectors the squared Euclidean distance equals 2 - 2 * dot-product, so it yields the same ranking (a smaller distance means a higher similarity).
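
The relationship between the three score functions can be checked numerically for unit-length vectors. The following is a minimal sketch with hand-picked 2-D vectors (not model output) and helper names chosen for illustration only.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def squared_euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

a = normalize([3.0, 4.0])
b = normalize([4.0, 3.0])

d = dot(a, b)
# For unit vectors, dot-product and cosine-similarity coincide
print(abs(d - cosine(a, b)) < 1e-12)
# ...and squared Euclidean distance is 2 - 2 * dot-product
print(abs(squared_euclidean(a, b) - (2 - 2 * d)) < 1e-12)
```

Because the squared distance decreases exactly as the dot-product increases, sorting by ascending distance gives the same order as sorting by descending dot-product.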

Usage and Performance

The trained model can be used like this:

from sentence_transformers import SentenceTransformer, util

question = "That is a happy person"
contexts = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]

# Load the model
model = SentenceTransformer('navteca/multi-qa-mpnet-base-cos-v1')

# Encode question and contexts
question_emb = model.encode(question)
contexts_emb = model.encode(contexts)

# Compute dot score between question and all contexts embeddings
result = util.dot_score(question_emb, contexts_emb)[0].cpu().tolist()

print(result)

#[
#  0.60806852579116820,
#  0.94949364662170410,
#  0.29836517572402954
#]
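
For semantic search, the scores are typically used to rank the contexts. A minimal follow-up sketch, reusing the scores printed above as literal values so that it runs without downloading the model (the variable names are illustrative):

```python
contexts = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
# Scores copied from the example output above
scores = [0.6080685257911682, 0.9494936466217041, 0.29836517572402954]

# Sort contexts from most to least similar to the question
ranked = sorted(zip(contexts, scores), key=lambda pair: pair[1], reverse=True)

for text, score in ranked:
    print(f"{score:.4f}  {text}")

best = ranked[0][0]
```

Here "That is a very happy person" ranks first, followed by "That is a happy dog" and "Today is a sunny day".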