Multi QA MPNet base model for Semantic Search

This is a sentence-transformers model: it maps sentences and paragraphs to a 768-dimensional dense vector space and was designed for semantic search. It was trained on 215M (question, answer) pairs from diverse sources.

This model uses mpnet-base.

Training Data

We use a concatenation of multiple datasets to fine-tune this model. In total we have about 215M (question, answer) pairs. The model was trained with MultipleNegativesRankingLoss using mean pooling, cosine similarity as the similarity function, and a scale of 20.
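To make the loss concrete, here is an illustrative NumPy sketch of how MultipleNegativesRankingLoss scores a batch (not the actual training code; the function name and toy values are our own). Each question is paired with its answer, every other answer in the batch serves as an in-batch negative, and the scaled cosine-similarity matrix is fed into a cross-entropy whose correct class is the diagonal:

```python
import numpy as np

def mnr_loss(question_emb, answer_emb, scale=20.0):
    """MultipleNegativesRankingLoss on a toy batch (illustrative sketch).

    Question i is paired with answer i; the other answers in the batch
    act as in-batch negatives. With L2-normalized (mean-pooled)
    embeddings, the dot product equals cosine similarity.
    """
    q = question_emb / np.linalg.norm(question_emb, axis=1, keepdims=True)
    a = answer_emb / np.linalg.norm(answer_emb, axis=1, keepdims=True)
    scores = scale * (q @ a.T)  # scale of 20, as used for this model
    # Cross-entropy where the matching answer (the diagonal) is the label
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

With perfectly matched, mutually orthogonal pairs the loss is near zero; random embeddings give a larger value.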

| Dataset | Description | Number of training tuples |
|---|---|---|
| WikiAnswers | Duplicate question pairs from WikiAnswers | 77,427,422 |
| PAQ | Automatically generated (Question, Paragraph) pairs for each paragraph in Wikipedia | 64,371,441 |
| Stack Exchange | (Title, Body) pairs from all StackExchanges | 25,316,456 |
| Stack Exchange | (Title, Answer) pairs from all StackExchanges | 21,396,559 |
| MS MARCO | Triplets (query, answer, hard_negative) for 500k queries from the Bing search engine | 17,579,773 |
| GOOAQ: Open Question Answering with Diverse Answer Types | (query, answer) pairs for 3M Google queries and Google featured snippet | 3,012,496 |
| Amazon-QA | (Question, Answer) pairs from Amazon product pages | 2,448,839 |
| Yahoo Answers | (Title, Answer) pairs from Yahoo Answers | 1,198,260 |
| Yahoo Answers | (Question, Answer) pairs from Yahoo Answers | 681,164 |
| Yahoo Answers | (Title, Question) pairs from Yahoo Answers | 659,896 |
| SearchQA | (Question, Answer) pairs for 140k questions, each with the Top5 Google snippets for that question | 582,261 |
| ELI5 | (Question, Answer) pairs from Reddit ELI5 (explainlikeimfive) | 325,475 |
| Stack Exchange | Duplicate question pairs (titles) | 304,525 |
| Quora Question Triplets | (Question, Duplicate_Question, Hard_Negative) triplets for the Quora Question Pairs dataset | 103,663 |
| Natural Questions (NQ) | (Question, Paragraph) pairs for 100k real Google queries with a relevant Wikipedia paragraph | 100,231 |
| SQuAD2.0 | (Question, Paragraph) pairs from the SQuAD2.0 dataset | 87,599 |
| TriviaQA | (Question, Evidence) pairs | 73,346 |
| **Total** | | **214,988,242** |

Technical Details

The following technical details describe how this model should be used:

| Setting | Value |
|---|---|
| Dimensions | 768 |
| Produces normalized embeddings | Yes |
| Pooling method | Mean pooling |
| Suitable score functions | dot-product, cosine-similarity, or Euclidean distance |

Note: This model produces normalized embeddings of length 1, so dot-product and cosine-similarity are equivalent. Dot-product is preferred as it is faster. For normalized embeddings, Euclidean distance is monotonically related to the dot-product (it produces the same ranking) and can also be used.
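A small NumPy check (illustrative only, independent of the model) makes the note above concrete: for unit-length vectors, the dot product equals cosine similarity, and the squared Euclidean distance is 2 − 2·(u·v), so all three score functions rank results identically.

```python
import numpy as np

# Random stand-ins for two normalized 768-dim embeddings
rng = np.random.default_rng(42)
u = rng.standard_normal(768)
v = rng.standard_normal(768)
u /= np.linalg.norm(u)  # normalize to length 1
v /= np.linalg.norm(v)

dot = u @ v
cos = dot / (np.linalg.norm(u) * np.linalg.norm(v))  # norms are 1
dist_sq = np.sum((u - v) ** 2)  # squared Euclidean distance

assert np.isclose(dot, cos)               # dot == cosine for unit vectors
assert np.isclose(dist_sq, 2 - 2 * dot)   # distance is a monotone transform
```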

Usage and Performance

The trained model can be used like this:

```python
from sentence_transformers import SentenceTransformer, util

question = "That is a happy person"
contexts = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day"
]

# Load the model
model = SentenceTransformer('navteca/multi-qa-mpnet-base-cos-v1')

# Encode question and contexts
question_emb = model.encode(question)
contexts_emb = model.encode(contexts)

# Compute dot score between question and all contexts embeddings
result = util.dot_score(question_emb, contexts_emb)[0].cpu().tolist()

print(result)

# [
#   0.60806852579116820,
#   0.94949364662170410,
#   0.29836517572402954
# ]
```
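For search, the dot scores are typically turned into a ranked hit list. The sketch below does this with plain NumPy on toy 2-dimensional embeddings (hypothetical values standing in for the real 768-dimensional encoder output), which avoids downloading the model:

```python
import numpy as np

# Toy unit-length embeddings standing in for model.encode output
question_emb = np.array([0.6, 0.8])
contexts = [
    "That is a happy dog",
    "That is a very happy person",
    "Today is a sunny day",
]
contexts_emb = np.array([
    [0.8, 0.6],
    [0.6, 0.8],
    [1.0, 0.0],
])

scores = contexts_emb @ question_emb  # dot score per context
order = np.argsort(-scores)           # best match first
for rank, idx in enumerate(order, start=1):
    print(f"{rank}. {contexts[idx]} ({scores[idx]:.3f})")
```

Here the exact duplicate of the question's direction ranks first, mirroring how the real model scores "That is a very happy person" highest above.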