tags:
  - feature-extraction
pipeline_tag: feature-extraction

This model is the query encoder of the MS MARCO UniCOIL Lexical Model (Λ) from the SPAR paper:

Salient Phrase Aware Dense Retrieval: Can a Dense Retriever Imitate a Sparse One?
Xilun Chen, Kushal Lakhotia, Barlas Oğuz, Anchit Gupta, Patrick Lewis, Stan Peshterliev, Yashar Mehdad, Sonal Gupta and Wen-tau Yih
Meta AI

The associated GitHub repo is available here: https://github.com/facebookresearch/dpr-scale/tree/main/spar

This model is a BERT-base sized dense retriever trained on the MS MARCO corpus to imitate the behavior of UniCOIL, a sparse retriever. The following models are also available:

| Pretrained Model | Corpus | Teacher | Architecture | Query Encoder Path | Context Encoder Path |
|---|---|---|---|---|---|
| Wiki BM25 Λ | Wikipedia | BM25 | BERT-base | facebook/spar-wiki-bm25-lexmodel-query-encoder | facebook/spar-wiki-bm25-lexmodel-context-encoder |
| PAQ BM25 Λ | PAQ | BM25 | BERT-base | facebook/spar-paq-bm25-lexmodel-query-encoder | facebook/spar-paq-bm25-lexmodel-context-encoder |
| MARCO BM25 Λ | MS MARCO | BM25 | BERT-base | facebook/spar-marco-bm25-lexmodel-query-encoder | facebook/spar-marco-bm25-lexmodel-context-encoder |
| MARCO UniCOIL Λ | MS MARCO | UniCOIL | BERT-base | facebook/spar-marco-unicoil-lexmodel-query-encoder | facebook/spar-marco-unicoil-lexmodel-context-encoder |

Using the Lexical Model (Λ) Alone

This model should be used together with the associated context encoder, similar to the DPR model.

import torch
from transformers import AutoTokenizer, AutoModel

# The tokenizer is the same for the query and context encoder
tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Apply tokenizer
query_input = tokenizer(query, return_tensors='pt')
ctx_input = tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')

# Compute embeddings: take the last-layer hidden state of the [CLS] token
query_emb = query_encoder(**query_input).last_hidden_state[:, 0, :]
ctx_emb = context_encoder(**ctx_input).last_hidden_state[:, 0, :]

# Compute similarity scores using dot product
score1 = query_emb @ ctx_emb[0]  # 341.3268
score2 = query_emb @ ctx_emb[1]  # 340.1626
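
The two dot products above can also be computed in a single matrix product, which is convenient when ranking many contexts at once. The sketch below assumes embeddings shaped like `query_emb` (1 x d) and `ctx_emb` (n x d) from the example; the helper name `rank_contexts` and the toy 3-dimensional vectors are illustrative stand-ins, not part of the SPAR code or real model outputs.

```python
import torch


def rank_contexts(query_emb: torch.Tensor, ctx_emb: torch.Tensor):
    """Score every context against one query and sort best-first.

    query_emb: tensor of shape (1, d)
    ctx_emb:   tensor of shape (n, d)
    Returns (scores, order), where scores has shape (n,) and order holds
    the context indices sorted by descending dot-product score.
    """
    # (1, d) @ (d, n) -> (1, n); drop the leading query dimension
    scores = (query_emb @ ctx_emb.T).squeeze(0)
    order = torch.argsort(scores, descending=True)
    return scores, order


# Toy stand-in embeddings (hypothetical values, not encoder outputs)
query_emb = torch.tensor([[1.0, 0.0, 2.0]])
ctx_emb = torch.tensor([
    [0.5, 1.0, 0.0],
    [1.0, 0.0, 1.0],
    [0.0, 0.0, 0.5],
])

scores, order = rank_contexts(query_emb, ctx_emb)
print(scores.tolist())  # one score per context
print(order.tolist())   # context indices, highest score first
```

With real encoder outputs, `order[0]` would be the index of the context the model scores highest for the query.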

Using the Lexical Model (Λ) with a Base Dense Retriever as in SPAR

As Λ learns lexical matching from a sparse teacher retriever, it can be used in combination with a standard dense retriever (e.g. DPR, Contriever) to build a dense retriever that excels at both lexical and semantic matching.

In the following example, we show how to build the SPAR-Wiki model for Open-Domain Question Answering by concatenating the embeddings of DPR and the Wiki BM25 Λ.

import torch
from transformers import AutoTokenizer, AutoModel
from transformers import DPRQuestionEncoder, DPRQuestionEncoderTokenizer
from transformers import DPRContextEncoder, DPRContextEncoderTokenizer

# DPR model
dpr_ctx_tokenizer = DPRContextEncoderTokenizer.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_ctx_encoder = DPRContextEncoder.from_pretrained("facebook/dpr-ctx_encoder-multiset-base")
dpr_query_tokenizer = DPRQuestionEncoderTokenizer.from_pretrained("facebook/dpr-question_encoder-multiset-base")
dpr_query_encoder = DPRQuestionEncoder.from_pretrained("facebook/dpr-question_encoder-multiset-base")

# Wiki BM25 Λ model
lexmodel_tokenizer = AutoTokenizer.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_query_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-query-encoder')
lexmodel_context_encoder = AutoModel.from_pretrained('facebook/spar-wiki-bm25-lexmodel-context-encoder')

query = "Where was Marie Curie born?"
contexts = [
    "Maria Sklodowska, later known as Marie Curie, was born on November 7, 1867.",
    "Born in Paris on 15 May 1859, Pierre Curie was the son of Eugène Curie, a doctor of French Catholic origin from Alsace."
]

# Compute DPR embeddings
dpr_query_input = dpr_query_tokenizer(query, return_tensors='pt')['input_ids']
dpr_query_emb = dpr_query_encoder(dpr_query_input).pooler_output
dpr_ctx_input = dpr_ctx_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
dpr_ctx_emb = dpr_ctx_encoder(**dpr_ctx_input).pooler_output

# Compute Λ embeddings: take the last-layer hidden state of the [CLS] token
lexmodel_query_input = lexmodel_tokenizer(query, return_tensors='pt')
lexmodel_query_emb = lexmodel_query_encoder(**lexmodel_query_input).last_hidden_state[:, 0, :]
lexmodel_ctx_input = lexmodel_tokenizer(contexts, padding=True, truncation=True, return_tensors='pt')
lexmodel_ctx_emb = lexmodel_context_encoder(**lexmodel_ctx_input).last_hidden_state[:, 0, :]

# Form SPAR embeddings via concatenation

# The concatenation weight is only applied to query embeddings
# Refer to the SPAR paper for details
concat_weight = 0.7

spar_query_emb = torch.cat(
    [dpr_query_emb, concat_weight * lexmodel_query_emb],
    dim=-1,
)
spar_ctx_emb = torch.cat(
    [dpr_ctx_emb, lexmodel_ctx_emb],
    dim=-1,
)

# Compute similarity scores
score1 = spar_query_emb @ spar_ctx_emb[0]  # 317.6931
score2 = spar_query_emb @ spar_ctx_emb[1]  # 314.6144
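
One way to see why the weight only needs to be applied on the query side: the dot product of the concatenated vectors splits into the DPR score plus `concat_weight` times the Λ score, so scaling the query half is enough to weight the whole Λ contribution. The sketch below checks this identity numerically. It uses small random tensors as stand-ins for the real 768-dimensional embeddings (an assumption made so the snippet runs without downloading any model), and the variable names are illustrative.

```python
import torch

torch.manual_seed(0)

dim = 4          # toy size; stand-in for the encoders' hidden size
num_ctx = 2
concat_weight = 0.7

# Random stand-ins for the DPR and lexical-model embeddings
dpr_q = torch.randn(1, dim)
lex_q = torch.randn(1, dim)
dpr_c = torch.randn(num_ctx, dim)
lex_c = torch.randn(num_ctx, dim)

# Concatenate as in the example above: weight on the query side only
spar_q = torch.cat([dpr_q, concat_weight * lex_q], dim=-1)
spar_c = torch.cat([dpr_c, lex_c], dim=-1)

# Score from the concatenated embeddings
concat_scores = spar_q @ spar_c.T

# Same score written as a weighted sum of the two retrievers' scores
sum_scores = dpr_q @ dpr_c.T + concat_weight * (lex_q @ lex_c.T)

scores_match = torch.allclose(concat_scores, sum_scores, atol=1e-6)
print(scores_match)
```

Because of this equivalence, context embeddings can be indexed once without the weight, and the weight can be tuned at query time without re-encoding the corpus.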