svdr-nq

Semi-Parametric Retrieval via Binary Token Index. Jiawei Zhou, Li Dong, Furu Wei, Lei Chen, arXiv 2024

The model is BERT-based with 12 layers and an embedding size of 29,523, derived from the BERT vocabulary of 30,522 with 999 unused tokens excluded.
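The sketch below reproduces the vocabulary arithmetic from the two counts stated above (30,522 and 999). The result matches the vector width of 29523 shown in the index summaries later in this README. The constant names are illustrative only and are not part of the vsearch API.

```python
# Counts taken from the model description; the names are illustrative.
BERT_VOCAB_SIZE = 30_522  # size of the BERT vocabulary
EXCLUDED_TOKENS = 999     # unused tokens dropped from the embedding space

embedding_dim = BERT_VOCAB_SIZE - EXCLUDED_TOKENS
print(embedding_dim)  # 29523
```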

Quick Start

Clone and install the vsearch repository:

git clone git@github.com:jzhoubu/vsearch.git
cd vsearch
poetry install
poetry shell

The example below encodes a query and several passages and computes their similarity scores.

import torch
from src.ir import Retriever

query = "Who first proposed the theory of relativity?"
passages = [
    "Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.",
    "Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
    "Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system."
]

ir = Retriever.from_pretrained("vsearch/svdr-msmarco")
ir = ir.to("cuda")

# Embed the query and passages
q_emb = ir.encoder_q.embed(query)  # Shape: [1, V]
p_emb = ir.encoder_p.embed(passages)  # Shape: [3, V]

scores = q_emb @ p_emb.t()
print(scores)

# Output: 
# tensor([[97.2964, 39.7844, 37.6955]], device='cuda:0')
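The score matrix above is an inner product between vocabulary-sized vectors. As a rough illustration of that computation, the pure-Python sketch below scores toy sparse vectors stored as token-id-to-weight dictionaries. The ids and weights are invented for illustration and are not real model outputs, and `dot` is a hypothetical helper rather than a vsearch function.

```python
def dot(q, p):
    """Inner product of two sparse vectors stored as {token_id: weight}."""
    # Iterate over the smaller dict so the cost follows the sparser vector.
    if len(q) > len(p):
        q, p = p, q
    return sum((w * p[t] for t, w in q.items() if t in p), 0.0)

# Toy vectors: keys are hypothetical vocabulary ids, values are term weights.
q_vec = {7: 2.0, 42: 1.5}
p_vecs = [
    {7: 3.0, 42: 2.0, 99: 0.5},  # shares both query terms
    {7: 1.0, 13: 4.0},           # shares one query term
    {13: 2.0, 99: 1.0},          # shares none
]

scores = [dot(q_vec, p) for p in p_vecs]
print(scores)  # [9.0, 2.0, 0.0]
```

Only token ids present in both vectors contribute to a score, which is why a passage with no overlapping terms scores zero.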

Building Embedding-based Index for Search

The examples below build an index for large-scale retrieval.

# Build the sparse index for the passages
ir.build_index(passages, index_type="sparse")
print(ir.index)

# Output:
# Index Type      : SparseIndex
# Vector Type     : torch.sparse_csr
# Vector Shape    : torch.Size([3, 29523])
# Vector Device   : cuda:0
# Number of Texts : 3

# Save the index to disk
index_file = "/path/to/index.npz"
ir.save_index(index_file)

# Load the index from disk
index_file = "/path/to/index.npz"
data_file = "/path/to/texts.jsonl"
ir.load_index(index_file=index_file, data_file=data_file)

# Search top-k results for queries
queries = [query]
results = ir.retrieve(queries, k=3)
print(results)

# Output:
# SearchResults(
#   ids=tensor([[0, 1, 2]], device='cuda:0'),
#   scores=tensor([[97.2458, 39.7507, 37.6407]], device='cuda:0')
# )

query_id = 0
top1_psg_id = results.ids[query_id][0]
top1_psg = ir.index.get_sample(top1_psg_id)
print(top1_psg)
# Output:
# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.
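Conceptually, retrieval returns the k highest-scoring passage ids for each query. The sketch below shows that selection step in plain Python on toy scores. `top_k` is a hypothetical helper written for illustration, not the implementation behind `ir.retrieve`.

```python
import heapq

def top_k(scores, k):
    """Return (ids, scores) of the k largest entries, best first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda pair: pair[1])
    ids = [i for i, _ in best]
    vals = [s for _, s in best]
    return ids, vals

# Toy similarity scores for four passages (made-up values).
toy_scores = [0.2, 1.7, 0.9, 1.1]
ids, vals = top_k(toy_scores, k=2)
print(ids, vals)  # [1, 3] [1.7, 1.1]
```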

Building Bag-of-token Index for Search

Our framework supports using tokenization as an index (i.e., a bag-of-token index), which operates on CPU and reduces indexing time and storage requirements by over 90% compared to an embedding-based index.
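To give a feel for why such an index is cheap to build, the sketch below maps each passage to the set of token ids it contains, which is a binary vector over the vocabulary and needs no neural encoding. It is a simplified illustration under stated assumptions: the whitespace `tokenize` function and the on-the-fly `vocab` dictionary are hypothetical stand-ins for the BERT tokenizer and vocabulary, and `build_bot_index` is not the vsearch implementation.

```python
def tokenize(text):
    """Toy stand-in for a real tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def build_bot_index(passages, vocab):
    """Map each passage to the set of vocabulary ids it contains."""
    index = []
    for text in passages:
        # Unknown tokens get the next free id; known tokens reuse theirs.
        ids = {vocab.setdefault(tok, len(vocab)) for tok in tokenize(text)}
        index.append(ids)
    return index

vocab = {}  # hypothetical vocabulary, grown as tokens are seen
docs = [
    "einstein developed relativity",
    "newton studied gravity",
    "tesla studied electricity",
]
bot_index = build_bot_index(docs, vocab)
print(bot_index)
```

Each entry records only which tokens are present, so the index stores no weights.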

# Build the bag-of-token index for the passages
ir.build_index(passages, index_type="bag_of_token")
print(ir.index)

# Output:
# Index Type      : BoTIndex
# Vector Type     : torch.sparse_csr
# Vector Shape    : torch.Size([3, 29523])
# Vector Device   : cuda:0
# Number of Texts : 3

# Search top-k results from bag-of-token index, and embed and rerank them on-the-fly
queries = [query]
results = ir.retrieve(queries, k=3, rerank=True)
print(results)

# Output:
# SearchResults(
#   ids=tensor([0, 2, 1], device='cuda:0'),
#   scores=tensor([97.2964, 39.7844, 37.6955], device='cuda:0')
# )
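The two-stage flow described in the comment above (search the bag-of-token index, then embed and rerank the candidates) can be sketched in plain Python as follows. This is an illustration under assumptions, not the vsearch implementation: `bot_score`, `weighted_score`, and `search_then_rerank` are hypothetical helpers, the vectors are toy data, and `embed_passage` stands in for a passage encoder.

```python
def bot_score(q_vec, token_ids):
    """Score against a binary passage vector: sum query weights of present tokens."""
    return sum(w for t, w in q_vec.items() if t in token_ids)

def weighted_score(q_vec, p_vec):
    """Inner product with a weighted passage vector ({token_id: weight})."""
    return sum(w * p_vec.get(t, 0.0) for t, w in q_vec.items())

def search_then_rerank(q_vec, bot_index, embed_passage, k):
    # Stage 1: cheap candidate search over the binary token index.
    order = sorted(
        range(len(bot_index)),
        key=lambda i: bot_score(q_vec, bot_index[i]),
        reverse=True,
    )
    candidates = order[:k]
    # Stage 2: embed only the candidates and re-score them.
    rescored = [(i, weighted_score(q_vec, embed_passage(i))) for i in candidates]
    rescored.sort(key=lambda pair: pair[1], reverse=True)
    return rescored

# Toy data: a query vector, binary token sets, and weighted passage vectors.
q_vec = {0: 1.0, 1: 2.0}
bot_index = [{0, 1}, {1, 2}, {3}]
weighted = [{0: 0.5, 1: 0.5}, {1: 2.0, 2: 1.0}, {3: 1.0}]

results = search_then_rerank(q_vec, bot_index, lambda i: weighted[i], k=2)
print(results)  # [(1, 4.0), (0, 1.5)]
```

In this toy case the binary stage ranks passage 0 first, while reranking with weighted vectors promotes passage 1. Only the k candidates are embedded at query time.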

Training Details

Please refer to our paper at https://arxiv.org/pdf/2405.01924.

Citation

If you find our paper or models helpful, please consider citing it as follows:

@article{zhou2024semi,
  title={Semi-Parametric Retrieval via Binary Token Index},
  author={Zhou, Jiawei and Dong, Li and Wei, Furu and Chen, Lei},
  journal={arXiv preprint arXiv:2405.01924},
  year={2024}
}
