svdr-nq
Semi-Parametric Retrieval via Binary Token Index. Jiawei Zhou, Li Dong, Furu Wei, Lei Chen, arXiv 2024
The model is BERT-based with 12 layers and an embedding size of 29,523, derived from the BERT vocabulary of 30,522 with 999 unused tokens excluded.
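As a quick sanity check on that figure, the arithmetic can be reproduced in a couple of lines. This is only a sketch using the two numbers quoted above; the constant names are made up for illustration. The result matches the vector width of 29523 shown in the index printouts later in this card.

```python
# Sizes quoted in the model description; the constant names are illustrative.
BERT_VOCAB_SIZE = 30_522  # size of the BERT vocabulary
NUM_EXCLUDED = 999        # unused tokens dropped from the embedding space

embedding_size = BERT_VOCAB_SIZE - NUM_EXCLUDED
print(embedding_size)  # 29523
```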
Quick Start
Download and install the vsearch repo:
git clone git@github.com:jzhoubu/vsearch.git
cd vsearch
poetry install
poetry shell
The example below encodes a query and a list of passages, then computes their similarity scores.
import torch
from src.ir import Retriever
query = "Who first proposed the theory of relativity?"
passages = [
"Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.",
"Sir Isaac Newton FRS (25 December 1642 – 20 March 1727) was an English polymath active as a mathematician, physicist, astronomer, alchemist, theologian, and author who was described in his time as a natural philosopher.",
"Nikola Tesla (10 July 1856 – 7 January 1943) was a Serbian-American inventor, electrical engineer, mechanical engineer, and futurist. He is known for his contributions to the design of the modern alternating current (AC) electricity supply system."
]
ir = Retriever.from_pretrained("vsearch/svdr-msmarco")
ir = ir.to("cuda")
# Embed the query and passages
q_emb = ir.encoder_q.embed(query) # Shape: [1, V]
p_emb = ir.encoder_p.embed(passages) # Shape: [3, V]
scores = q_emb @ p_emb.t()
print(scores)
# Output:
# tensor([[97.2964, 39.7844, 37.6955]], device='cuda:0')
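The matrix product `q_emb @ p_emb.t()` above is a batch of dot products over the vocabulary dimension `V`. A minimal pure-Python sketch of the same computation follows, assuming each embedding is sparse and can be written as a token-to-weight mapping. The helper names (`dot`, `score_matrix`) and all weights are invented for illustration and are not part of vsearch.

```python
from typing import Dict, List

SparseVec = Dict[str, float]  # token -> weight; absent tokens have weight 0

def dot(q: SparseVec, p: SparseVec) -> float:
    """Dot product over the tokens that both sparse vectors contain."""
    if len(q) > len(p):
        q, p = p, q  # iterate over the smaller mapping
    return sum((w * p[t] for t, w in q.items() if t in p), 0.0)

def score_matrix(queries: List[SparseVec], passages: List[SparseVec]) -> List[List[float]]:
    """One row per query, one column per passage (like q_emb @ p_emb.t())."""
    return [[dot(q, p) for p in passages] for q in queries]

# Toy weights, made up for this example.
toy_q_emb = [{"relativity": 2.0, "theory": 1.5, "proposed": 0.5}]
toy_p_emb = [
    {"einstein": 2.0, "relativity": 3.0, "theory": 2.0, "physicist": 1.0},
    {"newton": 2.0, "physicist": 1.0, "mathematician": 1.0},
    {"tesla": 2.0, "inventor": 1.5, "theory": 0.5},
]

toy_scores = score_matrix(toy_q_emb, toy_p_emb)
print(toy_scores)  # [[9.0, 0.0, 0.75]]
```

With one query and three passages, the result has shape [1, 3], the same layout as the `scores` tensor printed above.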
Building Embedding-based Index for Search
The examples below build an index for large-scale retrieval, save and load it, and search it.
# Build the sparse index for the passages
ir.build_index(passages, index_type="sparse")
print(ir.index)
# Output:
# Index Type : SparseIndex
# Vector Type : torch.sparse_csr
# Vector Shape : torch.Size([3, 29523])
# Vector Device : cuda:0
# Number of Texts : 3
# Save the index to disk
index_file = "/path/to/index.npz"
ir.save_index(index_file)
# Load the index from disk
index_file = "/path/to/index.npz"
data_file = "/path/to/texts.jsonl"
ir.load_index(index_file=index_file, data_file=data_file)
# Search top-k results for queries
queries = [query]
results = ir.retrieve(queries, k=3)
print(results)
# Output:
# SearchResults(
# ids=tensor([[0, 1, 2]], device='cuda:0'),
# scores=tensor([[97.2458, 39.7507, 37.6407]], device='cuda:0')
# )
query_id = 0
top1_psg_id = results.ids[query_id][0]
top1_psg = ir.index.get_sample(top1_psg_id)
print(top1_psg)
# Output:
# Albert Einstein (14 March 1879 – 18 April 1955) was a German-born theoretical physicist who is widely held to be one of the greatest and most influential scientists of all time. He is best known for developing the theory of relativity.
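To make the `torch.sparse_csr` vector type and the top-k search above more concrete, here is a small pure-Python sketch, not the vsearch implementation. It stores a toy passage matrix in CSR form (row pointers, column indices, and values, which PyTorch calls `crow_indices`, `col_indices`, and `values`), scores each row against a dense query vector, and picks the best `k` rows. The function names, the 5-token vocabulary, and all numbers are assumptions made up for the example.

```python
import heapq
from typing import List, Tuple

def to_csr(dense: List[List[float]]) -> Tuple[List[int], List[int], List[float]]:
    """Convert dense rows to CSR: row pointers, column indices, non-zero values."""
    crow, col, val = [0], [], []
    for row in dense:
        for j, x in enumerate(row):
            if x != 0.0:
                col.append(j)
                val.append(x)
        crow.append(len(col))  # offset where the next row starts
    return crow, col, val

def csr_scores(crow: List[int], col: List[int], val: List[float],
               query: List[float]) -> List[float]:
    """Dot product of every CSR row with a dense query vector."""
    scores = []
    for i in range(len(crow) - 1):
        s = 0.0
        for k in range(crow[i], crow[i + 1]):
            s += val[k] * query[col[k]]
        scores.append(s)
    return scores

def top_k(scores: List[float], k: int) -> Tuple[List[int], List[float]]:
    """Return the ids and scores of the k highest-scoring rows, best first."""
    best = heapq.nlargest(k, enumerate(scores), key=lambda item: item[1])
    return [i for i, _ in best], [s for _, s in best]

# Three toy passages over a 5-token vocabulary (values invented).
passage_matrix = [
    [0.0, 3.0, 0.0, 2.0, 0.0],
    [1.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.5, 4.0],
]
crow, col, val = to_csr(passage_matrix)
query_vec = [0.0, 2.0, 0.0, 1.0, 0.0]

ids, vals = top_k(csr_scores(crow, col, val, query_vec), k=2)
print(crow, col, val)  # [0, 2, 3, 5] [1, 3, 0, 3, 4] [3.0, 2.0, 1.0, 0.5, 4.0]
print(ids, vals)       # [0, 2] [8.0, 0.5]
```

Only the non-zero entries are stored and visited, which is why a sparse layout suits vocabulary-sized vectors where most weights are zero.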
Building Bag-of-token Index for Search
Our framework also supports using tokenization itself as the index (i.e., a bag-of-token index). This index operates on CPU and reduces indexing time and storage requirements by over 90% compared to an embedding-based index.
# Build the bag-of-token index for the passages
ir.build_index(passages, index_type="bag_of_token")
print(ir.index)
# Output:
# Index Type : BoTIndex
# Vector Type : torch.sparse_csr
# Vector Shape : torch.Size([3, 29523])
# Vector Device : cuda:0
# Number of Texts : 3
# Search top-k results from bag-of-token index, and embed and rerank them on-the-fly
queries = [query]
results = ir.retrieve(queries, k=3, rerank=True)
print(results)
# Output:
# SearchResults(
# ids=tensor([0, 1, 2], device='cuda:0'),
# scores=tensor([97.2964, 39.7844, 37.6955], device='cuda:0')
# )
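The two-stage flow above (cheap search over a bag-of-token index, then embedding and reranking the candidates on the fly) can be sketched in pure Python. This is an illustration under stated assumptions, not the vsearch implementation: each passage is represented as a binary set of tokens, the query carries per-token weights, a lowercase whitespace split stands in for the BERT tokenizer, and `exact_scores` stands in for the scores a real passage encoder would produce. All names and numbers are hypothetical.

```python
from typing import Callable, Dict, List, Set, Tuple

def tokenize(text: str) -> Set[str]:
    """Stand-in tokenizer: lowercase, split on whitespace, strip punctuation."""
    return {tok.strip(".,?!()") for tok in text.lower().split()} - {""}

def build_bot_index(passages: List[str]) -> List[Set[str]]:
    """Bag-of-token index: one token set per passage, no encoder needed."""
    return [tokenize(p) for p in passages]

def retrieve(query_weights: Dict[str, float], index: List[Set[str]],
             k: int) -> List[Tuple[int, float]]:
    """Score = sum of query weights for tokens present in the passage."""
    scored = [
        (i, sum((w for t, w in query_weights.items() if t in toks), 0.0))
        for i, toks in enumerate(index)
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]

def rerank(candidates: List[Tuple[int, float]],
           embed_and_score: Callable[[int], float]) -> List[Tuple[int, float]]:
    """Rescore only the retrieved candidates with a more expensive scorer."""
    rescored = [(i, embed_and_score(i)) for i, _ in candidates]
    return sorted(rescored, key=lambda item: item[1], reverse=True)

toy_passages = [
    "Einstein developed the theory of relativity.",
    "Newton was a mathematician and physicist.",
    "Tesla designed the alternating current system.",
]
bot_index = build_bot_index(toy_passages)

# Hypothetical query weights and encoder scores, invented for this example.
toy_query_weights = {"theory": 1.5, "relativity": 2.0, "the": 0.5}
exact_scores = {0: 9.0, 1: 4.0, 2: 3.5}

candidates = retrieve(toy_query_weights, bot_index, k=2)
reranked = rerank(candidates, exact_scores.__getitem__)
print(candidates)  # [(0, 4.0), (2, 0.5)]
print(reranked)    # [(0, 9.0), (2, 3.5)]
```

Building the index here requires only tokenization, and the expensive scorer runs on `k` candidates rather than on the whole collection.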
Training Details
Please refer to our paper at https://arxiv.org/pdf/2405.01924.
Citation
If you find our paper or models helpful, please consider citing it as follows:
@article{zhou2024semi,
title={Semi-Parametric Retrieval via Binary Token Index},
author={Zhou, Jiawei and Dong, Li and Wei, Furu and Chen, Lei},
journal={arXiv preprint arXiv:2405.01924},
year={2024}
}
Model tree for vsearch/svdr-msmarco
Base model
google-bert/bert-base-uncased