A Doc2Query model based on t5-base and trained on MS MARCO. This is the checkpoint released by the original authors, converted to PyTorch format and ready for use in pyterrier_doc2query.

Creating a transformer

import pyterrier as pt
pt.init()  # start the Terrier JVM (needed once per session)
from pyterrier_doc2query import Doc2Query
# load the converted checkpoint by its model id
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')

Transforming documents

import pandas as pd
doc2query(pd.DataFrame([
  {'docno': '0', 'text': 'Hello Terrier!'},
  {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
# docno                                               text                                           querygen
#    0                                     Hello Terrier!  hello terrier what kind of dog is a terrier wh...
#    1  Doc2Query expands queries with potentially rel...  can dodoc2query extend query query? what is do...

Indexing transformed documents

doc2query.append = True # append querygen to text
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer
pipeline.index([
  {'docno': '0', 'text': 'Hello Terrier!'},
  {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
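
To make the effect of `append = True` concrete, here is a minimal pure-Python sketch of what appending generated queries to the document text looks like before indexing. It is an illustration only, not the pyterrier_doc2query implementation: the helper name `append_querygen`, the single-space separator, and the choice to drop the `querygen` field afterwards are all assumptions made for this example.

```python
# Illustrative sketch only: mimics the *effect* of `append = True` on plain dicts.
# The helper name, the separator, and dropping `querygen` are assumptions,
# not the behaviour of pyterrier_doc2query itself.

def append_querygen(rows, text_field="text", gen_field="querygen", sep=" "):
    """Return copies of `rows` with generated queries appended to the text."""
    expanded = []
    for row in rows:
        new_row = dict(row)  # copy so the input rows are left untouched
        generated = new_row.pop(gen_field, "")
        if generated:
            new_row[text_field] = new_row[text_field] + sep + generated
        expanded.append(new_row)
    return expanded


rows = [
    {"docno": "0", "text": "Hello Terrier!", "querygen": "what is a terrier"},
    {"docno": "1", "text": "No queries here."},  # no generated queries
]
expanded = append_querygen(rows)
for row in expanded:
    print(row["docno"], row["text"])
```

The expanded text is what the indexer sees, so a query term that appears only in the generated queries can still match the document.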

Expanding and indexing a dataset

dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
