---
language:
- en
tags:
- retrieval
- document-expansion
- translation
widget:
- text: "The presence of communication amid scientific minds was equally important to the success of the Manhattan Project as scientific intellect was. The only cloud hanging over the impressive achievement of the atomic researchers and engineers is what their success truly meant; hundreds of thousands of innocent lives obliterated."
  example_title: "msmarco-passage"
- text: "0-60 Times - 0-60 | 0 to 60 Times & 1/4 Mile Times | Zero to 60 Car Reviews."
  example_title: "msmarco-passage-v2"
- text: "A small group of politicians believed strongly that the fact that Saddam Hussien remained in power after the first Gulf War was a signal of weakness to the rest of the world, one that invited attacks and terrorism. Shortly after taking power with George Bush in 2000 and after the attack on 9/11, they were able to use the terrorist attacks to justify war with Iraq on this basis and exaggerated threats of the development of weapons of mass destruction. The military strength of the U.S. and the brutality of Saddam's regime led them to imagine that the military and political victory would be relatively easy."
  example_title: "antique"
datasets:
- irds:msmarco-passage
library_name: transformers
---

A Doc2Query model based on `t5-base` and trained on MS MARCO. This is a version of [the checkpoint released by the original authors](https://git.uwaterloo.ca/jimmylin/doc2query-data/raw/master/T5-passage/t5-base.zip), converted to PyTorch format and ready for use in [`pyterrier_doc2query`](https://github.com/terrierteam/pyterrier_doc2query).

**Creating a transformer**

```python
import pyterrier as pt
pt.init()
from pyterrier_doc2query import Doc2Query
doc2query = Doc2Query('macavaney/doc2query-t5-base-msmarco')
```

**Transforming documents**

```python
import pandas as pd
doc2query(pd.DataFrame([
  {'docno': '0', 'text': 'Hello Terrier!'},
  {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
]))
# docno                                               text                                           querygen
#    0                                     Hello Terrier!  hello terrier what kind of dog is a terrier wh...
#    1  Doc2Query expands queries with potentially rel...  can dodoc2query extend query query? what is do...
```

**Indexing transformed documents**

```python
doc2query.append = True # append querygen to text
indexer = pt.IterDictIndexer('./my_index', fields=['text'])
pipeline = doc2query >> indexer
pipeline.index([
  {'docno': '0', 'text': 'Hello Terrier!'},
  {'docno': '1', 'text': 'Doc2Query expands queries with potentially relevant queries.'},
])
```
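
To make the effect of `append = True` concrete, the following is a minimal pure-Python sketch of appending generated queries to the document text before indexing. It is not the `pyterrier_doc2query` implementation: the helper name `append_queries`, the newline separator, and the sample `querygen` string are assumptions made for illustration only.

```python
from typing import Dict, Iterable, Iterator


def append_queries(
    docs: Iterable[Dict[str, str]],
    text_field: str = 'text',
    query_field: str = 'querygen',
    sep: str = '\n',  # assumed separator; the library may join differently
) -> Iterator[Dict[str, str]]:
    """Yield copies of each document with its generated queries appended to the text."""
    for doc in docs:
        expanded = dict(doc)  # copy so the caller's dict is left untouched
        queries = expanded.get(query_field, '')
        if queries:
            expanded[text_field] = expanded[text_field] + sep + queries
        yield expanded


# Hypothetical input: the querygen value here is a made-up placeholder,
# not real model output.
docs = [
    {'docno': '0', 'text': 'Hello Terrier!', 'querygen': 'what kind of dog is a terrier'},
]
expanded_docs = list(append_queries(docs))
print(expanded_docs[0]['text'])
```

Because the indexer above only reads the `text` field, appending is what lets the generated queries contribute terms to the index.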

**Expanding and indexing a dataset**

```python
dataset = pt.get_dataset('irds:vaswani')
pipeline.index(dataset.get_corpus_iter())
```

## References

  - [Nogueira20]: Rodrigo Nogueira and Jimmy Lin. From doc2query to docTTTTTquery. https://cs.uwaterloo.ca/~jimmylin/publications/Nogueira_Lin_2019_docTTTTTquery-v2.pdf
  - [Macdonald20]: Craig Macdonald and Nicola Tonellotto. Declarative Experimentation in Information Retrieval using PyTerrier. In Proceedings of ICTIR 2020. https://arxiv.org/abs/2007.14271