Is it possible to load the generated index from a folder?

#2
by juliatessler - opened

Hi! I found this project by chance and liked it a lot.

However, it's worth noting that generating an index can be quite expensive (in time and computation), and I haven't found a way to load a PyTerrier index from a file after it has been generated.
For example, I was testing doc2query with the BeIR/trec-covid dataset on Google Colab with a T4 GPU. The index/query generation step took around 5 hours, but I wasn't able to retrieve the index once my runtime shut down due to inactivity. Is it possible to do that?

Terrier Team org

The indexing pipeline described at https://github.com/terrierteam/pyterrier_doc2query#using-doc2query-for-indexing is decomposable, so you could break the indexing into chunks.

Here's a sketch of an indexing pipeline that writes the doc2query output to disk, and then reloads it for indexing:

import pyterrier as pt

fout = pt.io.autoopen("doc2query.gz", "wt")
def _write(df):
  # one JSON object per line (JSONL); pandas adds no trailing newline, so append one
  fout.write(df[['docno', 'querygen']].to_json(orient='records', lines=True) + "\n")
  return df
doc2query_writer = doc2query >> pt.apply.generic(_write)
doc2query_writer.transform_iter(dataset.get_corpus_iter())
fout.close()

def _records():
  import json
  with pt.io.autoopen("doc2query.gz", "rt") as fin:
    for line in fin:
      yield json.loads(line)

pt.IterDictIndexer(...).index(_records())
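To answer the original question directly: the Terrier index itself is written to a folder, so once built it can be reopened in a later session without re-running doc2query. A minimal sketch, assuming the index was written to a hypothetical ./trec-covid-index folder:

import pyterrier as pt

# build once, giving the indexer a concrete folder on disk
pt.IterDictIndexer("./trec-covid-index").index(_records())

# ...later, in a fresh session: reopen the index from that folder
bm25 = pt.BatchRetrieve("./trec-covid-index/data.properties", wmodel="BM25")

On Colab, keep doc2query.gz and the index folder somewhere persistent (e.g. a mounted Google Drive directory), since the runtime's local disk is wiped when it shuts down.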
Terrier Team org

BTW, with a TPU you may be able to increase the batch_size argument to speed up generation.
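For instance, a sketch assuming the Doc2Query constructor from pyterrier_doc2query (check the README for the arguments available in your version):

from pyterrier_doc2query import Doc2Query

# larger batches trade accelerator memory for generation throughput
doc2query = Doc2Query(batch_size=32)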
