Is it possible to load the generated index from a folder?
Hi! I found this project by chance and liked it a lot.
However, it's worth noting that generating an index can be quite expensive (in time and computation) and I haven't found a way to load a PyTerrier index from a file after generation.
For example, I was testing doc2query on the BeIR/trec-covid dataset on Google Colab with a T4 GPU. The index/query generation step took about 5 hours, but I wasn't able to retrieve the index once my runtime shut down due to inactivity. Is it possible to do that?
The indexing pipeline described at https://github.com/terrierteam/pyterrier_doc2query#using-doc2query-for-indexing is decomposable. That could allow you to break the indexing into chunks?
Here's a sketch of an indexing pipeline that writes the doc2query output to disk, and then reloads it for indexing:
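    import json
    import pyterrier as pt

    # Stage 1: run doc2query (and dataset, both defined as in the README) and save the output as JSON lines.
    fout = pt.io.autoopen("doc2query.gz", "wt")
    def _write_batch(df):
        # add 'text' to the columns if the indexer should also see the original document text
        for rec in df[['docno', 'querygen']].to_dict(orient='records'):
            fout.write(json.dumps(rec) + "\n")
        return df  # pt.apply.generic expects a DataFrame back
    doc2query_writer = doc2query >> pt.apply.generic(_write_batch)
    doc2query_writer.transform_iter(dataset.get_corpus_iter())
    fout.close()

    # Stage 2: reload the saved records and index them.
    def _records():
        with pt.io.autoopen("doc2query.gz", "rt") as fin:
            for jsonl in fin:
                yield json.loads(jsonl)
    pt.IterDictIndexer(...).index(_records())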
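To make the "break it into chunks" idea concrete, here is a minimal standard-library sketch of a resumable version of the same save-and-reload step. It is only an illustration, not part of PyTerrier: the function names (already_done, expand_in_chunks, load_records) and the expand callable are hypothetical, and expand stands in for whatever runs doc2query over a list of documents.

```python
import gzip
import json
import os
from itertools import islice


def already_done(path):
    """Return the set of docnos already saved in the gzipped JSON-lines file."""
    if not os.path.exists(path):
        return set()
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        return {json.loads(line)["docno"] for line in fin if line.strip()}


def expand_in_chunks(docs, expand, path, chunk_size=1000):
    """Run `expand` over `docs` chunk by chunk, appending the results to `path`.

    `expand` is a hypothetical callable: it takes a list of dicts and returns
    a list of dicts that each carry at least a 'docno' key.
    """
    done = already_done(path)
    todo = (d for d in docs if d["docno"] not in done)  # skip finished documents
    while True:
        chunk = list(islice(todo, chunk_size))
        if not chunk:
            break
        # Append mode: each chunk becomes a new gzip member in the same file,
        # which Python's gzip module reads back as one continuous stream.
        with gzip.open(path, "at", encoding="utf-8") as fout:
            for record in expand(chunk):
                fout.write(json.dumps(record) + "\n")


def load_records(path):
    """Yield the saved records, e.g. to feed an iter-dict indexer."""
    with gzip.open(path, "rt", encoding="utf-8") as fin:
        for line in fin:
            yield json.loads(line)
```

With PyTerrier, expand could plausibly be something like lambda chunk: doc2query.transform(pd.DataFrame(chunk)).to_dict("records"), and load_records(path) could be passed to the indexer in place of _records(). If path points at persistent storage, a restarted runtime only has to process the documents that are still missing. A runtime killed in the middle of a write may leave a truncated final chunk, so this sketch is not crash-proof.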
BTW, with a TPU you may be able to increase the batch_size argument to increase speed.
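On the original question of loading a finished index from a folder: as far as I know, a Terrier index written to a directory can be reopened with pt.IndexFactory.of. The sketch below assumes the index was written to a directory that survives the runtime (on Colab, for example, a mounted Google Drive folder). The helper names and the path in the comment are hypothetical.

```python
import os


def has_saved_index(index_dir):
    """A Terrier index directory contains a data.properties file."""
    return os.path.exists(os.path.join(index_dir, "data.properties"))


def load_saved_index(index_dir):
    """Open a previously built Terrier index and return a BM25 retriever."""
    # Imported here so the helpers can be defined without PyTerrier installed.
    import pyterrier as pt

    # Older PyTerrier versions need pt.init() to have been called first.
    index = pt.IndexFactory.of(index_dir)
    return pt.BatchRetrieve(index, wmodel="BM25")


# Hypothetical usage; the path is an assumption, not from the thread:
# index_dir = "/content/drive/MyDrive/trec-covid-index"
# if has_saved_index(index_dir):
#     bm25 = load_saved_index(index_dir)
#     results = bm25.search("covid transmission")
```

The important part is where the index is written in the first place: passing a persistent path to the indexer means the multi-hour generation step does not have to be repeated after the runtime shuts down.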