csdc-atl/doc2query

This is a doc2query model based on T5 (also known as docT5query).

It can be used for:

  • Document expansion: Generate 20-40 queries for each paragraph and index both the paragraphs and the generated queries in a standard BM25 index such as Elasticsearch, OpenSearch, or Lucene. The generated queries help to close the lexical gap of lexical search, since they contain synonyms. They also re-weight terms, giving important words a higher weight even if they appear seldom in a paragraph. In our BEIR paper we showed that BM25+docT5query is a powerful search engine. The BEIR repository contains an example of how to use docT5query with Pyserini. A minimal indexing sketch follows this list.
  • Domain-specific training data generation: The model can be used to generate training data for learning an embedding model. Our GPL paper and the GPL example on SBERT.net show how to use the model to generate (query, text) pairs for a given collection of unlabeled texts. These pairs can then be used to train powerful dense embedding models. A pair-generation sketch also follows this list.
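
Below is a minimal document-expansion sketch, not part of the original card: it assumes elasticsearch-py 8.x, an Elasticsearch server on localhost, and a hypothetical index named passages. The generation settings mirror the Usage section below.

from elasticsearch import Elasticsearch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained('csdc-atl/doc2query')
model = AutoModelForSeq2SeqLM.from_pretrained('csdc-atl/doc2query')
es = Elasticsearch('http://localhost:9200')  # assumption: local ES 8.x instance

def expand(paragraph, num_queries=20):
    # Truncate to the model's 768-word-piece training limit (see Training below)
    input_ids = tokenizer.encode(paragraph, max_length=768, truncation=True, return_tensors='pt')
    with torch.no_grad():
        outputs = model.generate(input_ids=input_ids, max_length=64, do_sample=True,
                                 top_p=0.95, top_k=10, num_return_sequences=num_queries)
    return [tokenizer.decode(o, skip_special_tokens=True) for o in outputs]

paragraph = '...'  # your passage text
# Index the passage together with its generated queries so BM25 can match either field
es.index(index='passages', document={
    'text': paragraph,
    'generated_queries': ' '.join(expand(paragraph)),
})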
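
And a sketch of generating (query, text) training pairs for a dense embedding model, reusing expand() from the sketch above (unlabeled_texts is a hypothetical placeholder):

unlabeled_texts = ['段落一', '段落二']  # your collection of unlabeled passages
train_pairs = [(query, text)
               for text in unlabeled_texts
               for query in expand(text, num_queries=3)]
# train_pairs can now feed a dense-embedding trainer, e.g. with sentence-transformers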

Usage

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

model_name = 'csdc-atl/doc2query'
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

text = "2014年12月9日,于洋转会至中超联赛球队广州富力。2015赛季初,于洋并没有出场机会。韩国中后卫张贤秀受伤后,主教练选择用金洋洋代替。足协杯4比0战胜贵州人和的比赛中,金洋洋打入两球。赛后,中国足协认定金洋洋在庆祝进球时使用侮辱性手势,将他禁赛四场。之后对阵山东鲁能的联赛,于洋迎来出场机会,首次代表广州富力出战正式比赛。从此开始,于洋得到了较为充足的出场时间。2015赛季于洋中超联赛出场17次、亚冠联赛1次,这18次出场中有17次为首发。2016赛季,于洋成为广州富力三后卫体系的主力,还曾担任队长。这个赛季,他在中超联赛出场25次、足协杯出场5次,联赛的25次出场中含22次首发。效力广州富力期间,他于2015年重返中国国家足球队。\n2016年12月30日,广州富力宣布于洋离队,加盟北京国安。有媒体透露,转会费在5000万至6000万元人民币之间。回归北京国安之后,于洋成为中后卫位置上的主力。2018年3月31日北京国安与北京人和的北京德比上,于洋第100次代表北京国安出场。他在比赛中打入一球,助球队4比0获胜。"


def create_queries(para):
    # Truncate to the model's training limit of 768 word pieces (see Training below)
    input_ids = tokenizer.encode(para, max_length=768, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Here we use top_p / top_k random sampling. It generates more diverse queries, but of lower quality
        sampling_outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            do_sample=True,
            top_p=0.95,
            top_k=10,
            num_return_sequences=10
        )

    print("Paragraph:")
    print(para)

    print("\nSampling Outputs:")
    for i in range(len(sampling_outputs)):
        query = tokenizer.decode(sampling_outputs[i], skip_special_tokens=True)
        print(f'{i + 1}: {query}')

create_queries(text)

# 1: 于洋在2016年重返中国国家足球队是在哪个球队效力?
# 2: 于洋在2018年3月31日的北京德比上打入了几个球?
# 3: 于洋在哪些比赛中有出场机会?
# 4: 于洋在哪个比赛中打入了两球?
# 5: 于洋在2015赛季中超联赛中出场次数和亚冠联赛中的首发次数分别是多少?
# 6: 于洋在哪个比赛中打入了两球,帮助球队赢了这场比赛?
# 7: 于洋在2018年3月31日北京国安与北京人和的北京德比上打进了几个进球?
# 8: 于洋在2015赛季中超联赛和亚冠联赛中出场次数分别是多少?
# 9: 于洋在广州富力期间曾担任什么职位?
# 10: 于洋在哪些比赛中有出场机会?

Note: model.generate() is non-deterministic with top_k/top_p sampling, so it produces different queries each time you run it.
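
If you need reproducible queries, a deterministic alternative is beam search, which tends to yield higher-quality but less diverse output. A sketch, not part of the original card:

def create_queries_beam(para):
    input_ids = tokenizer.encode(para, max_length=768, truncation=True, return_tensors='pt')
    with torch.no_grad():
        # Beam search is deterministic: the same input always yields the same queries
        beam_outputs = model.generate(
            input_ids=input_ids,
            max_length=64,
            num_beams=5,
            no_repeat_ngram_size=2,
            num_return_sequences=5
        )
    return [tokenizer.decode(o, skip_special_tokens=True) for o in beam_outputs]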

Training

This model was fine-tuned from Langboat/mengzi-t5-base.

The input text was truncated to 768 word pieces; output queries were generated with up to 64 word pieces.

The model was trained on (query, positive, negative) triplets from the CSDC query-document retrieval dataset.
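
For reference, a minimal fine-tuning sketch following the standard docT5query recipe (an assumption, not the actual CSDC training code: the positive passage is the encoder input and the query is the decoder target; train_pairs is a hypothetical list of (query, passage) pairs):

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import torch

tokenizer = AutoTokenizer.from_pretrained('Langboat/mengzi-t5-base')
model = AutoModelForSeq2SeqLM.from_pretrained('Langboat/mengzi-t5-base')
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

train_pairs = [('查询文本', '段落文本')]  # hypothetical (query, positive passage) pairs

model.train()
for query, passage in train_pairs:
    inputs = tokenizer(passage, max_length=768, truncation=True, return_tensors='pt')
    labels = tokenizer(query, max_length=64, truncation=True, return_tensors='pt').input_ids
    loss = model(**inputs, labels=labels).loss  # standard seq2seq cross-entropy
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()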

