---
license: bsd-3-clause
datasets:
  - mocha
language:
  - en
---

# Answer Overlap Module of QAFactEval Metric

This is the span scorer module used in the RQUGE paper to evaluate generated questions in the question generation task. The model was originally used in QAFactEval to compute the semantic similarity of a generated answer span, given the reference answer, question, and context in the question answering task. It outputs an answer overlap score between 1 and 5. The scorer is initialized from Jia et al. (2021) and trained on the MOCHA dataset, which consists of 40k crowdsourced judgments on QA model outputs.

The input to the model is defined as:

```
[CLS] question [q] gold answer [r] pred answer [c] context
```
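For concreteness, a hypothetical filled-in example (the QA pair below is made up purely for illustration) is flattened into a single string, using the separator tokens from the script in the next section; the `[CLS]` token is added by the tokenizer:

```python
# Hypothetical example values, made up for illustration only.
question = "Who wrote Nineteen Eighty-Four?"
gold_answer = "George Orwell"
pred_answer = "Orwell"
context = "Nineteen Eighty-Four is a novel published by George Orwell in 1949."

# Flattened model input, following the template above.
input_sp = f"{question} <q> {gold_answer} <r> {pred_answer} <c> {context}"
print(input_sp)
# Who wrote Nineteen Eighty-Four? <q> George Orwell <r> Orwell <c> Nineteen Eighty-Four is a novel ...
```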

## Generation

You can use the following script to get the semantic similarity of the predicted answer given the gold answer, context, and question.

```python
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the answer overlap scorer and its tokenizer
sp_scorer = AutoModelForSequenceClassification.from_pretrained("alirezamsh/quip-512-mocha")
tokenizer_sp = AutoTokenizer.from_pretrained("alirezamsh/quip-512-mocha")
sp_scorer.eval()

# Fill in your own example
pred_answer = ""
gold_answer = ""
question = ""
context = ""

# Build the flattened input following the template above
input_sp = f"{question} <q> {gold_answer} <r> {pred_answer} <c> {context}"

inputs = tokenizer_sp(
    input_sp,
    max_length=512,
    truncation=True,
    padding="max_length",
    return_tensors="pt",
)

outputs = sp_scorer(input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])
print(outputs)
```
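The forward pass returns a standard `SequenceClassifierOutput`. Assuming the classification head is a single-logit regression head whose raw logit is the 1-5 overlap score, as in LERC-style scorers (an assumption about this checkpoint, not something stated above), a minimal sketch for extracting a scalar score from the snippet above is:

```python
# Continues from the snippet above.
# Assumption: the head is a single-logit regression head and its raw logit
# is the 1-5 answer overlap score (LERC-style), so no softmax is applied.
score = outputs.logits.squeeze(-1).item()
print(f"Answer overlap score: {score:.2f}")
```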

## Citations

```bibtex
@inproceedings{fabbri-etal-2022-qafacteval,
    title = "{QAF}act{E}val: Improved {QA}-Based Factual Consistency Evaluation for Summarization",
    author = "Fabbri, Alexander  and
      Wu, Chien-Sheng  and
      Liu, Wenhao  and
      Xiong, Caiming",
    booktitle = "Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies",
    month = jul,
    year = "2022",
    address = "Seattle, United States",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.187",
    doi = "10.18653/v1/2022.naacl-main.187",
    pages = "2587--2601",
    abstract = "Factual consistency is an essential quality of text summarization models in practical settings. Existing work in evaluating this dimension can be broadly categorized into two lines of research, entailment-based and question answering (QA)-based metrics, and different experimental setups often lead to contrasting conclusions as to which paradigm performs the best. In this work, we conduct an extensive comparison of entailment and QA-based metrics, demonstrating that carefully choosing the components of a QA-based metric, especially question generation and answerability classification, is critical to performance. Building on those insights, we propose an optimized metric, which we call QAFactEval, that leads to a 14{\%} average improvement over previous QA-based metrics on the SummaC factual consistency benchmark, and also outperforms the best-performing entailment-based metric. Moreover, we find that QA-based and entailment-based metrics can offer complementary signals and be combined into a single metric for a further performance boost.",
}

@misc{mohammadshahi2022rquge,
    title={RQUGE: Reference-Free Metric for Evaluating Question Generation by Answering the Question},
    author={Alireza Mohammadshahi and Thomas Scialom and Majid Yazdani and Pouya Yanki and Angela Fan and James Henderson and Marzieh Saeidi},
    year={2022},
    eprint={2211.01482},
    archivePrefix={arXiv},
    primaryClass={cs.CL}
}
```