omidf's picture
Duplicate from evaluate-metric/squad
63e7fe5
|
raw
history blame
5.14 kB
metadata
title: SQuAD
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.0.2
app_file: app.py
pinned: false
tags:
  - evaluate
  - metric
description: >-
  This metric wrap the official scoring script for version 1 of the Stanford
  Question Answering Dataset (SQuAD).

  Stanford Question Answering Dataset (SQuAD) is a reading comprehension
  dataset, consisting of questions posed by crowdworkers on a set of Wikipedia
  articles, where the answer to every question is a segment of text, or span,
  from the corresponding reading passage, or the question might be unanswerable.
duplicated_from: evaluate-metric/squad

Metric Card for SQuAD

Metric description

This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).

SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage, or the question might be unanswerable.

How to use

The metric takes two files or two lists of question-answers dictionaries as inputs : one with the predictions of the model and the other with the references to be compared to:

from evaluate import load
squad_metric = load("squad")
results = squad_metric.compute(predictions=predictions, references=references)

Output values

This metric outputs a dictionary with two values: the average exact match score and the average F1 score.

{'exact_match': 100.0, 'f1': 100.0}

The range of exact_match is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.

The range of f1 is 0-1 -- its lowest possible value is 0, if either the precision or the recall is 0, and its highest possible value is 1.0, which means perfect precision and recall.

Values from popular papers

The original SQuAD paper reported an F1 score of 51.0% and an Exact Match score of 40.0%. They also report that human performance on the dataset represents an F1 score of 90.5% and an Exact Match score of 80.3%.

For more recent model performance, see the dataset leaderboard.

Examples

Maximal values for both exact match and F1 (perfect match):

from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 100.0, 'f1': 100.0}

Minimal values for both exact match and F1 (no match):

from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 0.0, 'f1': 0.0}

Partial match (2 out of 3 answers correct) :

from evaluate import load
squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}, {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'},  {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}, {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'}, {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
results
{'exact_match': 66.66666666666667, 'f1': 66.66666666666667}

Limitations and bias

This metric works only with datasets that have the same format as SQuAD v.1 dataset.

The SQuAD dataset does contain a certain amount of noise, such as duplicate questions as well as missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflect whether models do better on certain types of questions (e.g. who questions) or those that cover a certain gender or geographical area -- carrying out more in-depth error analysis can complement these numbers.

Citation

@inproceedings{Rajpurkar2016SQuAD10,
title={SQuAD: 100, 000+ Questions for Machine Comprehension of Text},
author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
booktitle={EMNLP},
year={2016}
}

Further References