---
title: SQuAD
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric wraps the official scoring script for version 1 of the Stanford Question Answering Dataset (SQuAD).
  Stanford Question Answering Dataset (SQuAD) is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.
---

# Metric Card for SQuAD

## Metric description

This metric wraps the official scoring script for version 1 of the [Stanford Question Answering Dataset (SQuAD)](https://huggingface.co/datasets/squad).

SQuAD is a reading comprehension dataset, consisting of questions posed by crowdworkers on a set of Wikipedia articles, where the answer to every question is a segment of text, or span, from the corresponding reading passage.

## How to use

The metric takes two files or two lists of question-answer dictionaries as inputs: one with the predictions of the model and the other with the references to compare them against:

```python
from evaluate import load

squad_metric = load("squad")
# `predictions` and `references` are lists of dictionaries; see the Examples section for their format.
results = squad_metric.compute(predictions=predictions, references=references)
```

## Output values

This metric outputs a dictionary with two values: the average exact match score and the average [F1 score](https://huggingface.co/metrics/f1).

```
{'exact_match': 100.0, 'f1': 100.0}
```

Both values are reported on a 0-100 scale, as in the output above.

The range of `exact_match` is 0-100, where 0.0 means no answers were matched and 100.0 means all answers were matched.

The range of `f1` is 0-100. Its lowest possible value is 0, if either the precision or the recall is 0 for every answer, and its highest possible value is 100.0, which means perfect precision and recall.
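
To make the two scores concrete, here is a minimal sketch of how they can be computed per answer and then averaged. It assumes the normalization used by the official v1.1 evaluation script (lowercasing, removing punctuation and the articles a/an/the, collapsing whitespace) and takes the best score over all reference answers for a question. The function names (`normalize_answer`, `exact_match`, `token_f1`, `squad_v1_scores`) are illustrative and are not part of the `evaluate` API; treating a missing prediction as a score of zero is also an assumption of this sketch.

```python
import re
import string
from collections import Counter

PUNCTUATION = set(string.punctuation)


def normalize_answer(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCTUATION)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(reference))


def token_f1(prediction, reference):
    """Token-level F1 between the normalized prediction and reference."""
    pred_tokens = normalize_answer(prediction).split()
    ref_tokens = normalize_answer(reference).split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def squad_v1_scores(predictions, references):
    """Average both scores over all references and scale them to 0-100."""
    pred_by_id = {p["id"]: p["prediction_text"] for p in predictions}
    em_total = 0.0
    f1_total = 0.0
    for ref in references:
        prediction = pred_by_id.get(ref["id"])
        if prediction is None:
            continue  # assumption: a missing prediction contributes zero
        gold_texts = ref["answers"]["text"]
        # A question may have several reference answers; keep the best score.
        em_total += max(exact_match(prediction, gold) for gold in gold_texts)
        f1_total += max(token_f1(prediction, gold) for gold in gold_texts)
    total = len(references)
    return {
        "exact_match": 100.0 * em_total / total,
        "f1": 100.0 * f1_total / total,
    }


# Hypothetical ids and answers, for illustration only.
predictions = [
    {"prediction_text": "the 1976", "id": "q1"},
    {"prediction_text": "Beyoncé", "id": "q2"},
]
references = [
    {"answers": {"answer_start": [97], "text": ["1976"]}, "id": "q1"},
    {"answers": {"answer_start": [233], "text": ["Beyoncé and Bruno Mars"]}, "id": "q2"},
]
print(squad_v1_scores(predictions, references))
```

In this example the first prediction is an exact match once the article is stripped, while the second shares one of the four reference tokens (precision 1.0, recall 0.25, F1 0.4), so `exact_match` is 50.0 and `f1` is about 70.0.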

### Values from popular papers

The [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an F1 score of 51.0% and an Exact Match score of 40.0% for its logistic regression model. It also reported that human performance on the dataset corresponds to an F1 score of 90.5% and an Exact Match score of 80.3%.

For more recent model performance, see the [dataset leaderboard](https://paperswithcode.com/dataset/squad).

## Examples

Maximal values for both exact match and F1 (perfect match):

```python
from evaluate import load

squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
# {'exact_match': 100.0, 'f1': 100.0}
```

Minimal values for both exact match and F1 (no match):

```python
from evaluate import load

squad_metric = load("squad")
predictions = [{'prediction_text': '1999', 'id': '56e10a3be3433e1400422b22'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
# {'exact_match': 0.0, 'f1': 0.0}
```

Partial match (2 out of 3 answers correct):

```python
from evaluate import load

squad_metric = load("squad")
predictions = [{'prediction_text': '1976', 'id': '56e10a3be3433e1400422b22'},
               {'prediction_text': 'Beyonce', 'id': '56d2051ce7d4791d0090260b'},
               {'prediction_text': 'climate change', 'id': '5733b5344776f419006610e1'}]
references = [{'answers': {'answer_start': [97], 'text': ['1976']}, 'id': '56e10a3be3433e1400422b22'},
              {'answers': {'answer_start': [233], 'text': ['Beyoncé and Bruno Mars']}, 'id': '56d2051ce7d4791d0090260b'},
              {'answers': {'answer_start': [891], 'text': ['climate change']}, 'id': '5733b5344776f419006610e1'}]
results = squad_metric.compute(predictions=predictions, references=references)
print(results)
# {'exact_match': 66.66666666666667, 'f1': 66.66666666666667}
```

## Limitations and bias

This metric works only with datasets that have the same format as the [SQuAD v.1 dataset](https://huggingface.co/datasets/squad).

The SQuAD dataset does contain a certain amount of noise, such as duplicate questions and missing answers, but these represent a minority of the 100,000 question-answer pairs. Also, neither exact match nor F1 score reflects whether models do better on certain types of questions (e.g. *who* questions) or on those that cover a certain gender or geographical area. More in-depth error analysis can complement these numbers.

## Citation

    @inproceedings{Rajpurkar2016SQuAD10,
      title={SQuAD: 100,000+ Questions for Machine Comprehension of Text},
      author={Pranav Rajpurkar and Jian Zhang and Konstantin Lopyrev and Percy Liang},
      booktitle={EMNLP},
      year={2016}
    }

## Further References

- [The Stanford Question Answering Dataset: Background, Challenges, Progress (blog post)](https://rajpurkar.github.io/mlx/qa-and-squad/)
- [Hugging Face Course -- Question Answering](https://huggingface.co/course/chapter7/7)