---
title: Exact Match
emoji: 🤗
colorFrom: blue
colorTo: red
sdk: gradio
sdk_version: 3.19.1
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  Returns the rate at which the input predicted strings exactly match their references, ignoring any strings input as part of the regexes_to_ignore list.
---

# Metric Card for Exact Match

## Metric Description
A given predicted string's exact match score is 1 if it is identical to its reference string, and 0 otherwise.

- **Example 1**: The exact match score of prediction "Happy Birthday!" is 0, given its reference is "Happy New Year!".
- **Example 2**: The exact match score of prediction "The Colour of Magic (1983)" is 1, given its reference is also "The Colour of Magic (1983)".

The exact match score of a set of predictions is the sum of all of the individual exact match scores in the set, divided by the total number of predictions in the set.

- **Example**: The exact match score of the set {Example 1, Example 2} (above) is 0.5.

## How to Use
At minimum, this metric takes predictions and references as input:

```python
>>> from evaluate import load
>>> exact_match_metric = load("exact_match")
>>> predictions = ["cat?", "theater"]
>>> references = ["the cat", "theater"]
>>> results = exact_match_metric.compute(predictions=predictions, references=references)
```

### Inputs
- **`predictions`** (`list` of `str`): List of predicted texts.
- **`references`** (`list` of `str`): List of reference texts.
- **`regexes_to_ignore`** (`list` of `str`): Regular expressions matching characters to ignore when calculating exact matches. Defaults to `None`. Note: the regexes are applied before capitalization is normalized.
- **`ignore_case`** (`bool`): If `True`, lowercases all text so that capitalization differences are ignored. Defaults to `False`.
- **`ignore_punctuation`** (`bool`): If `True`, removes punctuation before comparing strings. Defaults to `False`.
- **`ignore_numbers`** (`bool`): If `True`, removes all digits before comparing strings. Defaults to `False`.
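
The scoring rule and input options above can be summarized in a short pure-Python sketch. This is a hypothetical re-implementation for illustration only, not the `evaluate` library's code. The function name `exact_match_sketch` is made up. Punctuation removal is assumed to use `string.punctuation` and digit removal `string.digits`. Punctuation and digit stripping are also assumed to run after the regex and case steps, since this card only states that regexes run before case normalization.

```python
import re
import string


def exact_match_sketch(predictions, references, regexes_to_ignore=None,
                       ignore_case=False, ignore_punctuation=False,
                       ignore_numbers=False):
    """Hypothetical sketch of exact match scoring; not the library's code."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one prediction/reference pair is required")

    # Translation tables that delete characters (assumed character sets).
    drop_punct = str.maketrans("", "", string.punctuation)
    drop_digits = str.maketrans("", "", string.digits)

    def normalize(text):
        # 1. Delete every match of each regex, in list order, on the
        #    original-case text.
        for pattern in regexes_to_ignore or []:
            text = re.sub(pattern, "", text)
        # 2. Lowercase only after the regexes have run.
        if ignore_case:
            text = text.lower()
        # 3. Remove ASCII punctuation (assumed: string.punctuation).
        if ignore_punctuation:
            text = text.translate(drop_punct)
        # 4. Remove ASCII digits 0-9 (assumed: string.digits).
        if ignore_numbers:
            text = text.translate(drop_digits)
        return text

    # Each pair scores 1 if the normalized strings are identical, else 0.
    scores = [int(normalize(p) == normalize(r))
              for p, r in zip(predictions, references)]
    # Set score: sum of pair scores divided by the number of pairs.
    return {"exact_match": sum(scores) / len(scores)}


refs = ["the cat", "theater", "YELLING", "agent007"]
preds = ["cat?", "theater", "yelling", "agent"]
print(exact_match_sketch(preds, refs))  # {'exact_match': 0.25}
print(exact_match_sketch(preds, refs, regexes_to_ignore=["the ", "yell", "YELL"],
                         ignore_case=True, ignore_punctuation=True,
                         ignore_numbers=True))  # {'exact_match': 1.0}
```

On the example strings used elsewhere in this card, the sketch yields 0.25, 0.5, 0.75 and 1.0 as options are added. This offers a quick sanity check of the assumed step order, though the library itself remains the authority.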
### Output Values
This metric outputs a dictionary with one value: the average exact match score.

```python
{'exact_match': 1.0}
```

This metric's range is 0-1, inclusive. Here, 0.0 means no prediction/reference pairs matched, while 1.0 means they all did.

#### Values from Popular Papers
The exact match metric is often included as a component of other metrics, such as the SQuAD metric. For example, the [original SQuAD paper](https://nlp.stanford.edu/pubs/rajpurkar2016squad.pdf) reported an Exact Match score of 40.0%. The paper also reports a human-performance Exact Match score of 80.3% on the dataset.

### Examples
Without including any regexes to ignore:

```python
>>> import evaluate
>>> exact_match = evaluate.load("exact_match")
>>> refs = ["the cat", "theater", "YELLING", "agent007"]
>>> preds = ["cat?", "theater", "yelling", "agent"]
>>> results = exact_match.compute(references=refs, predictions=preds)
>>> print(round(results["exact_match"], 2))
0.25
```

Ignoring regexes "the" and "yell", as well as ignoring case and punctuation:

```python
>>> exact_match = evaluate.load("exact_match")
>>> refs = ["the cat", "theater", "YELLING", "agent007"]
>>> preds = ["cat?", "theater", "yelling", "agent"]
>>> results = exact_match.compute(references=refs, predictions=preds, regexes_to_ignore=["the ", "yell"], ignore_case=True, ignore_punctuation=True)
>>> print(round(results["exact_match"], 2))
0.5
```

Note that in the example above, the regexes are applied before case is normalized. As a result, "yell" does not match "YELLING", so nothing is deleted from that reference.
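
To see why the note above holds, the hypothetical helpers below trace the "YELLING"/"yelling" pair through two possible orderings. They assume each entry of `regexes_to_ignore` is applied with Python's `re.sub`, which is an assumption about the implementation. The names `regex_then_lower` and `lower_then_regex` are illustrative only.

```python
import re


def regex_then_lower(text, patterns):
    # Order described in this card: delete regex matches on the
    # original-case text, then lowercase.
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text.lower()


def lower_then_regex(text, patterns):
    # Hypothetical opposite order, shown only for contrast.
    text = text.lower()
    for pattern in patterns:
        text = re.sub(pattern, "", text)
    return text


patterns = ["the ", "yell"]
reference, prediction = "YELLING", "yelling"

# Card's order: "yell" never matches "YELLING", so the strings differ.
print(regex_then_lower(reference, patterns), regex_then_lower(prediction, patterns))
# prints: yelling ing
# Contrast: lowercasing first would make both sides "ing".
print(lower_then_regex(reference, patterns), lower_then_regex(prediction, patterns))
# prints: ing ing
```

Because the regex step sees the original casing, adding an uppercase pattern such as `"YELL"` is one way to strip the uppercase form from the reference as well.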

Ignoring "the", "yell", and "YELL", as well as ignoring case and punctuation:

```python
>>> exact_match = evaluate.load("exact_match")
>>> refs = ["the cat", "theater", "YELLING", "agent007"]
>>> preds = ["cat?", "theater", "yelling", "agent"]
>>> results = exact_match.compute(references=refs, predictions=preds, regexes_to_ignore=["the ", "yell", "YELL"], ignore_case=True, ignore_punctuation=True)
>>> print(round(results["exact_match"], 2))
0.75
```

Ignoring "the", "yell", and "YELL", as well as ignoring case, punctuation, and numbers:

```python
>>> exact_match = evaluate.load("exact_match")
>>> refs = ["the cat", "theater", "YELLING", "agent007"]
>>> preds = ["cat?", "theater", "yelling", "agent"]
>>> results = exact_match.compute(references=refs, predictions=preds, regexes_to_ignore=["the ", "yell", "YELL"], ignore_case=True, ignore_punctuation=True, ignore_numbers=True)
>>> print(round(results["exact_match"], 2))
1.0
```

An example that includes sentences:

```python
>>> exact_match = evaluate.load("exact_match")
>>> refs = ["The cat sat on the mat.", "Theaters are great.", "It's like comparing oranges and apples."]
>>> preds = ["The cat sat on the mat?", "Theaters are great.", "It's like comparing apples and oranges."]
>>> results = exact_match.compute(references=refs, predictions=preds)
>>> print(round(results["exact_match"], 2))
0.33
```

## Limitations and Bias
This metric gives a completely wrong prediction the same score as one that is correct except for a single character. In other words, there is no credit for being *almost* right.

## Citation

## Further References
- Also used in the [SQuAD metric](https://github.com/huggingface/datasets/tree/master/metrics/squad)