---
title: Anls
emoji: 📚
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 3.17.0
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  This metric wraps the official scoring script for version 1 of the Average
  Normalized Levenshtein Similarity (ANLS).
---

# Metric Card for ANLS

## Metric description

This metric wraps the official scoring script for version 1 of the Average Normalized Levenshtein Similarity (ANLS).

The ANLS smoothly captures OCR mistakes by applying a slight penalization when the intended response is correct but badly recognized. It also uses a threshold of 0.5: if the normalized similarity between the prediction and a ground truth answer is greater than or equal to 0.5, that similarity is used as the score; otherwise the score is 0. The point of this threshold is to determine whether the answer was correctly selected but not properly recognized, or whether, on the contrary, the output is a wrong text selected from the options and given as an answer.

More formally, the ANLS between the network output and the ground truth answers is given by equation 1, where N is the total number of questions, M is the total number of ground truth answers per question, a_ij are the ground truth answers (with i = {0, ..., N} and j = {0, ..., M}), and o_qi is the network's answer for the i-th question q_i.

![ANLS formula](https://rrc.cvc.uab.es/files/ANLS.png)

Reference: [Evaluation Metric](https://rrc.cvc.uab.es/?ch=11&com=tasks)

## How to use

The metric takes two lists of question-answer dictionaries as inputs: one with the predictions of the model and the other with the references to be compared to.

_predictions_: List of question-answer dictionaries with the following key-values:
- 'question_id': id of the question-answer pair as given in the references (see below)
- 'prediction_text': the text of the answer

_references_: List of question-answer dictionaries with the following key-values:
- 'question_id': id of the question-answer pair (see above)
- 'answers': list of possible texts for the answer, as a list of strings

```python
from evaluate import load
anls_metric = load("anls")
results = anls_metric.compute(predictions=predictions, references=references)
```

## Output values

This metric outputs a dictionary with a single key, 'anls_score', whose value lies between 0.0 and 1.0.

```
{'anls_score': 1.0}
```

## Examples

```python
from evaluate import load
anls_metric = load("anls")
predictions = [{'question_id': '10285', 'prediction_text': 'Denver Broncos'},
               {'question_id': '18601', 'prediction_text': '12/15/89'},
               {'question_id': '16734', 'prediction_text': 'Dear dr. Lobo'}]
references = [{'answers': ['Denver Broncos', 'Denver R. Broncos'], 'question_id': '10285'},
              {'answers': ['12/15/88'], 'question_id': '18601'},
              {'answers': ['Dear Dr. Lobo', 'Dr. Lobo'], 'question_id': '16734'}]
results = anls_metric.compute(predictions=predictions, references=references)
results
{'anls_score': 0.9583333333333334}
```

## Limitations and bias

This metric works only with datasets that follow the format specified above.

## Considerations / Assumptions

As specified on the challenge website: [Tasks - Document Visual Question Answering](https://rrc.cvc.uab.es/?ch=17&com=tasks)

- Answers are not case sensitive
- Answers are space sensitive
- Answers or tokens comprising answers are not limited to a fixed size dictionary. It could be any word/token which is present in the document.
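To make the scoring rule concrete, the following is a minimal, illustrative sketch of how an ANLS score can be computed for the input format described above. It is not the wrapped official scoring script: the function names (`levenshtein`, `similarity`, `anls`) are made up for this example, and it assumes the usual ANLS conventions of case-insensitive comparison, normalization of the edit distance by the length of the longer string, and the 0.5 threshold described in the metric description.

```python
# Illustrative ANLS computation; not the official scoring script wrapped by this metric.

def levenshtein(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance (insertions, deletions, substitutions)."""
    if len(s) < len(t):
        s, t = t, s
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (cs != ct)))  # substitution
        previous = current
    return previous[-1]


def similarity(prediction: str, answer: str, threshold: float = 0.5) -> float:
    """1 - normalized Levenshtein distance, set to 0 when the distance reaches the threshold."""
    prediction, answer = prediction.strip().lower(), answer.strip().lower()
    max_len = max(len(prediction), len(answer))
    if max_len == 0:
        return 1.0
    nl = levenshtein(prediction, answer) / max_len
    return 1.0 - nl if nl < threshold else 0.0


def anls(predictions: list, references: list) -> float:
    """Average, over questions, of the best similarity against any ground truth answer."""
    answers = {ref["question_id"]: ref["answers"] for ref in references}
    scores = [
        max(similarity(pred["prediction_text"], gt) for gt in answers[pred["question_id"]])
        for pred in predictions
    ]
    return sum(scores) / len(scores)
```

Applied to the example above, the best-match similarities are 1.0 for 'Denver Broncos', 0.875 for '12/15/89' against '12/15/88' (one substitution over eight characters), and 1.0 for 'Dear dr. Lobo' (case-insensitive match), giving an ANLS of roughly 0.958.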
## Citation

@article{levenshtein1966binary,
    title = {Binary codes capable of correcting deletions, insertions, and reversals},
    journal = {Soviet Physics Doklady},
    volume = {10},
    number = {8},
    pages = {707--710},
    year = {1966},
    url = {https://nymity.ch/sybilhunting/pdf/Levenshtein1966a.pdf},
    author = {V. I. Levenshtein}
}