---
title: Sari Metric
emoji: 🐠
colorFrom: pink
colorTo: indigo
sdk: gradio
sdk_version: 3.28.3
app_file: app.py
pinned: false
tags:
- evaluate
- metric
description: >-
  SARI is a metric used for evaluating automatic text simplification systems.
  The metric compares the predicted simplified sentences against the reference
  and the source sentences. It explicitly measures the goodness of words that are
  added, deleted and kept by the system.
  Sari = (F1_add + F1_keep + P_del) / 3
  where
  F1_add: n-gram F1 score for add operation
  F1_keep: n-gram F1 score for keep operation
  P_del: n-gram precision score for delete operation
  n = 4, as in the original paper.

  This implementation is adapted from Tensorflow's tensor2tensor implementation [3].
  It has two differences with the original GitHub [1] implementation:
    (1) Defines 0/0=1 instead of 0 to give higher scores for predictions that match
        a target exactly.
    (2) Fixes an alleged bug [2] in the keep score computation.
  [1] https://github.com/cocoxu/simplification/blob/master/SARI.py
      (commit 0210f15)
  [2] https://github.com/cocoxu/simplification/issues/6
  [3] https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py
---

# Metric Card for SARI

## Metric description

SARI (***s**ystem output **a**gainst **r**eferences and against the **i**nput sentence*) is a metric used for evaluating automatic text simplification systems.

The metric compares the predicted simplified sentences against the reference and the source sentences. It explicitly measures the goodness of words that are added, deleted and kept by the system.

SARI can be computed as:

`sari = (F1_add + F1_keep + P_del) / 3`

where

- `F1_add` is the n-gram F1 score for add operations
- `F1_keep` is the n-gram F1 score for keep operations
- `P_del` is the n-gram precision score for delete operations

The n-gram order, `n`, is equal to 4, as in the original paper.

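
As a quick check of the formula above, the following minimal sketch averages the three component scores that this card reports for its partial-match example. The helper `combine_sari` is a hypothetical name used only for illustration; it is not part of the metric's API, and it assumes the components are already on the 0 to 100 scale used in the metric's output.

```python
import math


def combine_sari(f1_add: float, f1_keep: float, p_del: float) -> float:
    """Average the add, keep and delete scores (hypothetical helper for illustration)."""
    return (f1_add + f1_keep + p_del) / 3


# Component scores copied from the partial-match example output in this card.
components = {"add": 8.333333333333332, "keep": 22.527472527472526, "del": 50.0}

score = combine_sari(components["add"], components["keep"], components["del"])
print(round(score, 4))  # 26.9536

# The result agrees with the 'sari' value reported alongside those components.
assert math.isclose(score, 26.953601953601954, rel_tol=1e-9)
```

The three operations are weighted equally, so a system that deletes well but adds poorly can still land in the middle of the range.
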
This implementation is adapted from [TensorFlow's tensor2tensor implementation](https://github.com/tensorflow/tensor2tensor/blob/master/tensor2tensor/utils/sari_hook.py). It differs from the [original GitHub implementation](https://github.com/cocoxu/simplification/blob/master/SARI.py) in two ways:

1) It defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly.
2) It fixes an [alleged bug](https://github.com/cocoxu/simplification/issues/6) in the keep score computation.

## How to use

The metric takes 3 inputs: `sources` (a list of source sentence strings), `predictions` (a list of predicted sentence strings) and `references` (a list of lists of reference sentence strings).

```python
from evaluate import load

sari = load("hxw15/sari_metric")

# One source sentence, one system prediction, and three references for it.
sources = ["About 95 species are currently accepted."]
predictions = ["About 95 you now get in."]
references = [["About 95 species are currently known.",
               "About 95 species are now accepted.",
               "95 species are now accepted."]]

results = sari.compute(sources=sources, predictions=predictions, references=references)
```

## Output values

This metric outputs a dictionary with the overall SARI score and the three component scores (`keep`, `del` and `add`):

```
print(results)
{'sari': 26.953601953601954, 'keep': 22.527472527472526, 'del': 50.0, 'add': 8.333333333333332}
```

SARI scores range from 0 to 100. The higher the value, the better the performance of the model being evaluated, with a SARI of 100 being a perfect score.

### Values from popular papers

The [original paper that proposes the SARI metric](https://aclanthology.org/Q16-1029.pdf) reports scores ranging from 26 to 43 for different simplification systems and different datasets. The authors also find that the metric ranks all of the simplification systems and human references in the same order as the human assessment used as a comparison, and that it correlates reasonably with human judgments.

More recent SARI scores for text simplification can be found on leaderboards for datasets such as [TurkCorpus](https://paperswithcode.com/sota/text-simplification-on-turkcorpus) and [Newsela](https://paperswithcode.com/sota/text-simplification-on-newsela).

## Examples

Perfect match between prediction and reference:

```python
from evaluate import load

sari = load("hxw15/sari_metric")

sources = ["About 95 species are currently accepted ."]
predictions = ["About 95 species are currently accepted ."]
references = [["About 95 species are currently accepted ."]]

results = sari.compute(sources=sources, predictions=predictions, references=references)
print(results)
# {'sari': 100.0, 'keep': 100.0, 'del': 100.0, 'add': 100.0}
```

Partial match between prediction and reference:

```python
from evaluate import load

sari = load("hxw15/sari_metric")

sources = ["About 95 species are currently accepted ."]
predictions = ["About 95 you now get in ."]
references = [["About 95 species are currently known .",
               "About 95 species are now accepted .",
               "95 species are now accepted ."]]

results = sari.compute(sources=sources, predictions=predictions, references=references)
print(results)
# {'sari': 26.953601953601954, 'keep': 22.527472527472526, 'del': 50.0, 'add': 8.333333333333332}
```

## Limitations and bias

SARI is a valuable measure for comparing different text simplification systems, and it can also assist the iterative development of a single system.

However, while the [original paper presenting SARI](https://aclanthology.org/Q16-1029.pdf) states that it captures "the notion of grammaticality and meaning preservation", this claim is difficult to validate empirically.

## Citation

```bibtex
@article{xu-etal-2016-optimizing,
    title = {Optimizing Statistical Machine Translation for Text Simplification},
    author = {Xu, Wei and Napoles, Courtney and Pavlick, Ellie and Chen, Quanze and Callison-Burch, Chris},
    journal = {Transactions of the Association for Computational Linguistics},
    volume = {4},
    year = {2016},
    url = {https://www.aclweb.org/anthology/Q16-1029},
    pages = {401--415},
}
```

## Further References

- [NLP Progress -- Text Simplification](http://nlpprogress.com/english/simplification.html)
- [Hugging Face Hub -- Text Simplification Models](https://huggingface.co/datasets?filter=task_ids:text-simplification)