Differences between this metric and SARI from the EASSE library

#2 opened by annadmitrieva

Hi everyone, I noticed that this metric outputs different scores from the EASSE corpus_sari function. For example, this:

from easse.sari import corpus_sari
c = corpus_sari(orig_sents=["About 95 species are currently accepted.", "The cat perched on the mat."],
                sys_sents=["About 95 you now get in.", "Cat on mat."],
                refs_sents=[["About 95 species are currently known.", "The cat sat on the mat."],
                            ["About 95 species are now accepted.", "The cat is on the mat."],
                            ["95 species are now accepted.", "The cat sat."]])

gives a SARI of 33.17472563619544, whereas this:

import evaluate
sari = evaluate.load("sari")
sources = ["About 95 species are currently accepted.", "The cat perched on the mat."]
predictions = ["About 95 you now get in.", "Cat on mat."]
references = [["About 95 species are currently known.", "About 95 species are now accepted.", "95 species are now accepted."],
              ["The cat sat on the mat.", "The cat is on the mat.", "The cat sat."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)

returns a SARI of 29.13266517433184.
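
For clarity, both calls are fed the same data; only the reference layout differs. EASSE groups refs_sents by reference set, while evaluate groups references by source sentence, so one layout is just the transpose of the other. A minimal sketch of the conversion, using the refs_sents list from above:

refs_sents = [["About 95 species are currently known.", "The cat sat on the mat."],
              ["About 95 species are now accepted.", "The cat is on the mat."],
              ["95 species are now accepted.", "The cat sat."]]
# Transpose the EASSE layout (n_refs x n_sentences) into the evaluate
# layout (one list of references per source sentence).
references = [list(refs) for refs in zip(*refs_sents)]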

However, if the source and the output sentences are identical, the evaluate SARI is much higher than the EASSE SARI. Here:

from easse.sari import corpus_sari
c = corpus_sari(orig_sents=["About 95 species are currently accepted."],
                sys_sents=["About 95 species are currently accepted."],
                refs_sents=[["About 95 species are currently known."],
                            ["About 95 species are now accepted."],
                            ["95 species are now accepted."]])

I got a SARI of 21.873217526575058, and here:

import evaluate
sari = evaluate.load("sari")
sources = ["About 95 species are currently accepted."]
predictions = ["About 95 species are currently accepted."]
references = [["About 95 species are currently known.", "About 95 species are now accepted.", "95 species are now accepted."]]
sari_score = sari.compute(sources=sources, predictions=predictions, references=references)

I got a SARI of 55.20655085990839.

Am I doing something wrong? Why are the scores so different?

Hi, we noticed the same problem. Are there any updates on this, or an explanation?
Thanks

Hi, this implementation is adapted from TensorFlow's tensor2tensor implementation. It has two differences from the original GitHub implementation:

  1. It defines 0/0=1 instead of 0 to give higher scores for predictions that match a target exactly.
  2. It fixes an alleged bug in the keep score computation.

This is also mentioned in the original documentation, and it is likely the reason for the difference.
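
To make the first point concrete, here is a minimal sketch of how the 0/0 convention changes the precision/recall-style ratios SARI computes for its keep/add/delete n-gram operations. This is not the actual library code; safe_div is a hypothetical helper used only for illustration:

def safe_div(numerator, denominator, zero_over_zero=0.0):
    # The original SARI script defines 0/0 as 0, while the
    # tensor2tensor-derived version defines it as 1, which gives
    # higher scores to predictions that match a target exactly.
    if denominator == 0:
        return zero_over_zero
    return numerator / denominator

safe_div(0, 0, zero_over_zero=0.0)  # original convention -> 0.0
safe_div(0, 0, zero_over_zero=1.0)  # tensor2tensor / evaluate convention -> 1.0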

@ddhruvkr thank you for the reply! Have you run any tests to check that this version of SARI still correlates well with human judgments of simplicity? I'd test it myself, but the most suitable dataset for this task would be Newsela, which I don't currently have access to.
