wrong output?

#2
by ceyda - opened

BLEURT's output is said to be (approximately) between 0 and 1, but I also get negative scores, like: {'scores': [-0.9087899327278137, -0.6429446339607239]}
see: https://imgur.com/KIaqIPj
?

Evaluate Metric org

cc @sasha

Here are other examples of scores outside the range [0, 1]:

>>> from datasets import load_metric
>>> bleurt = load_metric("bleurt", module_type="metric", checkpoint="bleurt-large-512")

>>> bleurt.compute(references=["this is a test"], predictions=["this is a test"])
{'scores': [1.0118293762207031]}

>>> bleurt.compute(references=["this is a test"], predictions=["this is a boat"])
{'scores': [-1.3691496849060059]}

I think scores slightly above 1 or slightly below 0 are expected (see Interpreting BLEURT Scores), but a score of -1.4 seems like an error.
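
To make "slightly outside the range" concrete, here is a rough sketch that flags scores falling well outside [0, 1]. The helper name `out_of_range` and the tolerance of 0.1 are my own assumptions for illustration, not anything defined by BLEURT or the metric wrapper. The scores are the ones quoted in this thread.

```python
def out_of_range(scores, low=0.0, high=1.0, tolerance=0.1):
    """Return (index, score) pairs outside [low - tolerance, high + tolerance].

    The tolerance is an arbitrary, hypothetical cutoff for "slightly" outside
    the nominal range; adjust it to whatever you consider acceptable.
    """
    return [
        (i, s)
        for i, s in enumerate(scores)
        if s < low - tolerance or s > high + tolerance
    ]


# Scores reported in this thread.
scores = [
    1.0118293762207031,
    -1.3691496849060059,
    -0.9087899327278137,
    -0.6429446339607239,
]

flagged = out_of_range(scores)
print(flagged)
```

With this cutoff the 1.01 score passes, while all three negative scores are flagged, which is why they look suspicious to me.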
