evaluate-metric/bleurt · wrong output?

Jun 8, 2022

BLEURT's output is said to be approximately between 0 and 1, but I also get negative scores, for example: {'scores': [-0.9087899327278137, -0.6429446339607239]}
See: https://imgur.com/KIaqIPj
Is this expected?

lvwerra

Evaluate Metric org Jun 8, 2022

cc @sasha

johngiorgi

Jun 18, 2022

edited Jun 21, 2022

Here are other examples of scores outside the range [0, 1]:

from datasets import load_metric

bleurt = load_metric("bleurt", module_type="metric", checkpoint="bleurt-large-512")

>>> bleurt.compute(references=["this is a test"], predictions=["this is a test"])
{'scores': [1.0118293762207031]}

>>> bleurt.compute(references=["this is a test"], predictions=["this is a boat"])
{'scores': [-1.3691496849060059]}

I think scores slightly above 1 or slightly below 0 are expected (see Interpreting BLEURT Scores), but a score of -1.4 looks like an error.
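
In case it helps anyone check their own outputs, here is a minimal sketch that flags scores falling outside [0, 1] by more than a chosen tolerance. The helper name `flag_out_of_range` and the default tolerance of 0.05 are my own assumptions for illustration; they are not part of BLEURT or `datasets`, and the BLEURT docs do not define a tolerance.

```python
def flag_out_of_range(scores, low=0.0, high=1.0, tol=0.05):
    """Return (index, score) pairs that lie outside [low - tol, high + tol].

    `tol` is an arbitrary illustrative margin, not a value defined by BLEURT.
    """
    return [
        (i, s)
        for i, s in enumerate(scores)
        if s < low - tol or s > high + tol
    ]


# Scores copied from the two examples above.
scores = [1.0118293762207031, -1.3691496849060059]
print(flag_out_of_range(scores))
```

With this tolerance, the score of about 1.01 passes as "slightly above 1", while the score of about -1.37 is flagged.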