Re: Machine Translation performance

#9
by bri25yu - opened

[Attached image: BLEURT machine translation results comparing Gemini Ultra and GPT-4]

Not 100% sure if this is a fair comparison. Also, it seems strange that they only report BLEURT scores as opposed to BLEU or chrF++ scores. Thoughts?

Also can somebody with more experience using BLEURT say something about how meaningful the difference in performance of 74.4 (Gemini Ultra, all languages) vs 73.8 (GPT-4, all languages) is?

@bri25yu

I tried asking perplexity.ai to interpret the results; take this with a grain of salt :D

In your case, the model with a BLEURT score of 74.4 is slightly better than the model with a score of 73.8 at generating translations that are fluent and convey the same meaning as the reference sentences. However, the difference is relatively small, and it might not be significant depending on the specific use case and the variability of the scores. It's also worth noting that while BLEURT is a useful tool for comparing different systems' performance on the same task, it's not perfect, and a high BLEURT score does not necessarily mean high-quality translations in all cases.

BLEURT is a very good metric, so it's not odd that it is included. It's also developed by Google. I agree that it'd be better if they also included another metric, preferably chrF for its language/tokenizer agnosticism, but only in addition to BLEURT, not as a replacement.

BLEU should be avoided. It's not a good metric (correlates poorly with human judgments). Just look at the title of the latest findings of the WMT shared task ;-) People are stubborn and keep using it, but there's seldom a good reason for it.
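For anyone who wants to poke at this on their own outputs, here's a minimal sketch of scoring the same hypotheses with chrF++ (via sacrebleu) and BLEURT (via the HuggingFace evaluate wrapper, which needs Google's BLEURT package installed separately). The hypothesis/reference sentences are made-up placeholders, and the BLEURT-20 checkpoint name is an assumption on my part:

```python
# Minimal sketch: score the same hypotheses with chrF++ and BLEURT.
# `hypotheses` and `references` are hypothetical placeholders.
from sacrebleu.metrics import CHRF

hypotheses = ["The cat sat on the mat .", "It rains heavily today ."]
references = ["The cat sat on the mat .", "It is raining heavily today ."]

# chrF++ (character n-grams plus word bigrams), largely tokenizer-agnostic.
chrf = CHRF(word_order=2)
print("chrF++:", chrf.corpus_score(hypotheses, [references]).score)

# BLEURT via the `evaluate` wrapper. Assumes `pip install evaluate` plus the
# BLEURT package from google-research/bleurt; checkpoint name is an assumption.
import evaluate
bleurt = evaluate.load("bleurt", "BLEURT-20")
scores = bleurt.compute(predictions=hypotheses, references=references)["scores"]
print("BLEURT (mean):", sum(scores) / len(scores))
```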

Unfortunately they did not do any significance testing, so it is hard to know how meaningful the relative differences between the systems are.
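If someone wants to run a quick check themselves, a standard approach is paired bootstrap resampling: resample the test set with replacement many times and count how often one system outscores the other. Here's a rough sketch using chrF++ as the metric; the system outputs and references are just toy placeholders:

```python
# Paired bootstrap resampling: estimate how often system A beats system B
# on resampled test sets. Minimal sketch with placeholder data.
import random
from sacrebleu.metrics import CHRF

chrf = CHRF(word_order=2)  # chrF++

def paired_bootstrap(sys_a, sys_b, refs, n_samples=1000, seed=42):
    """Return the fraction of resampled test sets where system A scores higher."""
    rng = random.Random(seed)
    n = len(refs)
    wins_a = 0
    for _ in range(n_samples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample indices with replacement
        a = [sys_a[i] for i in idx]
        b = [sys_b[i] for i in idx]
        r = [refs[i] for i in idx]
        if chrf.corpus_score(a, [r]).score > chrf.corpus_score(b, [r]).score:
            wins_a += 1
    return wins_a / n_samples

# Toy example (hypothetical data):
refs  = ["The cat sat on the mat .", "It is raining heavily today ."]
sys_a = ["The cat sat on the mat .", "It rains heavily today ."]
sys_b = ["A cat is on the mat .",   "It is raining a lot today ."]
print("P(system A > system B) ~", paired_bootstrap(sys_a, sys_b, refs))
```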

Gotcha, thanks for the explanation Bram!
