Evaluating a parallel corpus of 55M sentences with Unbabel metrics

by dchaplinsky

Hi.

I built a combined, deduplicated corpus from the mediocre-quality parallel corpora found on opus-mt's website; it comes to 55M sentences.

I want to evaluate it with Unbabel's models and publish the data together with the scores.
I'm using the wmt23-cometkiwi-da-xxl model for that, but I also want to add other models. Which models should I use?
My initial plan was to run wmt23-cometkiwi-da-xl as well, but I also see that you have wmt22-cometkiwi-da, wmt23-cometkiwi-da-xl-marian and others.
Any guidance? Are those outdated, or still useful?
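For context, the scoring itself is just the standard reference-free COMET call -- roughly the sketch below (not my exact pipeline; batch size and GPU count are placeholders, and the checkpoint may be gated, so Hugging Face authentication may be needed):

```python
# Rough sketch of reference-free scoring with the unbabel-comet package.
# Batch size and GPU count are placeholders; tune them to your hardware.
from comet import download_model, load_from_checkpoint

model_path = download_model("Unbabel/wmt23-cometkiwi-da-xxl")
model = load_from_checkpoint(model_path)

# CometKiwi is a quality-estimation metric: each item needs only the
# source sentence and the translation, no reference.
data = [
    {"src": "source sentence 1", "mt": "translation 1"},
    {"src": "source sentence 2", "mt": "translation 2"},
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)        # per-segment scores
print(output.system_score)  # corpus-level average
```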

We also ran a correlation analysis on a couple of million sentences a while back to compare da-xxl and da-xl.

[Attached: six images from the da-xxl vs. da-xl correlation analysis]
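The comparison itself is straightforward -- something along these lines, assuming per-segment score arrays from both models (file names are made up):

```python
# Sketch: correlate per-segment da-xxl and da-xl scores on the same pairs.
# File names are placeholders.
import numpy as np
from scipy.stats import pearsonr, spearmanr

scores_xxl = np.load("scores_da_xxl.npy")
scores_xl = np.load("scores_da_xl.npy")

pearson_r, _ = pearsonr(scores_xxl, scores_xl)
spearman_rho, _ = spearmanr(scores_xxl, scores_xl)
print(f"Pearson:  {pearson_r:.4f}")
print(f"Spearman: {spearman_rho:.4f}")
```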

We also did a small-scale analysis of the correlation between expert scores and the scores given by Unbabel's models; I can share the results if you are interested.

Last but not least, I experimented with the matmul precision setting, comparing medium to highest on 2M samples with the XXL model (tl;dr: medium precision is safe to use and gives a ~3x inference speedup).
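For anyone who wants to try this: the setting in question is PyTorch's float32 matmul precision knob. A minimal sketch, assuming a standard PyTorch setup:

```python
# PyTorch's float32 matmul precision setting: "highest" keeps full fp32
# accuracy, while "medium" allows lower-precision (TF32/bf16-backed)
# kernels for float32 matmuls on recent GPUs, which is where the speedup
# comes from.
import torch

torch.set_float32_matmul_precision("medium")

# ... then run CometKiwi inference as usual.
```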

Unbabel org

Hi! Thanks for the info; looks interesting. It would be nice if you could share those additional results.

All the CometKiwi metrics you named are good options in the sense that they correlate quite well with human judgements.

That said, they are expected to correlate strongly with each other because they were trained with similar data and architectures -- especially 23-xl and 23-xxl (exactly the same data and architecture, just a different number of parameters). As such, the amount of extra information you will gain by running another metric is limited. In any case, I would recommend running the 22 version if you already have the 23-xxl scores.

You can also consider another of our metrics, xCOMET, which is more recent, as well as Google's MetricX. They are state-of-the-art and less similar to CometKiwi.

Hope this helps.

Hi. Thanks for the quick response.

We've been using xCOMET to evaluate the NMT quality of our models, but we cannot use it at the filtering stage, as we have no references, just a gigantic noisy dataset of mediocre quality (to put it mildly).

MetricX seems like a good find, thanks! Will do the 22 version as well.

My code for the dataset collection, deduplication, and evaluation is available in this repo.
Here is the comparison of the XXL results on two million sentences using different matmul precisions (medium and highest):

All statistics are calculated over the absolute difference between the two scores for the same sentence pair:

min: 0.000000
q1: 0.000028
median: 0.000059
q3: 0.000105
max: 0.027322
mean: 0.000081
iqr: 0.000077
variance: 0.000000
std: 0.000109
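For reference, the numbers above come from a comparison along these lines (a sketch only; array and file names are placeholders):

```python
# Sketch: summary statistics over absolute per-pair score differences
# between the medium- and highest-precision runs. File names are placeholders.
import numpy as np

scores_medium = np.load("scores_medium.npy")
scores_highest = np.load("scores_highest.npy")

diff = np.abs(scores_medium - scores_highest)
q1, median, q3 = np.percentile(diff, [25, 50, 75])

print(f"min: {diff.min():.6f}")
print(f"q1: {q1:.6f}")
print(f"median: {median:.6f}")
print(f"q3: {q3:.6f}")
print(f"max: {diff.max():.6f}")
print(f"mean: {diff.mean():.6f}")
print(f"iqr: {q3 - q1:.6f}")
print(f"variance: {diff.var():.6f}")
print(f"std: {diff.std():.6f}")
```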

Full histogram:

[score_differences_full.png]

Zoomed-in version without dominating bin:

[score_differences_filtered.png]

Unbabel org

Thanks for sharing this! Nice to know that scores are very similar with lower precision.

xCOMET can be used without a reference as well; you just need to leave the reference out of the inputs.
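Concretely, something like this (model name and batch size are just an example):

```python
# Sketch: xCOMET in reference-free (QE) mode -- just leave out the "ref" key.
# Model name and batch size are illustrative.
from comet import download_model, load_from_checkpoint

model = load_from_checkpoint(download_model("Unbabel/XCOMET-XL"))

data = [
    {"src": "source sentence", "mt": "machine translation"},  # no "ref" key
]

output = model.predict(data, batch_size=8, gpus=1)
print(output.scores)
```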
