Nice scores guys!

#8
by vince62s - opened
Unbabel org

Cheers! We have seen the same results internally. On neural metrics, both models are very competitive, with a slight edge for TowerInstruct. On lexical metrics (BLEU, chrF), however, there is a large gap between the two models, in Tower's favour.

Oops.
I just realized that TowerBlocks includes the WMT14-21 test sets...

Unbabel org

Yes! Both ALMA-R and Tower include previous WMT test sets. The best dataset to compare to is WMT23.
We will release the paper very soon, with all of those numbers included.

When I look deeper into TowerBlocks, I don't see any of these WMT test sets. Please explain which field or subset I should look in.

And btw, for ALMA, the original paper says: "The training parallel data is sourced from the WMT’17 to WMT’20. T"

Unbabel org

Alright, thanks for the clarification: ALMA sources from 2017 to 2020. We use WMT data from 2014 to 2022.
We did not use the full test sets; we selected only a few samples with high-quality translations from each test set. In TowerBlocks, they are under "general_mt_clean". We also released all translation records and their sources here: https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.1-MT-records

IMO you need to be specific in the model card and say that you trained on this dataset, because you only mention TowerBlocks, not TowerBlocks-MT-records.

Unbabel org

To clarify, TowerBlocks includes TowerBlocks-MT-records. We only created the latter because some practitioners asked us for a dataset composed exclusively of the MT records in TowerBlocks.

Okay, then back to my initial question.
In TowerBlocks, if I select task="machine translation" and split="train", I only get the following "datasets": news21_docs_filtered, opus_doc_filtered, ted_talks_doc_filtered.
So where are WMT14-22 included?

Unbabel org

Oh, I see now! That can indeed be a bit confusing; we will make it clearer in the model card.
TowerBlocks uses data from training and development (or test) sets. The "split" column records the origin of the data, not how we used it to build TowerInstruct. To get all the sentence-level MT data, disregard the "split" column and select only by task.
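
For anyone who wants to see the effect of the "split" filter, here is a minimal sketch on a few made-up rows. It is only an illustration: the column names ("task", "split", "dataset") and the task value follow the wording in this thread, the "dev" and "test" split values and the non-MT row are hypothetical, and nothing here is loaded from the actual dataset.

```python
from collections import Counter

# Task label as written in this thread; the exact string in the dataset
# may differ (assumption).
MT_TASK = "machine translation"

# Hypothetical rows standing in for TowerBlocks records. Only the dataset
# names mentioned in the thread are real; everything else is illustrative.
records = [
    {"task": MT_TASK, "split": "train", "dataset": "news21_docs_filtered"},
    {"task": MT_TASK, "split": "train", "dataset": "opus_doc_filtered"},
    {"task": MT_TASK, "split": "train", "dataset": "ted_talks_doc_filtered"},
    {"task": MT_TASK, "split": "dev", "dataset": "general_mt_clean"},
    {"task": MT_TASK, "split": "test", "dataset": "general_mt_clean"},
    {"task": "some other task", "split": "train", "dataset": "hypothetical_set"},
]


def select_by_task(rows, task, split=None):
    """Return rows for a task; filter on split only when one is given."""
    return [
        row for row in rows
        if row["task"] == task and (split is None or row["split"] == split)
    ]


# Filtering on task AND split="train" hides records that originate from
# development or test sets.
train_only = select_by_task(records, MT_TASK, split="train")
train_only_sources = Counter(row["dataset"] for row in train_only)

# Filtering on task alone keeps every MT record, whatever its origin.
all_mt = select_by_task(records, MT_TASK)
all_mt_sources = Counter(row["dataset"] for row in all_mt)

print("task + split=train:", sorted(train_only_sources))
print("task only:         ", sorted(all_mt_sources))
```

With the task-only filter, "general_mt_clean" shows up in the list of sources; with the extra split="train" filter it does not, which matches what was observed above.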
