Nice scores guys!
Comparison with ALMA-7B-R
https://forum.opennmt.net/t/llms-as-nmt-comparison-between-alma-7b-r-and-towerinstruct/5641
Cheers! We have noticed the same results internally: when it comes to neural metrics, both models are very competitive -- we see a slight edge for TowerInstruct -- but on lexical metrics (BLEU, chrF) there is a huge gap in performance between the two models (in favour of Tower).
Yes! Both ALMA-R and Tower include previous WMT test sets in their training data. The best test set to compare on is WMT23.
We will release the paper very soon with all those numbers there.
When I look deeper into TowerBlocks I don't see any of these WMT test sets. Please explain which field or subset I should look in.
And btw, for ALMA, the original paper says: "The training parallel data is sourced from the WMT'17 to WMT'20."
Alright, thanks for the clarification: ALMA sources its data from WMT 2017 to 2020, while we use WMT data from 2014 to 2022.
We did not use the full test sets --- we selected only a few high-quality translation samples from each test set. In TowerBlocks, they are under "general_mt_clean". We also released all translation records and their sources here: https://huggingface.co/datasets/Unbabel/TowerBlocks-v0.1-MT-records
IMO you need to be explicit in the model card that you trained on this dataset, because you only mention TowerBlocks, not TowerBlocks-MT-records.
To clarify, TowerBlocks includes TowerBlocks-MT-records. We only created the latter because some practitioners asked us for a dataset composed exclusively of the MT records in TowerBlocks.
Okay, then back to my initial question.
In TowerBlocks, if I select task="machine translation" and split="train", I get only the following "datasets": news21_docs_filtered, opus_doc_filtered, ted_talks_doc_filtered.
So where are WMT14-22 included?
Oh I see now! That can indeed be a bit confusing -- we will make that clearer in the model card.
In TowerBlocks, the "split" column refers to the origin of the data (whether it came from a training set or a development/test set), not to how we used it to build TowerInstruct. To get access to all the sentence-level MT data, disregard the "split" column and select only by task.
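To make the distinction concrete, here is a minimal sketch of that filtering logic in Python. The rows below are mock records I made up to illustrate the assumed schema ("task", "split", "dataset" columns); in practice you would load the real dataset with the Hugging Face `datasets` library and apply the same filters.

```python
# Mock rows standing in for TowerBlocks records (schema assumed for illustration;
# column names follow the discussion above, values are hypothetical).
mock_rows = [
    {"task": "machine translation", "split": "train", "dataset": "news21_docs_filtered"},
    {"task": "machine translation", "split": "dev",   "dataset": "general_mt_clean"},
    {"task": "machine translation", "split": "test",  "dataset": "general_mt_clean"},
    {"task": "paraphrase",          "split": "train", "dataset": "other_task_data"},
]

# Filtering on task AND split == "train" misses the WMT-derived records,
# because their "split" marks where the data originally came from (dev/test sets):
train_only = [r for r in mock_rows
              if r["task"] == "machine translation" and r["split"] == "train"]

# Disregarding "split" and filtering on task alone recovers all the
# sentence-level MT data, including the "general_mt_clean" subset:
all_mt = [r for r in mock_rows if r["task"] == "machine translation"]

print(len(train_only))  # 1 -- WMT samples excluded
print(len(all_mt))      # 3 -- WMT samples included
```

The same idea applies verbatim on the real dataset, e.g. `ds.filter(lambda r: r["task"] == "machine translation")` after `load_dataset(...)`.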