The GPT-4 comparison is a bit premature (for now?)

by cmp-nct - opened

I really like your work; it's a clean approach to translation, and possibly the best open approach to translation we have.
However, the comparison to GPT-4 (preview) level is not accurate yet.

I've been struggling with English->German translations for a while and have tried a ton of approaches.
This new LoRA appears slightly better than the previous one in my comparisons, but it still makes grammatical errors where GPT-4 is flawless.

I'm getting very good results when giving the model only short sentences, comparable to GPT-4 in those cases.
But when you give it a paragraph, the grammar starts to degrade.
With GPT-4, by comparison, the quality improves further with longer input.

Switching to separate short sentences is not an option, as translation needs context when one word's meaning depends on a previous sentence.

Maybe there is additional tuning possible for paragraphs?

Hi @cmp-nct ! Thank you for your interest in our work and for testing our model so quickly! Your feedback provides valuable insights for our future work in exploring better methods for translating from English to German.

In our comparison with GPT-4-1106-preview, we considered not only translation into German but also other language pairs and directions, such as Chinese and Icelandic. The performance we reported is an average across 10 different language directions and three different metrics. However, it's true that specifically for English to German translation, GPT-4 still performs slightly better (see Table 2: GPT-4 84.91 KIWI-XXL vs. ALMA-13B-R 84.25).

Hi @haoranxu
I enjoyed your paper too. This is not meant as criticism, because I think the methodology is good, but I posted this here a couple of days ago: https://www.linkedin.com/posts/vincentnguyenngoc_microsoft-released-a-new-paper-with-llm-as-activity-7153391325722152960-HeCJ?utm_source=share&utm_medium=member_desktop
Feel free to comment, but my main questions are: 1) how do you explain the -10 BLEU for EN-DE vs. SotA systems (I have played a lot with EN-DE)? and 2) did you do a human eval of your work?
Lastly, did you try to apply CPO with an encoder-decoder architecture?

Hi @vince62s I just replied to your post before I saw this message!

how do you explain the -10 BLEU for EN-DE vs. SotA systems

For the first question, I replied under the post, but I'll copy it here: I also noticed the BLEU drop alongside improvements in the other metrics. As a Chinese speaker, I examined the test set for en->zh and found that ALMA-R provides more fluent and natural translations. I assume the BLEU drop is due to a domain mismatch with the WMT topics.

A good example from WMT’22 for en->zh:
src: I'm sorry that your order is running late.
tgt: 很抱歉,您点的餐可能会晚到一会。 ("Sorry, the food you ordered may arrive a little late.")
ALMA-R: 对不起,您的订单延误了。 ("Sorry, your order has been delayed.")

The BLEU score is very low because the lexical overlap is almost zero, but I feel ALMA-R is slightly better (because "订单" is more suitable for "order" here), and the two translations have basically the same meaning. I often find that WMT winners have nonsensically high BLEU scores, like over 64.2 for cs->en. So I guess the WMT winner for cs->en just learned the WMT'22 domain very well.
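To make the lexical-overlap point concrete, here is a minimal sketch (assuming sacrebleu is installed) that scores the ALMA-R output against the WMT reference; the two share almost no character n-grams, so sentence-level BLEU comes out near zero despite both being valid translations:

```python
# Minimal sketch, assuming sacrebleu is available: reproduce the
# near-zero BLEU for the en->zh example above.
from sacrebleu.metrics import BLEU

ref = "很抱歉,您点的餐可能会晚到一会。"  # WMT reference
hyp = "对不起,您的订单延误了。"          # ALMA-R output

# The "zh" tokenizer splits Chinese text into characters before n-gram
# matching; effective_order keeps short sentences from scoring exactly 0
# when a higher-order n-gram has no match at all.
bleu = BLEU(tokenize="zh", effective_order=True)
print(bleu.sentence_score(hyp, [ref]))  # near-zero BLEU, same meaning
```

The only overlapping characters here are 您 and 的, so almost every n-gram misses the reference, which is exactly how a fluent translation can be punished by BLEU.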

did you do a human eval of your work?

Regarding the human evaluation, we are doing our best to carry one out as a next step.

Lastly, did you try to apply CPO with an encoder-decoder architecture?

No, we haven't done so yet, but it will be interesting to explore! :)

FYI, I am one of the maintainers of OpenNMT-py and have been working on NMT for a long, long time.
I was about to implement DPO in the repo when I saw your paper, so it seems easier to implement CPO instead, and I will do it (see the sketch after this post).
Now for the results: maybe it worked fine for EN-ZH, but honestly I think for EN-DE it overfitted the metric (COMET). -10 BLEU is HUGE at the system level. I can understand this happening at the sentence level, but for a whole system I don't think so. I will be very interested to see the human eval. Anyhow, nice paper, congrats.
