Possibility that Qwen2.5-72B-Instruct evaluation was not recorded correctly

#942
by SandInTheDunes - opened

Hello!
I noticed a big discrepancy in math results between base and instruct models. I don't have access to evaluation details page, but one user commented on Reddit: "The tool at open-llm-leaderboard probably wasn't checking the answer correctly. From the raw answers you can see that the model gave the correct results but they were all counted as invalid."

Could you please check?

Thank you!

If I'm not mistaken, a similar case occurred with Meta-Llama-3.1-70B-Instruct, which had a very low math score, but I see that it has since been fixed.

Open LLM Leaderboard org

Hi!

These instruct models have the same issue as the instruct Llama 3.1 models, which is that they don't follow the few-shot format.
(It's something that appears regularly in models that have been fine-tuned to follow specific instruction formats preferentially).

They therefore give a correct answer, but in a format that is not the one provided in the few-shot template (we use the minerva-math template), so their answers are counted as incorrect.
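To illustrate the failure mode described above, here is a minimal, hypothetical sketch of strict template-based answer extraction. The regex is loosely modeled on the minerva-math final-answer phrasing; the actual evaluation harness code differs, and the function names here are invented for illustration:

```python
import re

# Hypothetical pattern resembling the minerva-math few-shot answer format
# (assumption for illustration; not the harness's real extraction code).
FINAL_ANSWER_RE = re.compile(r"Final Answer: The final answer is \$(.+?)\$")

def score(model_output: str, gold: str) -> int:
    """Return 1 only if the answer is stated in the expected template format."""
    m = FINAL_ANSWER_RE.search(model_output)
    if m is None:
        return 0  # correct math in the wrong format still scores 0
    return int(m.group(1).strip() == gold)

# A model that ignores the few-shot format gets no credit:
free_form = "The result is 42."
templated = "Final Answer: The final answer is $42$. I hope it is correct."
print(score(free_form, "42"), score(templated, "42"))  # prints: 0 1
```

Under this kind of scorer, a model that answers correctly in free-form prose is marked wrong, which matches the discrepancy reported for the instruct models.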

We expect quality models to be able to follow a few-shot template format correctly, which is why we penalize this.

But the Llama-3.1 test score was updated; will you apply the same treatment to Qwen2.5? That would be fair. It would be the new #1 with a big lead! :)

And from a practical perspective, users would benefit. Although the answer was not in the precise format, the model gave a good answer, and that's what counts for real-world users.

Open LLM Leaderboard org
edited Sep 23

The Llama 3.1 test score was not updated for the above issue, and if you look at the details you will see that the instruct models get 0 for most answers (where they did not follow the correct format).

Although the answer was not in the precise format, the model gave a good answer, and that's what counts for real-world users.

Being able to answer following a few-shot format is actually very important for users: you want models to follow examples correctly, for instance to write data in exactly the format you provide as an example. So we'll keep this format choice for now (also because it indicates quite clearly when models overfit specific formats, which can make them useless for some tasks).

Thank you for the clarification!

clefourrier changed discussion status to closed
