Possibility that Qwen2.5-72B-Instruct evaluation was not recorded correctly

#942
by SandInTheDunes - opened

Hello!
I noticed a big discrepancy in math results between base and instruct models. I don't have access to evaluation details page, but one user commented on Reddit: "The tool at open-llm-leaderboard probably wasn't checking the answer correctly. From the raw answers you can see that the model gave the correct results but they were all counted as invalid."

Could you please check?

Thank you!

If I'm not mistaken, a similar case occurred with Meta-Llama-3.1-70B-Instruct, which had a very low math score, but I see that it has since been fixed.

Open LLM Leaderboard org

Hi!

These instruct models have the same issue as the instruct Llama 3.1 models, which is that they don't follow the few-shot format.
(It's something that appears regularly in models that have been fine-tuned to follow specific instruction formats preferentially).

They therefore give a correct answer, but in a format that is not the one provided in the few-shot template (we use the minerva-math template), so their answers are counted as incorrect.
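To illustrate the failure mode described above, here is a minimal, hypothetical sketch of strict template-based answer extraction. The regex is loosely modeled on the minerva-math final-answer phrasing; the actual evaluation harness code differs, and the function names here are invented for illustration:

```python
import re

# Hypothetical pattern resembling the minerva-math few-shot answer format
# (assumption for illustration; not the harness's real extraction code).
FINAL_ANSWER_RE = re.compile(r"Final Answer: The final answer is \$(.+?)\$")

def score(model_output: str, gold: str) -> int:
    """Return 1 only if the answer is stated in the expected template format."""
    m = FINAL_ANSWER_RE.search(model_output)
    if m is None:
        return 0  # correct math in the wrong format still scores 0
    return int(m.group(1).strip() == gold)

# A model that ignores the few-shot format gets no credit:
free_form = "The result is 42."
templated = "Final Answer: The final answer is $42$. I hope it is correct."
print(score(free_form, "42"), score(templated, "42"))  # prints: 0 1
```

Under this kind of scorer, a model that answers correctly in free-form prose is marked wrong, which matches the discrepancy reported for the instruct models.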

We expect quality models to be able to follow a few-shot template format correctly, which is why we penalize this.

But the Llama-3.1 test score was updated; will you apply the same treatment to Qwen2.5? That would be fair. It would be the new #1 with a big lead! :)

And from a practical perspective, users would benefit. Although the answer was not in the precise format, the model gave a good answer, and that's what counts for real-world users.

Open LLM Leaderboard org
edited Sep 23

The Llama 3.1 test score was not updated for the above issue, and if you look at the details you will see that the instruct models get 0 for most answers (where they did not follow the correct format).

Although the answer was not in the precise format, the model gave a good answer, and that's what counts for real-world users.

Being able to answer following a few-shot format is actually very important for users: you want models to follow examples correctly, for instance to write data in exactly the format you provide as an example. So we'll keep this format choice for now (also because it indicates quite clearly when models overfit specific formats, which can make them useless for some tasks).

Thank you for the clarification!

clefourrier changed discussion status to closed
