Llama-3.1 70B MATH Hard score doesn't match its details dataset

#1041
by MaziyarPanahi - opened

Hi @alozowski
A quick question: in the UI, Llama-3.1 70B Instruct has a 0.31 raw MATH Hard score:

[screenshot: leaderboard UI showing the MATH Hard score]

But looking at the details dataset, the score is much lower:

[screenshots: per-sample results from the details dataset]

Reference: https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details

Open LLM Leaderboard org

cc @SaylorTwift, when you updated the results, did you update the details too?
@MaziyarPanahi we had identified an issue in the MATH parsing (thanks to Meta), so all scores were updated a while back (I'd say 1 or 2 months ago now?). Maybe the details were not updated at the same time, since the generations stayed the same and we just needed to recompute the answer extraction and the average.
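To illustrate what "recompute the answer extraction and average" means here, this is a minimal, hypothetical sketch (not the leaderboard's actual code): the stored generations are kept as-is, a fixed extraction function re-parses each one for the final boxed answer, and the score is just the mean of exact matches against the gold answers. Field names like `generation` and `gold` are assumptions for illustration.

```python
import re

def extract_boxed_answer(generation: str):
    """Pull the contents of the last \\boxed{...} from a stored generation.
    A stand-in for the fixed MATH parsing step; returns None if no box is found."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", generation)
    return matches[-1] if matches else None

def recompute_average(details):
    """Re-run extraction over the unchanged generations and average exact matches,
    without regenerating anything."""
    correct = 0
    for sample in details:
        predicted = extract_boxed_answer(sample["generation"])
        correct += int(predicted is not None and predicted.strip() == sample["gold"].strip())
    return correct / len(details)

# Toy details records (illustrative, not real leaderboard data)
details = [
    {"generation": "... so the answer is \\boxed{42}", "gold": "42"},
    {"generation": "therefore \\boxed{7}", "gold": "8"},
]
print(recompute_average(details))  # 0.5
```

With this setup, a parsing fix only changes `extract_boxed_answer`; rerunning `recompute_average` over the saved generations updates the aggregate score, which is why the results files could be refreshed without touching the details.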

Thanks @clefourrier, it makes sense now. Appreciate the response.

Open LLM Leaderboard org

Hi! Yes, that's right. The details were overlooked when updating the results; only the results files that you can find in the results repo were updated, so that they could be displayed in the leaderboard. Sorry for the confusion!

clefourrier changed discussion status to closed
