Llama-3.1 70B MATH Hard score doesn't match its details dataset
Hi @alozowski,
A quick question: in the UI, Llama-3.1 70B Instruct has a raw MATH Hard score of 0.31, but looking into the details dataset, the score is much lower.
References: https://huggingface.co/datasets/open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details
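In case it helps anyone reproduce the comparison, here's a minimal sketch of averaging the per-sample scores from the details dataset. The config name, split, and metric field below are assumptions on my part; check the dataset viewer for the actual names in this repo:

```python
from datasets import load_dataset

# NOTE: the config name and metric field are assumptions --
# check the dataset viewer for the actual names in this details repo.
ds = load_dataset(
    "open-llm-leaderboard/meta-llama__Meta-Llama-3.1-70B-Instruct-details",
    name="meta-llama__Meta-Llama-3.1-70B-Instruct__leaderboard_math_hard",  # assumed config
    split="latest",  # assuming the most recent run is stored under "latest"
)

# Average the per-sample metric to compare against the leaderboard's raw score.
scores = [row["metrics"]["exact_match"] for row in ds]  # assumed field layout
print(f"samples={len(scores)}  avg MATH Hard={sum(scores) / len(scores):.4f}")
```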
cc @SaylorTwift, when you updated the results, did you update the details too?
@MaziyarPanahi, we had identified an issue in the MATH parsing (thanks to Meta), so all scores were updated a while back (1 or 2 months ago now, I'd say). The details were probably not updated at the same time, since the generations were the same and we only needed to recompute the answer extraction and the average.
Thanks @clefourrier, it makes sense now. Appreciate the response.
Hi! Yes, that's right, the details were overlooked when updating the results; only the results files that you can find in the results repo were updated, so that they could be displayed in the leaderboard. Sorry for the confusion!