Mistake in gaia's scoring function.

#10
by amedhat - opened

I believe there's a mistake, or a logical inconsistency, in the GAIA scoring function listed both here https://huggingface.co/spaces/gaia-benchmark/leaderboard/blob/main/scorer.py and in the supplementary materials of your ICLR paper here https://openreview.net/forum?id=fibxvahvs3

When the following ground truth and model answer are fed to the scoring function, it rates the model answer as correct, both on equivalence and on being a valid grammatical sentence.

  • ground truth: "The seagull glided peacefully to my chair."
  • model answer: "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
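
The collision can be reproduced with a minimal sketch. It assumes the scorer compares strings after removing all whitespace and punctuation and lowercasing them; `normalize` and `is_match` are illustrative names, not the exact code in scorer.py.

```python
import re
import string


def normalize(text: str) -> str:
    """Assumed normalization: drop whitespace and punctuation, then lowercase."""
    no_spaces = re.sub(r"\s", "", text)
    no_punct = no_spaces.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def is_match(model_answer: str, ground_truth: str) -> bool:
    """Exact match on the normalized strings."""
    return normalize(model_answer) == normalize(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."
model_answer = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"

# Both sides collapse to the same character sequence, so word boundaries
# play no part in the comparison.
print(normalize(ground_truth))
print(is_match(model_answer, ground_truth))
```

Under that assumption, any answer with the right letters in the right order is accepted, wherever the spaces fall.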
amedhat changed discussion title from Mistake in gaia scoring function listed in ICLR to Mistake in gaia's scoring function.
GAIA org

Hi, thanks for your interest!
Given the dataset that we have, and the fact that we require answers to be given as sequences of words rather than sentences, the point you raised is not an issue.
To get an exact match on our specific examples (even with incorrect spacing), you'd still need to have correctly understood both the prompt and the expected answer.

clefourrier changed discussion status to closed

Hey Clementine. Thanks for responding.

This is an interesting choice, given that this particular question from your validation set asks GPT precisely to identify where the spaces go in a sequence of letters, and that sequence always matches the correct answer once spaces are removed.

Appreciate your response.

GAIA org

Hm, can you give me the link to the question you are referring to? I might have misunderstood your comment.

The question is on the second page of the validation set: https://huggingface.co/datasets/gaia-benchmark/GAIA/viewer/2023_all/validation?p=1&row=124

The correct answer is "The seagull glided peacefully to my chair". However, the raw input of the problem, if submitted to the scoring function, would also score as correct even though it isn't, which defeats the purpose of this task.
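
For contrast, here is a sketch of a comparison that keeps word boundaries. It is purely illustrative: `normalize_keep_words` and `is_match_strict` are hypothetical names, and this is not a fix proposed or adopted by the GAIA maintainers.

```python
import string
from typing import List


def normalize_keep_words(text: str) -> List[str]:
    """Hypothetical normalization: strip punctuation, lowercase, split on whitespace."""
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower().split()


def is_match_strict(model_answer: str, ground_truth: str) -> bool:
    """Match only when both answers contain the same words in the same order."""
    return normalize_keep_words(model_answer) == normalize_keep_words(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."

# Wrong word boundaries no longer match.
print(is_match_strict("THESE A GULL GLIDED PEACEFULLY TO MY CHAIR", ground_truth))
# Case, extra spaces, and the trailing period are still tolerated.
print(is_match_strict("the seagull  glided peacefully to my chair", ground_truth))
```

Splitting on whitespace instead of deleting it still tolerates extra spaces and casing differences, but no longer accepts an answer whose words are split in the wrong places.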

IMG_5543.png
IMG_5544.png

GAIA org

Gotcha, super good point - I had completely missed this specific sample when we designed the scoring function!

I'm pinging people internally and reopening the discussion; we'll keep you posted.

clefourrier changed discussion status to open

Great. Thank you.
