Mistake in gaia's scoring function.

#10

by amedhat - opened Feb 13, 2024

amedhat

Feb 13, 2024

edited Feb 13, 2024

I believe there's a mistake, or a logical inconsistency, in the GAIA scoring function listed both here: https://huggingface.co/spaces/gaia-benchmark/leaderboard/blob/main/scorer.py and in the supplementary materials of your ICLR paper, here: https://openreview.net/forum?id=fibxvahvs3

When the following ground truth and model answer are fed to the scoring function, it rates the model answer as correct, treating it as both equivalent to the ground truth and a valid grammatical sentence.

ground truth: "The seagull glided peacefully to my chair.",
model answer: "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
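
To make the failure concrete, here is a minimal sketch of how such a match can happen. It assumes the scorer normalizes strings by removing all whitespace and punctuation and lowercasing before an exact comparison; `normalize_answer` and `exact_match` are hypothetical names for illustration, not the actual code in scorer.py.

```python
import re
import string


def normalize_answer(text: str) -> str:
    """Hypothetical normalizer (an assumption, not the real scorer):
    drop all whitespace and ASCII punctuation, then lowercase."""
    no_spaces = re.sub(r"\s", "", text)
    no_punct = no_spaces.translate(str.maketrans("", "", string.punctuation))
    return no_punct.lower()


def exact_match(model_answer: str, ground_truth: str) -> bool:
    """Compare the two strings after the normalization above."""
    return normalize_answer(model_answer) == normalize_answer(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."
model_answer = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"

# Both strings collapse to "theseagullglidedpeacefullytomychair".
print(exact_match(model_answer, ground_truth))  # prints True
```

Under this assumed normalization, the word boundaries are discarded before the comparison, so any spacing of the same letters would score as correct.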

amedhat changed discussion title from Mistake in gaia scoring function listed in ICLR to Mistake in gaia's scoring function. Feb 13, 2024

clefourrier

GAIA org Feb 14, 2024

Hi, thanks for your interest!
Given the dataset that we have, and the fact that we require answers to be given as sequences of words rather than sentences, the point that you raised is not an issue.
To get an exact match on our specific examples (even with incorrect spacing), you'd still need to have correctly understood both the prompt and the expected answer.

clefourrier changed discussion status to closed Feb 14, 2024

amedhat

Feb 14, 2024

Hey Clementine. Thanks for responding.

This is an interesting choice, given that this particular question from your validation set asks GPT precisely to identify where the spaces go in a sequence, and that sequence would always match the correct answer once spaces are removed.

Appreciate your response.

clefourrier

GAIA org Feb 14, 2024

Hm, can you give me the link to the question you are referring to? I might have misunderstood your comment.

amedhat

Feb 14, 2024

The question is on the second page of the validation set: https://huggingface.co/datasets/gaia-benchmark/GAIA/viewer/2023_all/validation?p=1&row=124

The correct answer is "The seagull glided peacefully to my chair", but the raw input of the problem, if submitted to the scoring function, would also score as correct even though it isn't, which defeats the purpose of this task.
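
For contrast, a comparison that keeps word boundaries would tell a mis-spaced answer apart from the correct one. The sketch below is only an illustration under that assumption, not a proposed patch to scorer.py; `tokenize` and `word_level_match` are hypothetical helpers, and the mis-spaced string is the example from the first post rather than the raw input of the question.

```python
import string
from typing import List

# Translation table that deletes ASCII punctuation characters.
_PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def tokenize(text: str) -> List[str]:
    """Hypothetical helper: strip punctuation, lowercase, and split on
    whitespace so that word boundaries are preserved."""
    return text.translate(_PUNCT_TABLE).lower().split()


def word_level_match(model_answer: str, ground_truth: str) -> bool:
    """Match only if both strings contain the same words in the same order."""
    return tokenize(model_answer) == tokenize(ground_truth)


ground_truth = "The seagull glided peacefully to my chair."
mis_spaced = "THESE A GULL GLIDED PEACEFULLY TO MY CHAIR"
well_spaced = "the seagull glided peacefully to my chair"

print(word_level_match(mis_spaced, ground_truth))   # prints False
print(word_level_match(well_spaced, ground_truth))  # prints True
```

This still ignores case and punctuation, but it no longer accepts an answer whose spaces are in the wrong places.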

clefourrier

GAIA org Feb 14, 2024

Gotcha, super good point - I had completely missed this specific sample when we designed the scoring function!

I'm pinging people internally and reopening; we'll keep you posted.

clefourrier changed discussion status to open Feb 14, 2024

amedhat

Feb 14, 2024

Great. Thank you.
