I have a question about the setup of GAIA in the paper

#3, opened by Said2k

" - Simplicity of use. Crucially, the answers to our questions are factoid, concise and unambiguous. These 2 properties allow simple, fast and factual evaluation. Our questions are meant to be answered in zero shot, limiting the influence of the evaluation setup. By opposition, many LLM benchmarks require evaluations that are sensitive to the experimental setup such as the number and nature of prompts (Liang et al., 2022b) (Section 8.2), or the benchmark implementation. "

It seems as if you're evaluating factual recall. How is the model's ability to recite facts tied to the actual USP of an agent? Why should the model be an oracle of information when LLMs function more as an interface, especially in the case of general assistance?

GAIA org

Hi! This is a benchmark for augmented LLMs, meaning LLMs with tooling, such as web search. We are not evaluating recall per se (some answers shouldn't be in the training data), but rather capabilities closer to information extraction, which addresses the point you are referring to.
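Concretely, "augmented LLM" here means a model that drives tools and reads their output, so the fact comes from the retrieved evidence rather than the model's weights. Below is a minimal, hypothetical retrieve-then-read sketch, not GAIA's or any specific framework's implementation; `llm` and `web_search` are placeholder callables supplied by the caller.

```python
from typing import Callable

def answer_with_tools(question: str,
                      llm: Callable[[str], str],
                      web_search: Callable[[str], str]) -> str:
    """One retrieve-then-read step: the model acts as an interface to
    the tool's results, not as an oracle of memorized facts."""
    # Ask the model to turn the question into a search query.
    query = llm(f"Write a web search query for: {question}")
    # Fetch evidence with the tool; this is where the fact comes from.
    evidence = web_search(query)
    # Extract a short, unambiguous factoid from the evidence.
    return llm(
        f"Question: {question}\n"
        f"Evidence:\n{evidence}\n"
        "Answer with a single short factoid."
    )
```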

clefourrier changed discussion status to closed
