arxiv:2311.12983

GAIA: a benchmark for General AI Assistants

Published on Nov 21, 2023

· Featured in Daily Papers on Nov 23, 2023

Authors:

Grégoire Mialon, Clémentine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, Thomas Scialom

Abstract

We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and generally tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks suggesting to target tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit similar robustness as the average human does on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leader-board available at https://huggingface.co/gaia-benchmark.

Community

pejas

Nov 23, 2023

This work is both fascinating and necessary, yet it seems to overlook the representation of the median human in its analysis. The educational breakdown of the annotators, as presented, does not align with the general population's educational levels. For context:

In the general U.S. population aged 25 and older in 2022, only 23% held a bachelor’s degree as their highest degree, while 14% had advanced education (like a master’s, professional, or doctoral degree), according to the Census Bureau.
In contrast, the paper indicates a much higher educational level among annotators:
Bachelor’s Degree: 61%
Master’s Degree: 26%
PhD: 17%
This discrepancy raises questions about the representativeness of the research sample compared to the general population.

vincentmin

Nov 23, 2023

In the level 3 example you say explicitly "Use commas as thousands separators in the number of minutes.". The provided answer is "Ground truth: White; 5876". Should it not be "Ground truth: White; 5,876"?

gregmialz

Paper author Nov 23, 2023

You're absolutely right: "Use commas as thousands separators in the number of minutes." comes from an older version of the dataset. We will remove it in the next version of the paper.
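The formatting mismatch discussed above ("5876" vs. "5,876") can be illustrated with a minimal sketch of a separator-tolerant answer comparison. This is not GAIA's official scorer: the function names `answers_match` and `_as_number`, the semicolon-splitting rule, and the case-insensitive text comparison are all assumptions made for illustration.

```python
import re

# Matches integers or decimals, optionally written with comma thousands separators.
_NUMBER = re.compile(r"-?\d[\d,]*(\.\d+)?")


def _as_number(text: str):
    """Return the float value of text, or None if text is not numeric."""
    cleaned = text.strip()
    if _NUMBER.fullmatch(cleaned):
        return float(cleaned.replace(",", ""))
    return None


def answers_match(predicted: str, truth: str) -> bool:
    """Compare semicolon-separated answers element by element (hypothetical rule).

    Numeric elements are compared by value, ignoring comma separators;
    all other elements are compared as case-insensitive text.
    """
    pred_parts = [part.strip() for part in predicted.split(";")]
    true_parts = [part.strip() for part in truth.split(";")]
    if len(pred_parts) != len(true_parts):
        return False
    for pred, true in zip(pred_parts, true_parts):
        true_value = _as_number(true)
        if true_value is not None:
            if _as_number(pred) != true_value:
                return False
        elif pred.lower() != true.lower():
            return False
    return True


print(answers_match("White; 5,876", "White; 5876"))  # True
```

Under a strict string comparison the two forms would not match, which is why an instruction in the question that disagrees with the ground-truth format matters.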

gregmialz

Paper author Nov 23, 2023

edited Nov 23, 2023

There is indeed likely a discrepancy that is impossible to solve between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.

lewtun

Nov 24, 2023

Very cool benchmark, congrats!

Can you share any examples from levels 1 & 2 where GPT-4 got the right answer, but the human annotators didn't? I think it would be quite interesting to learn whether there's a type of multi-step question that LLMs are intrinsically better at than humans.

clefourrier

Paper author Nov 24, 2023

Most of the mistakes that were made by human validators (and the reason we don't get a 100% human score) were attention mistakes (misreading or mistyping something, for example) rather than a difference in actual capability - unless you count "focus" as a capability, in which case we could argue that machines in general are already better at it than most of us 😅

@gregmialz would have specific examples of this.

nembal

Nov 24, 2023

GAIA is the Turing test of AI!

TouristShaun

Nov 24, 2023

There is indeed likely a discrepancy that is impossible to solve between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.

I wouldn't say impossible, but I'm not sure how feasible it is to do so:

require/validate specific standardized test results, taken within X years (relative to the type of test), in the annotator's CV prior to acceptance
rank annotators against the population being compared against
annotator pay should reflect current job responsibilities and requirements to obtain

MichaelBarryUK

Nov 25, 2023

I love that NASA question. It will be something else entirely when LLMs are nailing those level 3 questions. I mean, you could wrestle the answer out with some clever and patient prompt engineering and chaining, but when it can do that zero-shot... Basically magic. This oughta be the gold standard.

someone13574

Nov 28, 2023

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

zandrrlife

Nov 28, 2023

edited Nov 28, 2023

This couples nicely with this benchmark suite -> GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)

thomwolf

Paper author Nov 29, 2023

@someone13574 yes these questions are quite easy to re-create or slightly modify in the case of memorization.
But also: getting the right answer without a good "trace of reasoning" doesn't mean much on this dataset

lunarflu

Nov 29, 2023

Thanks @clefourrier for letting us know! 🤗

malohu

Nov 29, 2023

Nice work! Those questions are fun. It's sad that the new ChatGPT with all tools (web, image, Python) doesn't have a proper API, so it can't be tested as well. Here is a totally cherry-picked example (it worked only once), and it is still a loss because the answer is not properly formatted:

MichaelBarryUK

Nov 30, 2023

What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?

You'd have to be a bit of a bastard to do that 😂 Maybe someone would do it to poison the competition?

It's certainly not something to overlook.

Here are my thoughts:

Publish 70% of the dataset and keep 30% behind a trusted API. Hugging Face et al. could easily implement this functionality. Essentially, we would all have to agree that this central authority is trustworthy and unbiased.
Regularly update the dataset. This requires humans and is expensive. Who has the incentive to do this?
Generate a synthetic dataset on the fly. Is this even plausible, and is it self-defeating?
Close your eyes and hope for the best.
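The first option in the list above (a public portion plus a held-out portion scored by a trusted party) could be sketched as a deterministic, hash-based split. This is only an illustration under stated assumptions: the function names `is_public` and `split_tasks`, the `task-N` identifiers, and the 0.7 fraction are hypothetical, and this is not how the GAIA validation/test split was actually produced.

```python
import hashlib


def is_public(task_id: str, public_fraction: float = 0.7) -> bool:
    """Assign a task to the public portion based on a stable hash of its id."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the digest to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < public_fraction


def split_tasks(task_ids, public_fraction: float = 0.7):
    """Return (public, held_out) lists; the result is stable across runs."""
    public = [t for t in task_ids if is_public(t, public_fraction)]
    held_out = [t for t in task_ids if not is_public(t, public_fraction)]
    return public, held_out


# Hypothetical task identifiers, for illustration only.
task_ids = [f"task-{i}" for i in range(1000)]
public, held_out = split_tasks(task_ids)
print(len(public), len(held_out))
```

Because the assignment depends only on the task id, new questions can be added later without reshuffling existing ones; the held-out answers would still need to live with whoever runs the scoring service.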

😂

zandrrlife

Nov 30, 2023

People really don't care about data contamination. How about we resist running to chatgpt with the dataset ha.

clefourrier

Paper author Nov 30, 2023

edited Nov 30, 2023

Hi! Thank you all for your points about data contamination!

This is precisely why:

we only released the answers on the validation set, not on the test set, which is considerably bigger;
we released the precise recipe for generating such a dataset, in the hope that it will be extended with time;
we ask for the reasoning trace of the model.

But since, at the moment, even the best models don't reach more than a few points on level 3, I think we have some time before us :)

MohammadOthman

Dec 3, 2023

The 'GAIA' paper presents a fascinating study but raises a crucial question: does the higher educational level of annotators, compared to the general population, affect the evaluation of AI performance? This discrepancy might skew the AI's ability to handle real-world tasks that are more representative of the broader population's capabilities and perspectives. It's vital to consider a more diverse range of annotators to truly assess AI's proficiency in real-world scenarios.

skaskapro

Jan 7

edited Jan 7

@clefourrier @gregmialz
Hi. I was looking for 'pure riddles' (tasks where the number of tools used is zero) in the dataset.
The following tasks list an incorrect number of tools in their solution (i.e. the described solution mentions 'websearch' or other tools, but the 'tools' section is empty)
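The consistency check described above can be sketched as a small filter over annotation records. The record layout used here (`task_id`, `steps`, `tools`, `number_of_tools`), the keyword list, and the sample records are all hypothetical and chosen for illustration; the real dataset's field names may differ, and keyword matching will miss or over-flag some cases.

```python
# Hypothetical keywords that suggest a tool was used in the described solution.
TOOL_KEYWORDS = ("web search", "websearch", "search engine", "browser", "calculator")


def mentions_tool(steps: str) -> bool:
    """Return True if the solution text mentions any of the assumed tool keywords."""
    lowered = steps.lower()
    return any(keyword in lowered for keyword in TOOL_KEYWORDS)


def find_inconsistent(records):
    """Return ids of tasks that declare no tools but describe tool use in their steps."""
    flagged = []
    for record in records:
        declares_no_tools = (
            record["number_of_tools"] == 0 or not record["tools"].strip()
        )
        if declares_no_tools and mentions_tool(record["steps"]):
            flagged.append(record["task_id"])
    return flagged


# Made-up records, not taken from the dataset.
records = [
    {"task_id": "a", "steps": "Reason about the riddle.", "tools": "", "number_of_tools": 0},
    {"task_id": "b", "steps": "1. Websearch the name.", "tools": "", "number_of_tools": 0},
    {"task_id": "c", "steps": "Open a browser.", "tools": "1. Web browser", "number_of_tools": 1},
]
print(find_inconsistent(records))  # ['b']
```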