Abstract
We introduce GAIA, a benchmark for General AI Assistants that, if solved, would represent a milestone in AI research. GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and, more generally, tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs: we show that human respondents obtain 92% vs. 15% for GPT-4 equipped with plugins. This notable performance disparity contrasts with the recent trend of LLMs outperforming humans on tasks requiring professional skills in e.g. law or chemistry. GAIA's philosophy departs from the current trend in AI benchmarks, which suggests targeting tasks that are ever more difficult for humans. We posit that the advent of Artificial General Intelligence (AGI) hinges on a system's capability to exhibit robustness similar to that of the average human on such questions. Using GAIA's methodology, we devise 466 questions and their answers. We release our questions while retaining answers to 300 of them to power a leaderboard available at https://huggingface.co/gaia-benchmark.
Community
This work is both fascinating and necessary, yet it seems to overlook the representation of the median human in its analysis. The educational breakdown of the annotators, as presented, does not align with the general population's educational levels. For context:
In the general U.S. population aged 25 and older in 2022, only 23% held a bachelor’s degree as their highest degree, while 14% had advanced education (like a master’s, professional, or doctoral degree), according to the Census Bureau.
In contrast, the paper indicates a much higher educational level among annotators:
Bachelor’s Degree: 61%
Master’s Degree: 26%
PhD: 17%
This discrepancy raises questions about the representativeness of the research sample compared to the general population.
In the level 3 example, you explicitly say "Use commas as thousands separators in the number of minutes." The provided answer is "Ground truth: White; 5876". Should it not be "Ground truth: White; 5,876"?
You're absolutely right: "Use commas as thousands separators in the number of minutes." comes from an older version of the dataset. We will remove it in the next version of the paper.
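As a side note on the formatting question above, a scorer can be made insensitive to thousands separators so that "5876" and "5,876" compare equal. The following is only a minimal sketch of that idea: `normalize_answer` and `answers_match` are hypothetical names, and this is not claimed to be how the GAIA leaderboard actually scores answers.

```python
import re


def normalize_answer(answer: str) -> str:
    """Lowercase, trim, and drop thousands separators inside numbers (hypothetical helper)."""
    text = answer.strip().lower()
    # Remove a comma only when it sits between a digit and a group of exactly three digits.
    text = re.sub(r"(?<=\d),(?=\d{3}(?!\d))", "", text)
    # Use a single, uniform spacing around the ';' that separates answer parts.
    text = re.sub(r"\s*;\s*", "; ", text)
    return text


def answers_match(predicted: str, ground_truth: str) -> bool:
    """Compare two answers after normalization (assumed exact-match scoring)."""
    return normalize_answer(predicted) == normalize_answer(ground_truth)


print(answers_match("White; 5,876", "White; 5876"))
```

Commas that separate list items, as in "red, green", are left untouched because they are not followed by exactly three digits.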
There is indeed likely a discrepancy that is impossible to solve between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.
Very cool benchmark, congrats!
Can you share any examples from levels 1 & 2 where GPT-4 got the right answer, but the human annotators didn't? I think it would be quite interesting to learn whether there's a type of multi-step question that LLMs are intrinsically better at than humans.
Most of the mistakes made by human validators (and the reason we don't get a 100% human score) were attention mistakes (misreading or mistyping something, for example) rather than a difference in actual capability - unless you count "focus" as a capability, in which case we could argue that machines in general are already better at it than most of us 😅
@gregmialz would have specific examples of this.
GAIA is the Turing test of AI!
This discrepancy raises questions about the representativeness of the research sample compared to the general population.
There is indeed likely a discrepancy that is impossible to solve between the distribution of the annotators and the general population. That being said, the questions require fundamental abilities (planning, tool use, multi-modal understanding, etc.) rather than expert knowledge.
I wouldn't say impossible, but I'm not sure how feasible it is to do so:
- require/validate specific standardized test results in the annotator's CV, taken within X years (relative to the type of test), prior to acceptance
- rank annotators vs. the population being compared against
- annotator pay should reflect current job responsibilities and requirements to obtain
I love that NASA question. It will be something else entirely when LLMs are nailing those level 3 questions. I mean, you could wrestle the answer out with some clever and patient prompt engineering and chaining, but when it can do that zero-shot... Basically magic. This oughta be the gold standard.
What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?
This couples nicely with this benchmark suite: GPQA: A Graduate-Level Google-Proof Q&A Benchmark (https://arxiv.org/abs/2311.12022)
@someone13574 Yes, these questions are quite easy to re-create or slightly modify in the case of memorization.
But also: getting the right answer without a good "trace of reasoning" doesn't mean much on this dataset.
Thanks @clefourrier for letting us know! 🤗
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- MathVista: Evaluating Math Reasoning in Visual Contexts with GPT-4V, Bard, and Other Large Multimodal Models (2023)
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023)
- CORE-MM: Complex Open-Ended Reasoning Evaluation For Multi-Modal Large Language Models (2023)
- TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs (2023)
- Evaluating General-Purpose AI with Psychometrics (2023)
Please give a thumbs up to this comment if you found it helpful!
If you want recommendations for any paper on Hugging Face, check out this Space
Nice work! Those questions are fun. It's sad that the new ChatGPT with all tools (web, image, Python) doesn't have a proper API, so it can't be tested as well. Here is a totally cherry-picked example (it worked only once), and it is still a loss because the answer is not properly formatted:
What is the plan for the inevitability of someone solving all the questions and putting them out on the open web? Just regularly create new problem sets?
You'd have to be a bit of a bastard to do that 😂 Maybe someone would do it to poison the competition?
It's certainly not something to overlook.
Here are my thoughts:
- Publish 70% of the dataset, then keep 30% behind a trusted API. Hugging Face et al. could easily implement this functionality. Essentially, we would all have to agree that this central authority is trustworthy and unbiased.
- Regularly update the dataset. This requires humans and is expensive. Who has the incentive to do this?
- Generate a synthetic dataset on the fly. Is this even plausible, and is it self-defeating?
- Close your eyes and hope for the best
😂
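The 70/30 idea above can be sketched as a deterministic split keyed on task IDs, so that a task never moves between the public and held-out sets when the dataset is regenerated. This is only an illustration under stated assumptions: `is_public` and `split_tasks` are hypothetical names, the IDs are made up, and nothing here describes how the GAIA validation and test sets were actually built.

```python
import hashlib


def is_public(task_id: str, public_fraction: float = 0.7) -> bool:
    """Assign a task to the public split from a stable hash of its ID (hypothetical scheme)."""
    digest = hashlib.sha256(task_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < public_fraction


def split_tasks(task_ids):
    """Return (public, held_out) lists; the held-out answers would stay behind the trusted API."""
    public = [t for t in task_ids if is_public(t)]
    held_out = [t for t in task_ids if not is_public(t)]
    return public, held_out


# Made-up IDs, used only to show the call.
public, held_out = split_tasks([f"task-{i}" for i in range(1000)])
print(len(public), len(held_out))
```

Hashing rather than random sampling keeps the assignment reproducible without storing a seed, at the cost of the split being only approximately 70/30.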
People really don't care about data contamination. How about we resist running to ChatGPT with the dataset, ha.
Hi! Thank you all for your points about data contamination!
This is precisely why:
- we only released the answers on the validation set, not on the test set, which is considerably bigger
- we released the precise recipe for generating such a dataset, in the hope that it will be extended with time
- we ask for the reasoning trace of the model
But since, at the moment, even the best models don't reach more than a few points on level 3, I think we have some time before us :)
The 'GAIA' paper presents a fascinating study but raises a crucial question: does the higher educational level of annotators, compared to the general population, affect the evaluation of AI performance? This discrepancy might skew the AI's ability to handle real-world tasks that are more representative of the broader population's capabilities and perspectives. It's vital to consider a more diverse range of annotators to truly assess AI's proficiency in real-world scenarios.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- GPQA: A Graduate-Level Google-Proof Q&A Benchmark (2023)
- MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI (2023)
- Developer Experiences with a Contextualized AI Coding Assistant: Usability, Expectations, and Outcomes (2023)
- TencentLLMEval: A Hierarchical Evaluation of Real-World Capabilities for Human-Aligned LLMs (2023)
- Evaluating General-Purpose AI with Psychometrics (2023)
@clefourrier @gregmialz
Hi. I was looking for "pure riddles" (tasks where the number of tools used equals zero) in the dataset.
The following tasks list an incorrect number of tools (i.e. the described solution mentions "websearch" or other tools, but the "tools" section is empty):
- 305ac316-eef6-4446-960a-92d80d542f82
- cf106601-ab4f-4af9-b045-5295fe67b37d
- 5a0c1adf-205e-4841-a666-7c3ef95def9d
By the way, I'm not sure the answers to the following riddles are correct:
- 42576abe-0deb-4869-8c63-225c2d75a95a (ask GPT-4 to think step by step)
- ec09fa32-d03f-4bf8-84b0-1f16922c3ae4
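The kind of check described in this comment (an empty tool list next to a solution that mentions a tool) can be automated with a simple keyword scan. The sketch below is hedged throughout: the record fields `task_id`, `steps`, and `tools`, the keyword list, and the sample records are all assumptions made for illustration and may not match the dataset's real schema.

```python
# Assumed keywords that suggest a tool was used; extend as needed.
TOOL_KEYWORDS = ("web search", "websearch", "search engine", "browser", "calculator")


def mentions_tool(steps: str) -> bool:
    """Return True if the free-text solution steps mention any assumed tool keyword."""
    lowered = steps.lower()
    return any(keyword in lowered for keyword in TOOL_KEYWORDS)


def find_inconsistent(records):
    """Return IDs of tasks whose tool list is empty although the steps mention a tool."""
    return [
        record["task_id"]
        for record in records
        if not record["tools"] and mentions_tool(record["steps"])
    ]


# Made-up records with hypothetical field names.
records = [
    {"task_id": "example-a", "steps": "1. Did a web search for the author.", "tools": []},
    {"task_id": "example-b", "steps": "1. Worked out the riddle by hand.", "tools": []},
    {"task_id": "example-c", "steps": "1. Opened a browser.", "tools": ["Web browser"]},
]
print(find_inconsistent(records))  # ['example-a']
```

A keyword scan will miss tools described in other words, so flagged tasks are candidates for manual review rather than confirmed errors.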