arxiv:2406.10149

BABILong: Testing the Limits of LLMs with Long Context Reasoning-in-a-Haystack

Published on Jun 14 · Submitted by yurakuratov on Jun 17

Abstract

In recent years, the input context sizes of large language models (LLMs) have increased dramatically. However, existing evaluation methods have not kept pace, failing to comprehensively assess the efficiency of models in handling long contexts. To bridge this gap, we introduce the BABILong benchmark, designed to test language models' ability to reason across facts distributed in extremely long documents. BABILong includes a diverse set of 20 reasoning tasks, including fact chaining, simple induction, deduction, counting, and handling lists/sets. These tasks are challenging on their own, and even more demanding when the required facts are scattered across long natural text. Our evaluations show that popular LLMs effectively utilize only 10-20% of the context and their performance declines sharply with increased reasoning complexity. Among alternatives to in-context reasoning, Retrieval-Augmented Generation methods achieve a modest 60% accuracy on single-fact question answering, independent of context length. Among context extension methods, the highest performance is demonstrated by recurrent memory transformers, enabling the processing of lengths up to 11 million tokens. The BABILong benchmark is extendable to any length to support the evaluation of new upcoming models with increased capabilities, and we provide splits up to 1 million token lengths.
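To make the "reasoning-in-a-haystack" setup concrete, here is a minimal sketch of how such a sample could be assembled: facts from a bAbI-style task are scattered at random positions inside long background text, with the question appended at the end. The function name and filler sentences are illustrative assumptions, not the authors' generation code.

```python
import random

def make_sample(facts: list[str], question: str,
                background_sentences: list[str], target_sentences: int) -> str:
    """Scatter `facts` among background sentences and append the question."""
    haystack = list(background_sentences[:target_sentences])
    # pick random insertion points; keep the facts in their original order
    positions = sorted(random.sample(range(len(haystack) + 1), len(facts)))
    for offset, (pos, fact) in enumerate(zip(positions, facts)):
        haystack.insert(pos + offset, fact)
    return " ".join(haystack) + f"\nQuestion: {question}"

# toy QA1-style (single supporting fact) example with synthetic filler text
facts = ["Mary moved to the bathroom.", "John went to the hallway."]
question = "Where is Mary?"
background = [f"Filler sentence number {i}." for i in range(1000)]
print(make_sample(facts, question, background, target_sentences=30))
```

Because the background text can be made arbitrarily long before the facts are inserted, the same recipe scales to any target context length, which is what makes the benchmark extendable.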

Community

Paper author · Paper submitter

One of our main findings is that the simple commonsense reasoning required by BABILong is still a challenge for current long-context models.

Even models that claim to support 128K tokens experience degradation once inputs exceed 10% of their claimed capacity. RAG methods do not help, while fine-tuning of small-scale models (RMT 137M and Mamba 130M) shows that the tasks are solvable. Values represent average accuracy over the QA1-QA5 tasks from BABILong.

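For concreteness, a minimal sketch of this evaluation protocol (average accuracy over QA1-QA5 at several context lengths) is given below. The dataset identifier, its config/split layout, the field names, and the `answer_fn` wrapper are assumptions for illustration, not the authors' exact harness.

```python
from datasets import load_dataset

def evaluate(answer_fn, lengths=("4k", "16k", "64k"),
             tasks=("qa1", "qa2", "qa3", "qa4", "qa5")):
    """Average exact-match accuracy over the QA1-QA5 tasks at each context length."""
    scores = {}
    for length in lengths:
        correct, total = 0, 0
        for task in tasks:
            # assumed Hub layout: one config per context length, one split per task;
            # the field names ("input", "question", "target") are also assumptions
            data = load_dataset("RMT-team/babilong", length, split=task)
            for sample in data:
                prediction = answer_fn(sample["input"], sample["question"])
                correct += int(sample["target"].lower() in prediction.lower())
                total += 1
        scores[length] = correct / total  # average accuracy over QA1-QA5
    return scores

# `answer_fn` wraps whatever model is being tested, e.g. an API or local LLM call:
# scores = evaluate(lambda context, question: my_llm(context + "\n" + question))
```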

Paper author

Q: How effectively do LLMs use the context window in QA tasks?

A: LLMs struggle to answer questions about facts in texts larger than 10,000 tokens.

The plots demonstrate how the performance of selected leading models deteriorates with increasing context size. For single supporting fact questions (QA1), the majority of models perform well up to 4,000 tokens. However, when a correct response requires two (QA2) or three (QA3) facts, LLMs fail to achieve satisfactory accuracy.


Thanks for your amazing work!



Datasets citing this paper: 2


Collections including this paper: 5