Same Task, More Tokens: the Impact of Input Length on the Reasoning Performance of Large Language Models

Published on Feb 19
· Submitted by akhaliq on Feb 26


This paper explores the impact of extending input lengths on the capabilities of Large Language Models (LLMs). Despite LLMs' recent advancements, their performance consistency across different input lengths is not well understood. We investigate this aspect by introducing a novel QA reasoning framework, specifically designed to assess the impact of input length. We isolate the effect of input length using multiple versions of the same sample, each being extended with padding of different lengths, types and locations. Our findings show a notable degradation in LLMs' reasoning performance at much shorter input lengths than their technical maximum. We show that the degradation trend appears in every version of our dataset, although at different intensities. Additionally, our study reveals that traditional perplexity metrics do not correlate with the performance of LLMs in long input reasoning tasks. We analyse our results and identify failure modes that can serve as useful guides for future research, potentially informing strategies to address the limitations observed in LLMs.
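The padding idea described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function name `pad_sample`, the `key_position` options, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration.

```python
from itertools import cycle


def pad_sample(key_text, filler_sentences, target_words, key_position="middle"):
    """Embed key_text in filler so the result has at most target_words words.

    Hypothetical helper: words approximate tokens, and filler sentences are
    repeated in order until the next one would exceed the word budget.
    """
    # Drop empty filler entries so the loop below always makes progress.
    filler = [s for s in filler_sentences if s.split()]
    budget = max(0, target_words - len(key_text.split()))

    padding, used = [], 0
    for sentence in cycle(filler):
        n = len(sentence.split())
        if used + n > budget:
            break
        padding.append(sentence)
        used += n

    if key_position == "start":
        parts = [key_text] + padding
    elif key_position == "end":
        parts = padding + [key_text]
    elif key_position == "middle":
        half = len(padding) // 2
        parts = padding[:half] + [key_text] + padding[half:]
    else:
        raise ValueError(f"unknown key_position: {key_position!r}")
    return " ".join(parts)


# Same sample, several lengths and locations: only the padding changes.
key = "Alice is taller than Bob."
filler = ["The sky is blue.", "Water is wet."]
variants = {
    (n, pos): pad_sample(key, filler, n, pos)
    for n in (5, 20, 40)
    for pos in ("start", "middle", "end")
}
```

Because every variant contains the same key text, any change in a model's answer across variants can be attributed to the padding rather than to the task itself.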


You know, I've been butting up against this very problem: taking advertised context length at its word, but continually finding that reasoning drops off around 3k tokens. For a while, I thought maybe I was doing something wrong, but I've been coming more and more to the conclusion that, regardless of model strength, when I want information-dense reasoning I need to feed smaller pieces of my context in each turn.

For comprehensive bulleted note summaries, I can get consistently high-quality results only under 2250 tokens.

When ranking various models for the task, I'm now realizing that I've probably overstuffed their context, and that's why I wasn't getting the full benefit of their reasoning capacity.
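As a rough illustration of feeding smaller pieces of context per turn, here is a minimal sketch that greedily packs paragraphs into chunks under a token budget. The function `chunk_paragraphs` is hypothetical, and counting whitespace-separated words is only an assumed stand-in for a real tokenizer; the 2250 default simply mirrors the figure mentioned above.

```python
def chunk_paragraphs(paragraphs, max_tokens=2250,
                     count_tokens=lambda s: len(s.split())):
    """Greedily group paragraphs into chunks of at most max_tokens each.

    Assumption: count_tokens approximates the model's tokenizer. A single
    paragraph larger than the budget is kept whole as its own chunk.
    """
    chunks, current, used = [], [], 0
    for paragraph in paragraphs:
        n = count_tokens(paragraph)
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and used + n > max_tokens:
            chunks.append("\n\n".join(current))
            current, used = [], 0
        current.append(paragraph)
        used += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Tiny budget so the split is easy to see.
print(chunk_paragraphs(["a b c", "d e", "f g h i"], max_tokens=5))
```

Passing a real tokenizer's counting function as `count_tokens` would give budgets that match the model more closely than word counts do.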

Paper author

One thing to consider is that much of a model's usable, effective context window for reasoning may well be taken up by system prompts, in addition to overly elaborate user instructions. And while we have no access to how closed models are implemented, this degraded behaviour is consistent enough to be worth considering when training future models. I'm glad we could help showcase the problems you've encountered!

I don't think these results are very indicative because the prompts used did not use any delimiters for data vs. instruction. Omitting delimiters greatly increases data/instruction commingling.


I've seen delimiters decrease performance with 7B-parameter models. I don't think testing every model with the same prompts will reveal each model's actual capacity on its own terms.
