LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Published on Feb 21
Featured in Daily Papers on Feb 22


A large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, the scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with only up to 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformity in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
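As background for the positional-interpolation idea the abstract builds on, here is a minimal sketch of rotary position embedding (RoPE) angles with a uniform interpolation scale. This is the prior technique LongRoPE generalizes, not the paper's own non-uniform method; the function name and parameters are illustrative.

```python
import numpy as np

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position. `scale` > 1 compresses
    positions (linear positional interpolation) so a longer context
    maps back into the position range seen during pre-training.
    Illustrative sketch, not the paper's exact formulation."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return (position / scale) * inv_freq

# Extending a 4k-trained model to 32k with naive uniform interpolation
# would use scale = 8, squeezing position 32767 back into the trained range:
angles_orig = rope_angles(4095, dim=64)
angles_ext = rope_angles(32767, dim=64, scale=8)
```

Uniform scaling treats every rotary dimension and every position identically; the paper's point is that this uniformity is suboptimal.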


Paper author

We are currently in the process of a Microsoft open source review. Therefore, the paper's code link is currently private. We'll release the code and extended LLMs soon. Thanks for your patience.

This approach looks very doable but adds cost both in time and resources to generate the model.

What I want to know is what are the inferencing resources budget impacts of this approach?

Based on my experience so far, after trying most of the so-called 64k, 128k, and 200k models, I'm almost sure that this method fails too and will remain only an academic paper. How can we expect such a thing when so-called 128k models can't take a 3,000-token input and process it properly? If you think I'm wrong, ask any model you like to expand an input of about 3,000 tokens, and I'm sure you will get far fewer than 3,000 tokens of output, which means the model summarized the input instead of expanding it!

Not the author but your "test" is pretty brutal and even at smaller chunks it would be challenging from a context extension approach (which the paper is about). Most of these context extensions benefit from sampling along the way.

While all analogies break down, it's like going to a ski resort: you want to roll a snowball up a hill and then expect it to double in size with no new snow, while also giving the resort insufficient time to make replacement snow.

Is the main idea here that you identified two "non-uniformities" (I'm not sure they make sense to group together):

  1. A significant frequency inequality between head and tail dimensions when RoPE is applied
  2. The attention-sinking of early tokens

The proposed scheme is to apply a monotonically increasing scale (larger divisors) across the hidden dimensions, beyond some initial window of positions. These can be represented as hyper-parameters (scale[dim], window). An evolutionary search over a guidance set is then applied to identify good parameters.
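A minimal sketch of that scheme, assuming a per-dimension scale vector `scale` and an unscaled initial `window` of token positions (the names and exact form are mine, not the paper's code):

```python
import numpy as np

def nonuniform_rope_angles(position, dim, scale, window, base=10000.0):
    """Sketch of non-uniform positional interpolation: positions inside
    the initial `window` keep their original angles (preserving the
    attention-sink early tokens); beyond it, each rotary dimension d is
    compressed by its own scale[d], with larger divisors for the
    low-frequency tail dimensions. Illustrative only."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    if position < window:
        return position * inv_freq        # no interpolation for early tokens
    return (position / scale) * inv_freq  # per-dimension rescaling

# Head dimensions barely scaled, tail dimensions heavily scaled:
dim = 8
scale = np.linspace(1.0, 8.0, dim // 2)
angles = nonuniform_rope_angles(100, dim, scale, window=32)
```

The evolutionary search would then treat `scale` and `window` as the genome and score candidates by perplexity on a guidance set of long sequences.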

You're able to show competitive performance with data-efficient continued training (1k steps, ??? tokens) in terms of perplexity at long context (evaluated on some long-sequence eval set), as well as good passkey-retrieval accuracy on the following task:

There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.

The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. (repeat x times)

The pass key is 17865. Remember it. 17865 is the pass key.

The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. (repeat y times)

What is the pass key? The pass key is
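For concreteness, the passkey-retrieval prompt quoted above can be assembled programmatically; this is a hypothetical helper (the function name and parameters are mine), where `n_before` and `n_after` stand for the x and y repeat counts:

```python
def build_passkey_prompt(pass_key, n_before, n_after):
    """Assemble the passkey-retrieval prompt: filler text repeated
    n_before times, the passkey sentence, filler repeated n_after
    times, then the retrieval question."""
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    return (
        "There is an important info hidden inside a lot of irrelevant text. "
        "Find it and memorize them. I will quiz you about the important "
        "information there.\n"
        + filler * n_before
        + f"\nThe pass key is {pass_key}. Remember it. "
        + f"{pass_key} is the pass key.\n"
        + filler * n_after
        + "\nWhat is the pass key? The pass key is"
    )

prompt = build_passkey_prompt(17865, n_before=3, n_after=3)
```

Sweeping `n_before` and `n_after` moves the passkey to different depths of the context, which is how retrieval accuracy is typically measured across positions.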

We can index and summarize tens of thousands of pages of complex medical records. Our largest file so far has been 38,000 pages. Not theoretical. Real summaries. Identifying specific document types, dates of service, focused on client directed clinical issues or medico-legal issues. We provide these to clients every day. Absolutely zero human involvement and guaranteed zero hallucinations.

How do you deal with such long files?

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Guys, how can I use @librarian-bot recommend? I've just tried it on another paper page and it did not seem to work.

