LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

Published on Feb 21
Featured in Daily Papers on Feb 22


A large context window is a desirable feature in large language models (LLMs). However, due to high fine-tuning costs, the scarcity of long texts, and catastrophic values introduced by new token positions, current extended context windows are limited to around 128k tokens. This paper introduces LongRoPE which, for the first time, extends the context window of pre-trained LLMs to an impressive 2048k tokens, with only up to 1k fine-tuning steps at training lengths within 256k, while maintaining performance at the original short context window. This is achieved by three key innovations: (i) we identify and exploit two forms of non-uniformity in positional interpolation through an efficient search, providing a better initialization for fine-tuning and enabling an 8x extension in non-fine-tuning scenarios; (ii) we introduce a progressive extension strategy that first fine-tunes a 256k-length LLM and then conducts a second positional interpolation on the fine-tuned extended LLM to achieve a 2048k context window; (iii) we readjust LongRoPE on 8k length to recover the short context window performance. Extensive experiments on LLaMA2 and Mistral across various tasks demonstrate the effectiveness of our method. Models extended via LongRoPE retain the original architecture with minor modifications to the positional embedding, and can reuse most pre-existing optimizations.
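As background for the positional-interpolation idea the abstract builds on, here is a minimal sketch of rotary position embedding (RoPE) angles with a uniform interpolation scale. This is the prior technique LongRoPE generalizes, not the paper's own non-uniform method; the function name and parameters are illustrative.

```python
import numpy as np

def rope_angles(position, dim, base=10000.0, scale=1.0):
    """Rotary angles for one token position. `scale` > 1 compresses
    positions (linear positional interpolation) so a longer context
    maps back into the position range seen during pre-training.
    Illustrative sketch, not the paper's exact formulation."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    return (position / scale) * inv_freq

# Extending a 4k-trained model to 32k with naive uniform interpolation
# would use scale = 8, squeezing position 32767 back into the trained range:
angles_orig = rope_angles(4095, dim=64)
angles_ext = rope_angles(32767, dim=64, scale=8)
```

Uniform scaling treats every rotary dimension and every position identically; the paper's point is that this uniformity is suboptimal.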


Paper author

We are currently in the process of a Microsoft open source review. Therefore, the paper's code link is currently private. We'll release the code and extended LLMs soon. Thanks for your patience.

This approach looks very doable but adds cost both in time and resources to generate the model.

What I want to know is what are the inferencing resources budget impacts of this approach?

Based on my experience so far, after trying most of the so-called 64k, 128k, and 200k models, I'm almost sure that this method fails too and will remain only an academic paper. How can we expect such a thing when so-called 128k models can't take a 3,000-token input and process it properly? If you think I'm wrong, ask any model you like to expand an input of about 3,000 tokens, and I'm sure you will get far fewer than 3,000 tokens of output, which means the model summarized the input instead of expanding it!

Not the author but your "test" is pretty brutal and even at smaller chunks it would be challenging from a context extension approach (which the paper is about). Most of these context extensions benefit from sampling along the way.

While all analogies break down, it's like going to a ski resort: you want to roll a snowball up a hill and then expect it to double in size with no new snow, while also giving the resort insufficient time to make replacement snow.

Is the main idea here that you identified two "non-uniformities" (I'm not sure they make sense to group together):

  1. A significant frequency inequality between head and tail dimensions when RoPE is applied
  2. The attention-sinking of early tokens

The proposed scheme is to apply a monotonically increasing scale (larger divisors) across the hidden dimensions, beyond some initial window of positions. These can be represented as hyper-parameters (scale[dim], window). An evolutionary search over a guidance set is then applied to identify good parameters.
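A minimal sketch of that scheme, assuming a per-dimension scale vector `scale` and an unscaled initial `window` of token positions (the names and exact form are mine, not the paper's code):

```python
import numpy as np

def nonuniform_rope_angles(position, dim, scale, window, base=10000.0):
    """Sketch of non-uniform positional interpolation: positions inside
    the initial `window` keep their original angles (preserving the
    attention-sink early tokens); beyond it, each rotary dimension d is
    compressed by its own scale[d], with larger divisors for the
    low-frequency tail dimensions. Illustrative only."""
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    if position < window:
        return position * inv_freq        # no interpolation for early tokens
    return (position / scale) * inv_freq  # per-dimension rescaling

# Head dimensions barely scaled, tail dimensions heavily scaled:
dim = 8
scale = np.linspace(1.0, 8.0, dim // 2)
angles = nonuniform_rope_angles(100, dim, scale, window=32)
```

The evolutionary search would then treat `scale` and `window` as the genome and score candidates by perplexity on a guidance set of long sequences.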

You're able to show competitive performance with data-efficient continued training (1k steps, ??? tokens) in terms of perplexity at long context (evaluated on some long-sequence eval set), as well as good passkey-retrieval accuracy on the following task:

There is an important info hidden inside a lot of irrelevant text. Find it and memorize them. I will quiz you about the important information there.

The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. (repeat x times)

The pass key is 17865. Remember it. 17865 is the pass key.

The grass is green. The sky is blue. The sun is yellow. Here we go. There and back again. (repeat y times)

What is the pass key? The pass key is
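For concreteness, the passkey-retrieval prompt quoted above can be assembled programmatically; this is a hypothetical helper (the function name and parameters are mine), where `n_before` and `n_after` stand for the x and y repeat counts:

```python
def build_passkey_prompt(pass_key, n_before, n_after):
    """Assemble the passkey-retrieval prompt: filler text repeated
    n_before times, the passkey sentence, filler repeated n_after
    times, then the retrieval question."""
    filler = ("The grass is green. The sky is blue. The sun is yellow. "
              "Here we go. There and back again. ")
    return (
        "There is an important info hidden inside a lot of irrelevant text. "
        "Find it and memorize them. I will quiz you about the important "
        "information there.\n"
        + filler * n_before
        + f"\nThe pass key is {pass_key}. Remember it. "
        + f"{pass_key} is the pass key.\n"
        + filler * n_after
        + "\nWhat is the pass key? The pass key is"
    )

prompt = build_passkey_prompt(17865, n_before=3, n_after=3)
```

Sweeping `n_before` and `n_after` moves the passkey to different depths of the context, which is how retrieval accuracy is typically measured across positions.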

We can index and summarize tens of thousands of pages of complex medical records. Our largest file so far has been 38,000 pages. Not theoretical. Real summaries. Identifying specific document types, dates of service, focused on client directed clinical issues or medico-legal issues. We provide these to clients every day. Absolutely zero human involvement and guaranteed zero hallucinations.

How do you deal with such long files?

This is an automated message from the Librarian Bot. I found the following papers similar to this paper.

The following papers were recommended by the Semantic Scholar API

Please give a thumbs up to this comment if you found it helpful!

If you want recommendations for any Paper on Hugging Face, check out this Space

You can directly ask Librarian Bot for paper recommendations by tagging it in a comment: @librarian-bot recommend

Guys, how can I use @librarian-bot recommend? I've just tried it on another paper page and it did not seem to work.

