E^2-LLM: Efficient and Extreme Length Extension of Large Language Models

Published on Jan 13 · Featured in Daily Papers on Jan 17
Jie Fu ,


Typically, training LLMs with long context sizes is computationally expensive, requiring extensive training hours and GPU resources. Existing long-context extension methods usually need additional training procedures to support the corresponding long-context windows, which requires long-context training data (e.g., 32k) and incurs high GPU training costs. To address these issues, we propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced computation cost, which also removes the need to collect long-context data. Concretely, first, the training data of E^2-LLM only requires a short length (e.g., 4k), which greatly reduces the tuning cost. Second, the training procedure on the short training context window is performed only once, and different evaluation context windows can be supported at inference. Third, in E^2-LLM, based on RoPE position embeddings, we introduce two different augmentation methods on the scale and position index parameters for different samples in training. This aims to make the model more robust to the different relative differences when directly interpolating to an arbitrary context length at inference. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of E^2-LLM on challenging long-context tasks.
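The augmentation described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not give the exact formulas or sampling distributions, so the position rule `(t + i) / g`, the uniform integer sampling, and the names `sample_positions`, `rope_angles`, `max_scale`, and `max_offset` are all assumptions made here for illustration.

```python
import numpy as np

def sample_positions(seq_len, max_scale, max_offset, rng):
    """Draw a scale g and an offset t for one training sample.

    Assumption: both are sampled uniformly from integer ranges, and the
    augmented position of token i is (t + i) / g. The abstract only says
    that the scale and position index are augmented per sample.
    """
    g = int(rng.integers(1, max_scale + 1))    # hypothetical scale range
    t = int(rng.integers(0, max_offset + 1))   # hypothetical offset range
    positions = (t + np.arange(seq_len)) / g
    return g, t, positions

def rope_angles(positions, head_dim, base=10000.0):
    """Standard RoPE rotation angles: positions times inverse frequencies."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(positions, inv_freq)       # shape (seq_len, head_dim // 2)

rng = np.random.default_rng(0)
g, t, pos = sample_positions(seq_len=8, max_scale=4, max_offset=16, rng=rng)
angles = rope_angles(pos, head_dim=16)
```

Under these assumptions, a short window (here 8 tokens) sees a different spacing `1 / g` and a different starting point `t / g` for each sample, so the model is exposed to many relative distances without ever training on long sequences.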



Hi folks, after performing the sampled scaled + shifted fine-tuning, do you see the resulting model improve extrapolation at long sequences (> previously trained context window) without scaling up g (for free)?

A common suspicion many people have is that the self-attention overfits the (admittedly very sparse) integer relative positions (e.g. 0 .. 2048 or 0 .. 4096), and that this is coupled with some approximation-theoretic failures. This could be why extrapolation fails so catastrophically: the attention doesn't learn the necessary representations to use the rotary encoding (e.g. the rotational invariance) and instead overfits an approximation (maybe a polynomial) that breaks down at the training boundary.

The scheme presented in $E^2$-LLM seems to resolve the sparsity issue, and if the suspicion is correct, you should also see a corresponding improvement in extrapolation without Positional Interpolation during inference (as long as self-attention finds a way to learn the proper representation for the rotary encoding).
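For reference, the rotational invariance mentioned above can be checked numerically with a small sketch. It uses a plain NumPy RoPE (the helper names `apply_rope` and `scores` are made up here, not taken from the paper's code) and shows that attention scores are unchanged by a shift of all positions but do change when positions are rescaled, which is the part a model would have to learn to handle.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Rotate consecutive feature pairs of x (seq_len, dim) by RoPE angles."""
    dim = x.shape[-1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    ang = np.outer(positions, inv_freq)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def scores(q, k, positions):
    """Unnormalized attention scores after applying RoPE to q and k."""
    return apply_rope(q, positions) @ apply_rope(k, positions).T

rng = np.random.default_rng(0)
q = rng.normal(size=(6, 8))
k = rng.normal(size=(6, 8))
base_pos = np.arange(6, dtype=float)

s_base = scores(q, k, base_pos)
s_shifted = scores(q, k, base_pos + 100.0)  # same relative distances
s_scaled = scores(q, k, base_pos / 2.0)     # relative distances halved

print(np.allclose(s_base, s_shifted))
print(np.allclose(s_base, s_scaled))
```

The shift leaves every score intact because each score depends only on the difference between the two positions; rescaling changes those differences, so the scores move.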
