arxiv:2403.00071

Resonance RoPE: Improving Context Length Generalization of Large Language Models

Published on Feb 29
· Featured in Daily Papers on Mar 4
Abstract

This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE), where models pre-trained on shorter sequences face difficulty with out-of-distribution (OOD) token positions in longer sequences. We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios by refining the interpolation of RoPE features for OOD positions, significantly improving the model performance without additional online computational costs. Furthermore, we present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios, aiming to isolate the constantly increasing difficulty of token generation on long contexts from the challenges of recognizing new token positions. Our experiments on synthetic tasks show that after applying Resonance RoPE, Transformers recognize OOD positions better and more robustly. Our extensive LLM experiments also show superior performance after applying Resonance RoPE to the current state-of-the-art RoPE scaling method, YaRN, on both upstream language modeling tasks and a variety of downstream long-text applications.

Community

Interesting strategy. As I understand the method, it specifically tackles the pre-critical region (per the RoPE scaling law paper): dimensions where the wavelength < max training length L.

It sounds like the problem is that there's a feature gap in this pre-critical region due to the dynamics of non-integer wavelengths/periods combined with integer positions. If you envision your rotary encoding (say just the sin or cos term) for a particular dimension as a disk, then each position corresponds to some Δθ rotation. If your wavelength is non-integer but your positions are integers, then you're bound to overshoot after one period and start your second rotation with a slight phase shift.

(Figure: disk diagram of the per-position rotation and the resulting phase shift.)

Say the period/wavelength here is 12.5; then at the 13th position you start the second rotation with a slight phase shift of 0.5.
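A tiny numeric check of this (my own sketch, not from the paper): with a wavelength of 12.5 and integer positions, the phases hit in the second rotation are offset from everything seen in the first.

```python
import numpy as np

wavelength = 12.5                       # non-integer period from the example above
positions = np.arange(26)               # integer token positions, about two rotations' worth
phase = positions % wavelength          # where each position lands within the current period
angle = 2 * np.pi * phase / wavelength  # the rotary angle this dimension assigns to each position

print(phase[11:15])                     # [11.  12.   0.5  1.5] -> position 13 starts the second
                                        # rotation offset by 0.5, an angle no position in the
                                        # first period ever produced
```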

I believe the paper posits that in the pre-critical region (especially for the higher-frequency, earlier dimensions), these phase shifts are responsible for a lot of the OOD behavior during interpolation, since short-sequence training may never have exposed the model to certain phase-shifted angles.

To address this, they simply round each wavelength/period to an integer so that there is never a phase shift (since the position increments are always 1). Formally, they prove that in the pre-critical region, using integer wavelengths guarantees a feature gap of zero, i.e. every angle those dimensions can produce at OOD positions has already been seen during training.
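My reading of the rounding step, as a minimal sketch (not the authors' implementation; in the paper it is applied on top of an existing scaling method such as YaRN):

```python
import numpy as np

def resonance_frequencies(dim: int = 128, base: float = 10000.0) -> np.ndarray:
    theta = base ** (-np.arange(0, dim, 2) / dim)   # standard RoPE angular frequencies
    wavelength = 2 * np.pi / theta                  # each feature's period, in tokens
    rounded = np.maximum(np.round(wavelength), 1.0) # snap every period to the nearest integer
    return 2 * np.pi / rounded                      # back to frequencies; integer positions now
                                                    # revisit exactly the same angles each rotation
```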


One thing I will say is that it's interesting that this is the strategy. Previously, several works have tried to increase the "unique angles-seen" coverage to improve OOD interpolation performance (instead of avoiding OOD angles altogether). This is a markedly different philosophy: avoid having the model encounter, or need to adapt to, any OOD angles at all, at least within the pre-critical dimensions.

For example:

  1. https://huggingface.co/papers/2401.06951 (E^2-LLM): adds fractional position scaling (PI) plus a shift, broadening coverage of the angle space to help interpolation performance on OOD positions
  2. https://huggingface.co/papers/2309.10400 (PoSE): similar to E^2-LLM, but without the fractional PI scaling, and the shifted chunks can be discontiguous (a rough sketch of the shared idea follows below)
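For the flavor of that "broaden the angle coverage" idea, a loose illustration with made-up numbers (my own, not either paper's exact recipe):

```python
import numpy as np

rng = np.random.default_rng(0)
train_len, wavelength = 16, 12.5

# Angles one dimension sees with plain integer positions 0..train_len-1
plain_angles = 2 * np.pi * (np.arange(train_len) % wavelength) / wavelength

# One training example with a fractional scale (PI-style) plus a random shift:
# the same short window now lands on in-between angles that integer positions
# never produce, which is exactly what interpolation will query later.
scale = rng.uniform(0.25, 1.0)
shift = rng.uniform(0.0, wavelength)
remapped = shift + scale * np.arange(train_len)
covered_angles = 2 * np.pi * (remapped % wavelength) / wavelength
```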

Question for the paper authors: could/does this rounding result in less distinctness between positions?

It sounds like this is primarily a potential concern for earlier dimensions (higher frequency) since they would be more likely to do several rotations. Is the idea that as long as the sequence length is < maximum wavelength, you can guarantee uniqueness (or is it a stronger claim? is this the 7e51 figure on page 5?)
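My rough guess at where a number of that magnitude could come from (a back-of-the-envelope sketch on my part, not a definition from the paper): once the pre-critical wavelengths are integers, the tuple of their phases repeats only after their least common multiple, so positions within that LCM span all get distinct pre-critical feature vectors.

```python
import math
import numpy as np

def precritical_period(dim: int = 128, base: float = 10000.0, train_len: int = 4096) -> int:
    theta = base ** (-np.arange(0, dim, 2) / dim)                  # standard RoPE frequencies
    wavelengths = np.round(2 * np.pi / theta).astype(int)          # Resonance-style integer periods
    precritical = [int(w) for w in wavelengths if w <= train_len]  # dims fitting >= 1 full period
    # (p mod w_1, ..., p mod w_k) repeats only every lcm(w_1, ..., w_k) tokens
    return math.lcm(*precritical)
```

With a LLaMA-style RoPE configuration this LCM is astronomically large, which may be what the page-5 figure reflects.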

Paper author

Hi, thank you for your interest in our paper! Although each RoPE feature repeats in order to better model interactions within a specific span length, overall position uniqueness is guaranteed by the entire RoPE feature vector, which considers all RoPE dimensions together. This is why we mention the 7e51 value: it is the number of unique (pre-critical parts of) position embeddings Resonance RoPE can produce with LLaMA/LLaMA2's RoPE configuration, which is far more than sufficient for real-world applications.
