Data Engineering for Scaling Language Models to 128K Context

Published on Feb 15 · Submitted by akhaliq on Feb 16
Yao Fu ,


We study the continual pretraining recipe for scaling language models' context lengths to 128K, with a focus on data engineering. We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired through large-scale pretraining, and that this capability can be readily extended to contexts substantially longer than seen during training (e.g., 4K to 128K) through lightweight continual pretraining on an appropriate data mixture. We investigate the quantity and quality of the data for continual pretraining: (1) for quantity, we show that 500 million to 5 billion tokens are enough to enable the model to retrieve information anywhere within the 128K context; (2) for quality, our results equally emphasize domain balance and length upsampling. Concretely, we find that naively upsampling longer data on certain domains like books, a common practice of existing work, gives suboptimal performance, and that a balanced domain mixture is important. We demonstrate that continual pretraining of the full model on 1B-5B tokens of such data is an effective and affordable strategy for scaling the context length of language models to 128K. Our recipe outperforms strong open-source long-context models and closes the gap to frontier models like GPT-4 128K.
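
As a rough illustration of what "domain balance plus length upsampling" could look like in practice, here is a minimal sketch. It is not the authors' pipeline: the function name, the `domain` / `n_tokens` fields, the 4096-token threshold, and the 4x weight are all hypothetical choices made for this example. The idea is to give each domain the same share of the token budget that it has in the original corpus, and to favor long documents only within a domain.

```python
import random
from collections import defaultdict

def per_domain_length_upsample(docs, token_budget, long_threshold=4096,
                               long_weight=4.0, seed=0):
    """Sample documents (with replacement) so that each domain keeps its
    original share of tokens, while long documents are favored inside a domain.

    docs: list of dicts with hypothetical keys "domain" and "n_tokens".
    """
    rng = random.Random(seed)

    # Group documents by domain.
    by_domain = defaultdict(list)
    for doc in docs:
        by_domain[doc["domain"]].append(doc)

    total_tokens = sum(doc["n_tokens"] for doc in docs)
    sampled = []
    for domain, group in by_domain.items():
        # The domain's token budget mirrors its share of the original corpus.
        domain_share = sum(doc["n_tokens"] for doc in group) / total_tokens
        domain_budget = domain_share * token_budget

        # Within the domain, long documents get a larger sampling weight.
        weights = [long_weight if doc["n_tokens"] >= long_threshold else 1.0
                   for doc in group]

        used = 0
        while used < domain_budget:
            doc = rng.choices(group, weights=weights, k=1)[0]
            sampled.append(doc)
            used += doc["n_tokens"]
    return sampled
```

Because the budget is fixed per domain before any length weighting is applied, boosting long documents cannot shift the mixture toward domains that happen to contain more long texts (such as books). Each domain can overshoot its budget by at most one document.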


Is the idea here mainly:

  1. Data - (novel contribution) continual pretraining while preserving the pretraining data mixture (to avoid biasing benchmark performance in other areas, in contrast to, e.g., just training on long-form books)
  2. Architecture - minimal changes beyond Adjusted Base Frequency (ABF; changing the RoPE base from 10,000 to 500,000 à la Code LLaMA).
  3. Training - with recent sub-quadratic memory optimizations (FlashAttention), brute-force training with long sequences is no longer prohibitively expensive, and a large part of the latency bottleneck has shifted to linear IO cost (for sequences shorter than ~50K tokens). I believe FlashAttention 2 proposes a double-buffering technique that can also help "overlap" these IO and GEMM costs to avoid serializing on them.
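
To make point 2 concrete, here is a small sketch of what changing the RoPE base does to the rotary frequencies. It uses the standard RoPE definition, theta_i = base^(-2i/d); the head dimension of 128 and the helper names are assumptions for illustration, not values taken from the paper.

```python
import math

def rope_inv_freqs(head_dim, base):
    """Standard RoPE inverse frequencies: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2 * i / head_dim) for i in range(head_dim // 2)]

def longest_wavelength(head_dim, base):
    """Number of positions after which the slowest-rotating pair repeats."""
    slowest = rope_inv_freqs(head_dim, base)[-1]
    return 2 * math.pi / slowest

HEAD_DIM = 128  # assumed head dimension, for illustration only
for base in (10_000, 500_000):
    print(f"base={base}: longest wavelength ~ "
          f"{longest_wavelength(HEAD_DIM, base):,.0f} positions")
```

Under these assumptions, the slowest-rotating dimension completes a full cycle in fewer than 128K positions with base 10,000, but not with base 500,000. That is one common intuition for why a larger base helps at long context, though it is only an intuition.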

I believe another paper also proposes something very similar (continual pretraining with an ABF base of 500,000 as the only minor architectural change), but using many more tokens for continual pretraining and without preserving the same pretraining data mixture.

Paper author
edited Feb 21

I tend to view the contribution as data and data alone: not only the data composition but also the data scale.

When comparing this work with that paper, note a fundamental difference: we hypothesize that the long-context capability is already within the base model, and one only needs very lightweight continual pretraining to unlock it, i.e., only about 5B tokens of data. This is good news for research and open source.

But that paper (implicitly) holds the opposite belief, that the long-context capability is NOT within the base model, and its authors continue pretraining on 400B tokens. This sends an inaccurate and costly message to the community, as it suggests that long context can be as expensive as pretraining.

Consequently, imagine a company trying to build a long-context model. Before our paper, suppose they follow that approach; then they may need to run 128 A100s for two weeks. Knowing our message, they can reduce their cost to 8 A100s for 5 days. This is a million-dollar cost reduction.

And it already happened :)
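
For a sense of scale, the two budgets quoted above can be compared in GPU-hours. This is only back-of-the-envelope arithmetic on the figures in the comment; the helper name is made up for the example, and no GPU price is assumed.

```python
def gpu_hours(n_gpus, days):
    """Total GPU-hours for n_gpus running continuously for the given days."""
    return n_gpus * days * 24

heavy = gpu_hours(128, 14)  # 128 A100s for two weeks -> 43008 GPU-hours
light = gpu_hours(8, 5)     # 8 A100s for 5 days      -> 960 GPU-hours

print(f"heavy: {heavy} GPU-hours, light: {light} GPU-hours")
print(f"reduction: {heavy / light:.1f}x")  # 44.8x
```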
