World Model on Million-Length Video And Language With RingAttention

Published on Feb 13
· Featured in Daily Papers on Feb 14


Current language models fall short in understanding aspects of the world not easily described in words, and struggle with complex, long-form tasks. Video sequences offer valuable temporal information absent in language and static images, making them attractive for joint modeling with language. Such models could develop an understanding of both human textual knowledge and the physical world, enabling broader AI capabilities for assisting humans. However, learning from millions of tokens of video and language sequences poses challenges due to memory constraints, computational complexity, and limited datasets. To address these challenges, we curate a large dataset of diverse videos and books, utilize the RingAttention technique to scalably train on long sequences, and gradually increase context size from 4K to 1M tokens. This paper makes the following contributions: (a) Largest context size neural network: We train one of the largest context size transformers on long video and language sequences, setting new benchmarks in difficult retrieval tasks and long video understanding. (b) Solutions for overcoming vision-language training challenges, including masked sequence packing for mixing different sequence lengths, loss weighting to balance language and vision, and a model-generated QA dataset for long sequence chat. (c) A highly optimized implementation with RingAttention, masked sequence packing, and other key features for training on million-length multimodal sequences. (d) A fully open-sourced family of 7B parameter models capable of processing long text documents (LWM-Text, LWM-Text-Chat) and videos (LWM, LWM-Chat) of over 1M tokens. This work paves the way for training on massive datasets of long video and language to develop an understanding of both human knowledge and the multimodal world, and broader capabilities.



It's super weird that Gemini 1.5 is receiving all the praise for its long-context capabilities when you guys achieved the same thing, arguably more efficiently, in dense models with slightly better performance. That is WILD. So flowers to you guys....GREAT WORK!

Hi folks, I'm trying to understand Ring Attention. Is my reading of this approach correct?

  1. Start with N devices connected together in a ring topology
  2. Each of the N devices is assigned a sequence block and is responsible for computing that block end to end, from input to output, for some (all?) layers, i.e. this is a form of sequence parallelism
    • I'm guessing there's still standard pipeline parallelism? I.e. we still have different groups/pods of devices assigned to different groups of layers?
  3. For each sequence block, concurrently calculate the partial softmaxed attention scores (which requires cycling through each set of kv-blocks)
    • At each inner-round (to cycle through the kv-blocks), calculate + accumulate the partial attention scores for the current kv-block we hold (GEMM, compute bound)
    • Simultaneously, send/recv (via triple buffering?) the kv-blocks for the next round.
  4. As soon as the final attention is available, start the blockwise FFN
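To make steps 1 to 3 concrete, here is a minimal single-host sketch in NumPy. It is not the authors' implementation: the function names `ring_attention` and `full_attention` are hypothetical, the "devices" are just list indices, the ring send/recv is simulated by rotating a Python list, and it covers only a single head with no causal mask and no communication/compute overlap. What it does show is the accumulation part: each device keeps a running max, numerator, and denominator, so partial softmax results can be updated one kv-block at a time.

```python
import numpy as np

def ring_attention(q_blocks, k_blocks, v_blocks):
    """Simulate N ring-connected devices (assumed: single head, non-causal)."""
    n = len(q_blocks)
    d = q_blocks[0].shape[-1]
    d_v = v_blocks[0].shape[-1]
    # Per-device running state for the streaming (online) softmax.
    num = [np.zeros((q.shape[0], d_v)) for q in q_blocks]        # sum of exp(s) @ v
    den = [np.zeros((q.shape[0], 1)) for q in q_blocks]          # sum of exp(s)
    mx = [np.full((q.shape[0], 1), -np.inf) for q in q_blocks]   # running row max
    kv = list(zip(k_blocks, v_blocks))  # kv[i] = block currently held by device i

    for _ in range(n):                  # n inner rounds, one per kv-block
        for i in range(n):              # conceptually runs in parallel on all devices
            k, v = kv[i]
            s = q_blocks[i] @ k.T / np.sqrt(d)
            new_mx = np.maximum(mx[i], s.max(axis=-1, keepdims=True))
            scale = np.exp(mx[i] - new_mx)   # rescale earlier partial sums
            p = np.exp(s - new_mx)
            num[i] = num[i] * scale + p @ v
            den[i] = den[i] * scale + p.sum(axis=-1, keepdims=True)
            mx[i] = new_mx
        # Ring step: device i passes its kv-block to device i + 1.
        kv = kv[-1:] + kv[:-1]

    return [nu / de for nu, de in zip(num, den)]

def full_attention(q, k, v):
    """Reference: ordinary softmax attention over the whole sequence."""
    s = q @ k.T / np.sqrt(q.shape[-1])
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v
```

Splitting a sequence into equal blocks, running `ring_attention`, and concatenating the per-device outputs should match `full_attention` on the unsplit arrays up to floating-point error. In a real multi-device setup the list rotation would be a collective permute, issued so that it overlaps with the block matmuls.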

There's an optimal chunk size that varies with the link bandwidth to achieve communication-computation overlap (assuming a fixed device profile for GEMM compute throughput), with the intuition that slower links require larger minimum chunk sizes.
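The intuition above can be written as a rough back-of-the-envelope bound. This is a sketch under assumptions that are not stated in the comment: block length $c$ tokens, hidden size $d$, $b$ bytes per element, device throughput $F$ FLOP/s, link bandwidth $B$ bytes/s, and only the two attention matmuls counted per inner round.

```latex
% Per inner round, per device (assumed cost model):
%   compute:        q K^T and P V, each about 2 c^2 d FLOPs
%   communication:  one K block and one V block, 2 c d elements of b bytes
\begin{align*}
  T_{\text{compute}} &= \frac{4 c^{2} d}{F}, &
  T_{\text{comm}}    &= \frac{2 c d\, b}{B}.
\end{align*}
% Communication is hidden when T_compute >= T_comm:
\begin{align*}
  \frac{4 c^{2} d}{F} \ge \frac{2 c d\, b}{B}
  \quad\Longleftrightarrow\quad
  c \ge \frac{b}{2}\cdot\frac{F}{B}.
\end{align*}
% With b = 2 (16-bit elements) this reduces to c >= F / B.
```

Under this model the minimum block size grows as $F/B$, so halving the link bandwidth doubles the smallest chunk that still hides communication. For illustrative (hypothetical) values $F = 3\times10^{14}$ FLOP/s, $B = 3\times10^{11}$ bytes/s, and $b = 2$, the bound gives $c \ge 1000$ tokens per device.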

In particular, this is different from "brute-force" sequence parallelism prior to Ring Attention in that this doesn't just depend on naive Scatter/All-Gather schemes which:

  1. Impose a forced synchronization point (for K) before the $qK^T$ product (to be fair, RingAttention still needs to synchronize before it uses each k,v block, but it can incrementally advance the partial attention without waiting for an All-Gather to complete, making it possible to fully overlap communication), and
  2. May cause the per-device attention memory usage to be unbounded as the sequence length increases to infinity
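A small arithmetic sketch of the memory point in item 2. The helper names and the `buffers=2` default (one kv-block in use, one in flight) are assumptions made for illustration, and the count covers only K/V storage, ignoring queries, outputs, and the score tile.

```python
def kv_elements_all_gather(seq_len: int, hidden: int) -> int:
    """K and V elements per device after an All-Gather of the full sequence."""
    return 2 * seq_len * hidden

def kv_elements_ring(block_len: int, hidden: int, buffers: int = 2) -> int:
    """K and V elements per device when only `buffers` kv-blocks are resident."""
    return buffers * 2 * block_len * hidden

# Illustrative numbers: 1M-token sequence, hidden size 4096, 4096-token blocks.
gathered = kv_elements_all_gather(1_000_000, 4096)
ringed = kv_elements_ring(4096, 4096)
print(gathered, ringed, gathered / ringed)
```

The gathered footprint grows linearly with total sequence length, while the ring footprint depends only on the block length. Note that with a fixed number of devices the block length itself is `seq_len / N`, so "bounded" here means bounded for a fixed block size as more devices are added.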

Is that the right idea, i.e. that a progressive blockwise attention scheme is the only way to fully overlap communication and compute while keeping the per-device memory usage bounded?


Additionally, how are the RoPE base frequencies tuned? Do they follow some specific scaling recipe (e.g.

