arxiv:2405.07719

A Unified Sequence Parallelism Approach for Long Context Generative AI

Published on May 13, 2024

Abstract

Sequence parallelism (SP), which divides the sequence dimension of input tensors across multiple computational devices, is becoming key to unlocking the long-context capabilities of generative AI models. This paper investigates the state-of-the-art SP approaches, i.e., DeepSpeed-Ulysses and Ring-Attention, and proposes a unified SP approach that is more robust to transformer model architectures and network hardware topology. The paper compares the communication and memory costs of SP with those of existing parallelism strategies, including data/tensor/ZeRO/expert/pipeline parallelism, and discusses best practices for designing hybrid 4D parallelism involving SP. Using SP, we achieved 86% MFU on two 8xA800 nodes at a sequence length of 208K for the LLAMA3-8B model. Our code is publicly available at https://github.com/feifeibear/long-context-attention.
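To make the layout change behind a DeepSpeed-Ulysses-style sequence parallelism step concrete, the sketch below simulates, in a single process, how a sequence-sharded activation can be regrouped into a head-sharded one via an all-to-all-like exchange. All tensor names, shapes, and the helper function are illustrative assumptions, not the authors' implementation; a real system would use a distributed all-to-all collective across devices.

import torch
# Minimal single-process sketch (assumed shapes, not the paper's code).
N = 4                  # simulated sequence-parallel degree ("devices")
L, H, D = 16, 8, 32    # full sequence length, attention heads, head dim
assert L % N == 0 and H % N == 0
# Sequence-parallel input: "device" i holds tokens [i*L/N, (i+1)*L/N) for all H heads.
shards = [torch.randn(L // N, H, D) for _ in range(N)]
def all_to_all_seq_to_head(shards):
    # Simulated Ulysses-style all-to-all: each device sends H/N heads of its
    # token slice to every other device and receives the full sequence for H/N heads.
    out = []
    for dst in range(N):
        head_slice = slice(dst * H // N, (dst + 1) * H // N)
        # concatenate token slices from every source device along the sequence dim
        out.append(torch.cat([s[:, head_slice, :] for s in shards], dim=0))
    return out         # each entry: (L, H/N, D)
head_shards = all_to_all_seq_to_head(shards)
print([t.shape for t in head_shards])   # four tensors of shape (16, 2, 32)
# Each simulated device can now run standard attention over the full sequence
# for its subset of heads; a second all-to-all would restore the sequence layout.

The point of the exchange is that attention needs the whole sequence to compute its softmax, so a sequence-sharded layout is traded for a head-sharded one around the attention block and traded back afterward; Ring-Attention instead keeps the sequence sharded and circulates key/value blocks between devices.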
