arxiv:2402.03791

Adaptive Blockwise Task-interleaved Pipeline Parallelism

Published on Feb 6, 2024

Abstract

Efficient distributed training is an essential foundation for the development of large-scale neural networks, and a variety of pipeline parallelism methods are widely used in distributed training. In this paper, we propose ZeroPP, a highly efficient and flexible pipeline parallelism method that trades off pipeline bubbles, memory usage, and communication through adaptive scheduling units. ZeroPP achieves minimal pipeline bubbles by carefully staggering the forward, input gradient, and weight gradient computation tasks within a scheduling unit. Additionally, ZeroPP optimizes the combination of pipeline parallelism and fully sharded data parallelism using a blockwise schedule. We conduct experiments with popular GPT-style models and observe up to a 30% increase in throughput compared to the state-of-the-art breadth-first pipeline parallelism. Our evaluation also shows up to a 68% increase in throughput and a 10% reduction in memory consumption compared to the memory-efficient 1F1B method.
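The abstract treats the forward, input gradient, and weight gradient computations as separate tasks. The sketch below is not ZeroPP's scheduler or API. It only illustrates, for a single linear layer y = W x, why the backward pass can be split this way: the input gradient is what the previous pipeline stage is waiting for, while the weight gradient is not needed until the optimizer step and can be deferred. All function names and the `deferred` queue are hypothetical and chosen for illustration.

```python
# Illustrative only: hypothetical helper names, not taken from the paper.

def forward(W, x):
    """Compute y = W x for a weight matrix W (list of rows) and input vector x."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def input_grad(W, dy):
    """Compute dL/dx = W^T dy. The upstream pipeline stage depends on this."""
    return [sum(W[i][j] * dy[i] for i in range(len(W))) for j in range(len(W[0]))]

def weight_grad(x, dy):
    """Compute dL/dW = dy x^T. Only the optimizer step depends on this."""
    return [[dyi * xj for xj in x] for dyi in dy]

W = [[1.0, 2.0], [3.0, 4.0]]
x = [1.0, -1.0]
dy = [0.5, 2.0]  # gradient arriving from the next stage

y = forward(W, x)

# Backward, step 1: produce the input gradient right away so the
# previous stage can start its own backward pass.
dx = input_grad(W, dy)

# Backward, step 2: queue the weight-gradient task instead of running it now.
deferred = [(x, dy)]

# Later, when the device would otherwise be idle, drain the queue.
dW = [weight_grad(saved_x, saved_dy) for saved_x, saved_dy in deferred][0]
```

Because `weight_grad` needs only the saved input and the incoming gradient, a scheduler is free to place it wherever it fills idle time, at the cost of keeping `x` and `dy` in memory until then.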
