arXiv:2402.03791

Adaptive Blockwise Task-interleaved Pipeline Parallelism

Published on Feb 6, 2024

Abstract

Efficient distributed training serves as a powerful catalyst and an essential foundation for the development of large-scale neural networks. In distributed training scenarios, various pipeline parallelism methods are cleverly designed and widely employed. In this paper, we propose ZeroPP, a highly efficient and flexible pipeline parallelism method that trades off pipeline bubbles, memory usage, and communication through adaptive scheduling units. ZeroPP achieves minimal pipeline bubbles by carefully staggering the computation tasks of forward, input gradient, and weight gradient within a scheduling unit. Additionally, ZeroPP optimizes the combination of pipeline parallelism and fully sharded data parallelism using a blockwise schedule. We conduct experiments with popular GPT-style models and observe up to a 30% increase in throughput compared to the state-of-the-art breadth-first pipeline parallelism. In addition, our evaluation demonstrates up to a 68% increase in throughput and a 10% reduction in memory consumption compared to the memory-efficient 1F1B method.
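
The abstract's central idea, splitting each microbatch's backward pass into an input-gradient task and a weight-gradient task that can be deferred to fill pipeline bubbles, can be sketched with a small scheduling toy. This is a hypothetical illustration, not the paper's schedule: the function `unit_schedule`, the warm-up formula, and the ordering heuristic below are assumptions made purely for demonstration.

```python
# Illustrative sketch only (assumed ordering, not ZeroPP's actual algorithm):
# each microbatch contributes three tasks per stage -- 'F' (forward),
# 'B' (input-gradient backward), and 'W' (weight-gradient) -- and the W tasks
# are queued and emitted later to occupy slots that would otherwise be bubbles.
from collections import deque


def unit_schedule(num_stages: int, stage: int, microbatches: int):
    """Return an ordered list of (task, microbatch) pairs for one scheduling unit."""
    # Assumed warm-up rule: earlier stages issue more forwards before the first backward.
    warmup = min(num_stages - stage - 1, microbatches)
    order = []
    pending_w = deque()

    # Warm-up phase: forwards only.
    for mb in range(warmup):
        order.append(("F", mb))

    # Steady state: alternate forward and input-gradient backward, deferring weight grads.
    next_f, next_b = warmup, 0
    while next_b < microbatches:
        if next_f < microbatches:
            order.append(("F", next_f))
            next_f += 1
        order.append(("B", next_b))
        pending_w.append(next_b)
        next_b += 1
        # Once no forwards remain, use the idle slot for a deferred weight-gradient task.
        if next_f >= microbatches and pending_w:
            order.append(("W", pending_w.popleft()))

    # Drain any remaining weight-gradient tasks at the end of the unit.
    while pending_w:
        order.append(("W", pending_w.popleft()))
    return order


if __name__ == "__main__":
    # Print one possible task order per stage for a 4-stage pipeline and 4 microbatches.
    for s in range(4):
        print(f"stage {s}:", unit_schedule(num_stages=4, stage=s, microbatches=4))
```

Running the sketch shows the intended effect: late stages interleave F/B pairs immediately and push all W tasks to the tail, while early stages slot W tasks between backwards once their forwards are exhausted, so no stage sits idle waiting for the backward wavefront.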
