Wan Xinyi committed on

Commit 29c9647
1 parent: 4b2c8d9

Update description

Files changed (3):
  1. app.py +3 -1
  2. description1.md +34 -0
  3. description2.md +10 -0
app.py CHANGED
@@ -64,7 +64,8 @@ def calculate(p, m, f, b, w, c, mem):
     return [baseline_time, baseline_bubble, baseline_acceleration, baseline_image, zb_time, zb_bubble, zb_acceleration, zb_image, zbv_time, zbv_bubble, zbv_acceleration, zbv_image]
 
 with gr.Blocks() as demo:
-    gr.Markdown("Zero bubble pipeline parallel bubble calculator")
+    gr.Markdown(open("description1.md").read())
+    gr.Markdown("# Pipeline Scheduler Playground")
     with gr.Row():
         with gr.Column(scale=1):
             with gr.Group():
@@ -123,4 +124,5 @@ with gr.Blocks() as demo:
         with gr.Column(scale=4):
             zbv_image=gr.Image(None, interactive=False, label="Schedule Image")
     button.click(calculate, inputs=[p, m, f, b, w, c, mem], outputs=[baseline_time, baseline_bubble, baseline_acceleration, baseline_image, zb_time, zb_bubble, zb_acceleration, zb_image, zbv_time, zbv_bubble, zbv_acceleration, zbv_image])
+    gr.Markdown(open("description2.md").read())
 demo.launch()
description1.md ADDED
@@ -0,0 +1,34 @@
+ # Zero Bubble Pipeline Parallelism
+
+ Zero Bubble Pipeline Parallelism is a novel pipeline parallelism algorithm that reduces the bubble of pipeline parallelism to almost zero while preserving synchronous semantics.
+
+ Our paper is coming soon.
+
+ Try out our implementation based on Megatron at [https://github.com/sail-sg/zero-bubble-pipeline-parallelism](https://github.com/sail-sg/zero-bubble-pipeline-parallelism)
+
+ Experiments show that zero bubble pipeline parallelism can accelerate training by up to 30% with similar memory consumption. A detailed table of experiments is coming soon.
+
+ ## Zero Bubble Schedules
+ The key to achieving zero bubble is splitting the backward pass into a B pass (gradient w.r.t. the input) and a W pass (gradient w.r.t. the weights). B on one stage depends only on B on its next stage, whereas in 1F1B it depends on both B and W; this frees W to be scheduled later (see the sketch after this file).
+
+ ![image](https://hackmd.io/_uploads/Bkc7CL7N6.png)
+
+ ### Comparison of Schedules
+ * 1F1B
+ ![image](https://hackmd.io/_uploads/Hkq-gD7N6.png)
+ * ZB1P
+ ![image](https://hackmd.io/_uploads/Hy2GxwmEa.png)
+ * ZB2P
+ ![image](https://hackmd.io/_uploads/S10QgvmV6.png)
+ * ZBV - Each device is assigned exactly 2 chunks (virtual stages); white text marks the first chunk and black text marks the second chunk. The dependencies among model chunks follow a "V"-shaped pattern in both the forward and backward passes.
+ ![image](https://hackmd.io/_uploads/Sk9uyY4ra.png)
+
+
+
+
+ | Comparison assuming T_F = T_B = T_W | 1F1B | ZB1P | ZB2P | ZBV (Recommended) |
+ | ----------------------------------------------------- | ------- | -------- | ---- | ----------------- |
+ | Bubble Rate | (p-1)/m | (p-1)/3m | 0 | 0 |
+ | Activation Memory <br> (Compared to 1F1B) | 1x | 1x | 2x | 1x |
+ | Pipeline Communication Volume <br> (Compared to 1F1B) | 1x | 1x | 1x | 2x |
+
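To make the B/W split above concrete, here is a minimal sketch (plain PyTorch, hypothetical shapes, not the Megatron implementation) for a single linear layer: the gradient w.r.t. the input (B) and the gradient w.r.t. the weight (W) are independent computations, which is what allows W to be deferred to fill pipeline bubbles.

```python
import torch

# Forward (F) on one stage of a toy linear layer: y = x @ W.t()
x = torch.randn(8, 512)       # activations received from the previous stage
W = torch.randn(1024, 512)    # this stage's weight
y = x @ W.t()

grad_y = torch.randn_like(y)  # gradient arriving from the next stage

# B pass: gradient w.r.t. the input. The previous stage is blocked on this,
# so it should be computed and sent back as early as possible.
grad_x = grad_y @ W           # shape (8, 512)

# W pass: gradient w.r.t. the weight. Only the optimizer step needs it,
# so it can be scheduled later to fill what would otherwise be bubbles.
grad_W = grad_y.t() @ x       # shape (1024, 512)
```

The table's bubble rates quantify the gain from deferring W: with p = 4 stages and m = 16 microbatches, 1F1B has a bubble rate of (4-1)/16 ≈ 18.8%, ZB1P reduces it to (4-1)/(3·16) ≈ 6.3%, and ZB2P/ZBV eliminate it entirely, at the cost of 2x activation memory or 2x pipeline communication, respectively.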
description2.md ADDED
@@ -0,0 +1,10 @@
+
+
+ ## Optimizer Post Validation
+
+ In most practices of pipeline parallelism there is an all-reduce across all pipeline stages for numerical robustness, e.g. the global gradient norm for gradient clipping, or the INF/NAN check for mixed precision training. This all-reduce breaks the parallelogram shape of the schedule and makes zero bubble impossible.
+ Observing that during stable training both gradient clipping and INF/NAN corrections rarely trigger, we replace the beforehand synchronizations with a post-update validation.
+
+ ![image](https://hackmd.io/_uploads/B16R3q4N6.png)
+
+ We eagerly step the optimizers, assuming the gradient clipping and INF/NAN conditions are not triggered. If an amendment to the gradient turns out to be required, a rollback is issued and the optimizer step is redone based on the fully reduced global state.
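A minimal sketch of the post-validation idea (hypothetical function and variable names, not the actual Megatron implementation): step optimistically, and only on the rare path where the fully reduced global state invalidates the step, roll back and redo it.

```python
import torch

def step_with_post_validation(optimizer, params, max_norm, global_grad_norm):
    """Optimistically step, then validate against the fully reduced global
    gradient norm (a 0-dim tensor produced by the cross-stage all-reduce).
    Sketch only: a complete rollback would also restore optimizer state
    such as momentum buffers."""
    snapshot = [p.detach().clone() for p in params]  # cheap rollback point
    optimizer.step()  # eager step, using only local information

    # Common case during stable training: the optimistic step stands.
    if torch.isfinite(global_grad_norm) and global_grad_norm <= max_norm:
        return

    # Rare path: roll back, amend the gradients, and redo the step.
    with torch.no_grad():
        for p, s in zip(params, snapshot):
            p.copy_(s)
        if torch.isfinite(global_grad_norm):  # finite but too large: clip
            for p in params:
                if p.grad is not None:
                    p.grad.mul_(max_norm / global_grad_norm)
            optimizer.step()
        # INF/NAN case: keep the rolled-back weights, i.e. skip this step.
```

Here `global_grad_norm` stands in for the result of the cross-stage all-reduce that, in the baseline, would have to finish before any optimizer step could start.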