## Optimizer Post Validation
In most implementations of pipeline parallelism (PP), there is an all-reduce across all pipeline stages for numerical robustness, e.g. computing the global gradient norm for gradient clipping or checking for INF/NAN in mixed-precision training. This all-reduce breaks the parallelogram and makes zero bubble impossible.
Based on the observation that, during stable training, both gradient clipping and INF/NAN are rarely triggered, we replace the up-front synchronization with a post-update validation.
We eagerly step the optimizers, assuming that the gradient clipping and INF/NAN conditions are not triggered. If an amendment to the gradient turns out to be required, a rollback is issued and the optimizer step is redone based on the fully reduced global state.
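
The eager-step-then-validate idea can be illustrated with a minimal single-process sketch. It is not the repository's implementation. The `Stage` class, the `step_with_post_validation` function, the plain SGD update, and the `max_norm / global_norm` scaling rule are all assumptions made for illustration, and the cross-stage all-reduce is simulated by a plain Python `sum`.

```python
import math


class Stage:
    """Hypothetical pipeline stage holding a slice of the parameters."""

    def __init__(self, params, lr=0.1):
        self.params = list(params)
        self.lr = lr
        self._backup = None

    def local_sq_norm(self, grads):
        # Squared gradient norm of this stage only (no communication needed).
        return sum(g * g for g in grads)

    def eager_step(self, grads):
        # Keep a copy for rollback, then step as if no clipping were needed.
        self._backup = list(self.params)
        self.params = [p - self.lr * g for p, g in zip(self.params, grads)]

    def rollback(self):
        # Restore the parameters saved before the eager step.
        self.params = list(self._backup)

    def redo_step(self, grads, scale):
        # Undo the eager step and reapply it with the clipped gradient.
        self.rollback()
        self.params = [p - self.lr * scale * g
                       for p, g in zip(self.params, grads)]


def step_with_post_validation(stages, grads_per_stage, max_norm):
    # 1. Every stage steps optimistically, without waiting for a global sync.
    for stage, grads in zip(stages, grads_per_stage):
        stage.eager_step(grads)

    # 2. Validation after the update; the sum stands in for the all-reduce.
    total_sq = sum(stage.local_sq_norm(grads)
                   for stage, grads in zip(stages, grads_per_stage))
    global_norm = math.sqrt(total_sq)

    # 3. Rare path: INF/NAN detected, so undo the step and skip the update.
    if not math.isfinite(global_norm):
        for stage in stages:
            stage.rollback()
        return "skipped"

    # 4. Rare path: clipping was needed, so roll back and redo with scaling.
    if global_norm > max_norm:
        scale = max_norm / global_norm
        for stage, grads in zip(stages, grads_per_stage):
            stage.redo_step(grads, scale)
        return "clipped"

    # 5. Common path: the eager step was already correct.
    return "accepted"
```

In the common case the function returns `"accepted"` and no stage touches its parameters again, which is why the validation can be moved off the critical path. The rollback branches cost an extra parameter copy and a second update, but they are expected to run rarely during stable training.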