# Copyright (c) Facebook, Inc. and its affiliates.
# All rights reserved.
#
# This source code is licensed under the BSD-style license found in the
# LICENSE file in the root directory of this source tree.
""" | |
In the context of Torch Distributed Elastic we use the term *rendezvous* to | |
refer to a particular functionality that combines a **distributed | |
synchronization** primitive with **peer discovery**. | |
It is used by Torch Distributed Elastic to gather participants of a training | |
job (i.e. nodes) such that they all agree on the same list of participants and | |
everyone's roles, as well as make a consistent collective decision on when | |
training can begin/resume. | |
Torch Distributed Elastic rendezvous provides the following critical | |
functionalities: | |

**Barrier**:

Nodes performing rendezvous will all block until the rendezvous is considered
complete - this happens when at least ``min`` total number of nodes have joined
the rendezvous barrier (for the same job). This also implies the barrier is not
necessarily of fixed size.

There's an additional small waiting time after reaching ``min`` number of
nodes - this is used to ensure the rendezvous is not completed "too quickly"
(which could potentially exclude additional nodes attempting to join at
approximately the same time).

If ``max`` number of nodes is gathered at the barrier, the rendezvous is
completed immediately.

There's also an overall timeout which causes the rendezvous to fail if ``min``
number of nodes is never reached - this is meant to be a simple fail-safe to
help release partially allocated job resources, in case there's a problem with
the resource manager, and is meant to be interpreted as non-retryable.
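
Both waits are configurable when constructing the
:py:class:`.DynamicRendezvousHandler` described below. A minimal sketch,
assuming the ``RendezvousTimeout`` class from
``torch.distributed.elastic.rendezvous.dynamic_rendezvous``, where ``join``
corresponds to the overall timeout and ``last_call`` to the extra waiting time
after ``min`` nodes is reached (the concrete durations are illustrative)::

   from datetime import timedelta

   from torch.distributed.elastic.rendezvous.dynamic_rendezvous import RendezvousTimeout

   # Fail the rendezvous if ``min`` nodes never arrive within 10 minutes, and
   # keep the barrier open for an extra 30 seconds once ``min`` is reached.
   timeout = RendezvousTimeout(
       join=timedelta(minutes=10),
       last_call=timedelta(seconds=30),
   )

Such a ``timeout`` object can then be passed as the ``timeout`` argument of
:py:meth:`.DynamicRendezvousHandler.from_backend`.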

**Exclusivity**:

A simple distributed barrier would not be sufficient, as we also need to ensure
that only one group of nodes exists at any given time (for a given job). In
other words, new nodes (i.e. joining late) should not be able to form a parallel
independent group of workers for the same job.

Torch Distributed Elastic rendezvous ensures that if a group of nodes has
already completed a rendezvous (and hence might already be training), then
additional "late" nodes attempting to rendezvous will only announce themselves
as waiting, and will have to wait until the (previously completed) existing
rendezvous is destroyed first.

**Consistency**:

When a rendezvous is completed, all its members will agree on the job membership
and everyone's role in it. This role is represented using an integer, called
rank, that is between 0 and world size.

Note that ranks are *not stable*, in the sense that the same node can be
assigned a different rank in the next (re-)rendezvous.
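
Concretely, each node learns its rank and the world size from the completed
rendezvous. A minimal sketch, assuming the variant of
:py:meth:`.RendezvousHandler.next_rendezvous` that returns a
``(store, rank, world_size)`` tuple::

   # ``rdzv_handler`` is any RendezvousHandler instance, e.g. the one
   # constructed in the DynamicRendezvousHandler example below.
   store, rank, world_size = rdzv_handler.next_rendezvous()

   # Ranks are integers in [0, world_size), but they are *not stable*: a
   # re-rendezvous may assign this node a different rank.
   assert 0 <= rank < world_size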

**Fault-tolerance**:

Torch Distributed Elastic rendezvous is designed to tolerate node failures
during the rendezvous process. Should a process crash (or lose network
connectivity, etc.) between joining the rendezvous and its completion, then
a re-rendezvous with the remaining healthy nodes will happen automatically.

A node can also fail *after* it has completed (or *has been observed* by other
nodes to have completed) the rendezvous - this scenario will be handled by the
Torch Distributed Elastic ``train_loop`` instead (where it will also trigger a
re-rendezvous).

**Shared key-value store**:

When the rendezvous is completed, a shared key-value store is created and
returned. This store implements a ``torch.distributed.Store`` API (see
`distributed communication docs
<https://pytorch.org/docs/stable/distributed.html>`__).

This store is only shared by the members of the completed rendezvous. It
is intended to be used by Torch Distributed Elastic to exchange information
necessary to initialize job control and data-planes.
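
A short sketch of how rendezvous members might use this store, relying only on
the ``set``/``get`` calls of the ``torch.distributed.Store`` API (the key name
and address are hypothetical)::

   # On the node that was assigned rank 0:
   store.set("master_addr", "10.0.0.1")

   # On every other member of the same rendezvous; ``get`` blocks until the
   # key has been set (subject to the store's timeout) and returns bytes.
   master_addr = store.get("master_addr").decode()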

**Waiting workers and rendezvous closing**:

The Torch Distributed Elastic rendezvous handler object provides additional
functionalities which are technically not part of the rendezvous process
(see the sketch after this list):

1. Querying how many workers arrived late at the barrier and can participate in
   the *next* rendezvous.

2. Setting the rendezvous *closed* to signal all nodes not to participate in
   the next rendezvous.
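
Both operations are exposed as methods on :py:class:`.RendezvousHandler`. A
minimal sketch, assuming a handler instance ``rdzv_handler``::

   # 1. Number of nodes that arrived late at the barrier and are waiting for
   #    the *next* rendezvous.
   num_waiting = rdzv_handler.num_nodes_waiting()

   # 2. Mark the rendezvous closed so that nodes do not participate in a
   #    next rendezvous for this job.
   rdzv_handler.set_closed()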

**DynamicRendezvousHandler**:

Torch Distributed Elastic comes with the :py:class:`.DynamicRendezvousHandler`
class that implements the rendezvous mechanism described above. It is a
backend-agnostic type that expects a particular :py:class:`.RendezvousBackend`
instance to be specified during construction.

Torch distributed users can either implement their own backend type or use one
of the following implementations that come with PyTorch:

- :py:class:`.C10dRendezvousBackend`: Uses a C10d store (by default
  ``TCPStore``) as the rendezvous backend. The main advantage of using a C10d
  store is that it requires no 3rd-party dependency (such as etcd) to establish
  a rendezvous.

- :py:class:`.EtcdRendezvousBackend`: Supersedes the legacy
  :py:class:`.EtcdRendezvousHandler` class. Passing an
  :py:class:`.EtcdRendezvousBackend` instance to
  :py:class:`.DynamicRendezvousHandler` is functionally equivalent to
  instantiating an :py:class:`.EtcdRendezvousHandler`.

Below is an example of using the C10d backend. Note that ``TCPStore`` requires
a port and exactly one node acting as the store host::

   from torch.distributed import TCPStore
   from torch.distributed.elastic.rendezvous.c10d_rendezvous_backend import C10dRendezvousBackend
   from torch.distributed.elastic.rendezvous.dynamic_rendezvous import DynamicRendezvousHandler

   # Port 29500 is an arbitrary choice; pass is_master=True on exactly one
   # node so that it hosts the store.
   store = TCPStore("localhost", 29500, is_master=True)

   backend = C10dRendezvousBackend(store, "my_run_id")

   rdzv_handler = DynamicRendezvousHandler.from_backend(
       run_id="my_run_id",
       store=store,
       backend=backend,
       min_nodes=2,
       max_nodes=4,
   )
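
A handler created this way is backend-agnostic: swapping the C10d backend for
an :py:class:`.EtcdRendezvousBackend` (or a custom
:py:class:`.RendezvousBackend` implementation) only changes how the store and
backend are constructed; the rendezvous logic itself stays the same. When the
job is done, the handler should be released via
:py:meth:`.RendezvousHandler.shutdown`.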
""" | |

from .api import *  # noqa: F403
from .registry import _register_default_handlers

_register_default_handlers()

__all__ = [
    "RendezvousClosedError",
    "RendezvousConnectionError",
    "RendezvousError",
    "RendezvousHandler",
    "RendezvousHandlerCreator",
    "RendezvousHandlerRegistry",
    "RendezvousParameters",
    "RendezvousStateError",
    "RendezvousTimeoutError",
    "rendezvous_handler_registry",
]