r"""
Module ``torch.distributed.launch``.

``torch.distributed.launch`` is a module that spawns multiple distributed
training processes on each of the training nodes.

.. warning::

    This module is going to be deprecated in favor of :ref:`torchrun <launcher-api>`.

The utility can be used for single-node distributed training, in which one or
more processes per node will be spawned. The utility can be used for either
CPU training or GPU training. If the utility is used for GPU training, each
distributed process will operate on a single GPU, which can significantly
improve single-node training performance. It can also be used for multi-node
distributed training by spawning multiple processes on each node, improving
multi-node training performance as well. This is especially beneficial for
systems with multiple InfiniBand interfaces that have direct-GPU support,
since all of them can be utilized for aggregated communication bandwidth.

In both single-node and multi-node distributed training, this utility will
launch the given number of processes per node (``--nproc-per-node``). If used
for GPU training, this number needs to be less than or equal to the number of
GPUs on the current system (``nproc_per_node``), and each process will operate
on a single GPU from *GPU 0 to GPU (nproc_per_node - 1)*.
**How to use this module:**

1. Single-Node multi-process distributed training

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3 and all other
               arguments of your training script)

2. Multi-Node multi-process distributed training: (e.g. two nodes)

Node 1: *(IP: 192.168.1.1, and has a free port: 1234)*

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=0 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

Node 2:

::

    python -m torch.distributed.launch --nproc-per-node=NUM_GPUS_YOU_HAVE
               --nnodes=2 --node-rank=1 --master-addr="192.168.1.1"
               --master-port=1234 YOUR_TRAINING_SCRIPT.py (--arg1 --arg2 --arg3
               and all other arguments of your training script)

3. To look up what optional arguments this module offers:

::

    python -m torch.distributed.launch --help


**Important Notices:**

1. This utility and multi-process distributed (single-node or
multi-node) GPU training currently only achieves the best performance using
the NCCL distributed backend. Thus, the NCCL backend is the recommended
backend to use for GPU training.

2. In your training program, you must parse the command-line argument
``--local-rank=LOCAL_PROCESS_RANK``, which will be provided by this module.
If your training program uses GPUs, you should ensure that your code only
runs on the GPU device of LOCAL_PROCESS_RANK. This can be done by:

Parsing the local_rank argument

::

    >>> # xdoctest: +SKIP
    >>> import argparse
    >>> parser = argparse.ArgumentParser()
    >>> parser.add_argument("--local-rank", "--local_rank", type=int)
    >>> args = parser.parse_args()

Set your device to local rank using either

::

    >>> torch.cuda.set_device(args.local_rank)  # before your code runs

or

::

    >>> with torch.cuda.device(args.local_rank):
    ...     # your code to run
    ...     ...
.. versionchanged:: 2.0.0

    The launcher passes the ``--local-rank=<rank>`` argument to your script.
    From PyTorch 2.0.0 onwards, the dashed ``--local-rank`` is preferred over
    the previously used underscored ``--local_rank``.

    For backward compatibility, it may be necessary for users to handle both
    cases in their argument parsing code. This means including both
    ``"--local-rank"`` and ``"--local_rank"`` in the argument parser. If only
    ``"--local_rank"`` is provided, the launcher will trigger an error: "error:
    unrecognized arguments: --local-rank=<rank>". For training code that only
    supports PyTorch 2.0.0+, including ``"--local-rank"`` should be sufficient.
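A minimal, self-contained sketch of a parser that accepts both spellings
(the helper name ``parse_local_rank`` is illustrative, not part of PyTorch;
``argparse`` stores both option strings under the same ``local_rank``
destination):

```python
import argparse


def parse_local_rank(argv=None):
    # Accept both the dashed (PyTorch 2.0.0+) and underscored (legacy)
    # spellings; argparse derives the destination `local_rank` from the
    # first long option string, so both flags populate args.local_rank.
    parser = argparse.ArgumentParser()
    parser.add_argument("--local-rank", "--local_rank", type=int, default=0)
    # parse_known_args ignores the training script's other arguments.
    args, _unused = parser.parse_known_args(argv)
    return args.local_rank
```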

3. In your training program, you are supposed to call the following function
at the beginning to start the distributed backend. It is strongly recommended
that ``init_method=env://``. Other init methods (e.g. ``tcp://``) may work,
but ``env://`` is the one that is officially supported by this module.

::

    >>> torch.distributed.init_process_group(backend='YOUR BACKEND',
    ...                                      init_method='env://')

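With ``init_method='env://'``, the rendezvous information is read from the
environment variables ``MASTER_ADDR``, ``MASTER_PORT``, ``RANK``, and
``WORLD_SIZE``, which this launcher sets for each subprocess. A small sketch
of reading them (the helper name ``read_rendezvous_env`` is illustrative, not
a PyTorch API):

```python
import os


def read_rendezvous_env():
    # These variables are set by the launcher; torch.distributed reads
    # the same ones when init_method='env://' is used.
    return {
        "master_addr": os.environ["MASTER_ADDR"],
        "master_port": int(os.environ["MASTER_PORT"]),
        "rank": int(os.environ["RANK"]),
        "world_size": int(os.environ["WORLD_SIZE"]),
    }
```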
4. In your training program, you can either use regular distributed functions
or use the :func:`torch.nn.parallel.DistributedDataParallel` module. If your
training program uses GPUs for training and you would like to use
:func:`torch.nn.parallel.DistributedDataParallel`, here is how to
configure it.

::

    >>> model = torch.nn.parallel.DistributedDataParallel(model,
    ...                                                   device_ids=[args.local_rank],
    ...                                                   output_device=args.local_rank)

Please ensure that the ``device_ids`` argument is set to the only GPU device
id that your code will be operating on. This is generally the local rank of
the process. In other words, ``device_ids`` needs to be ``[args.local_rank]``,
and ``output_device`` needs to be ``args.local_rank`` in order to use this
utility.

5. Another way to pass ``local_rank`` to the subprocesses is via the
environment variable ``LOCAL_RANK``. This behavior is enabled when you launch
the script with ``--use-env=True``. You must adjust the subprocess example
above to replace ``args.local_rank`` with ``os.environ['LOCAL_RANK']``; the
launcher will not pass ``--local-rank`` when you specify this flag.
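A sketch of reading the local rank from the environment when ``--use-env`` is
set (the helper name ``get_local_rank`` and its ``default`` fallback are
illustrative, not part of PyTorch):

```python
import os


def get_local_rank(default=0):
    # With --use-env, the launcher sets LOCAL_RANK instead of passing
    # --local-rank on the command line. Fall back to `default` so the
    # script can also run standalone, outside the launcher.
    return int(os.environ.get("LOCAL_RANK", default))
```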

.. warning::

    ``local_rank`` is NOT globally unique: it is only unique per process
    on a machine. Thus, don't use it to decide if you should, e.g.,
    write to a networked filesystem. See
    https://github.com/pytorch/pytorch/issues/12042 for an example of
    how things can go wrong if you don't do this correctly.
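One common pattern is to gate such writes on the *global* rank instead. A
sketch using the ``RANK`` environment variable that this launcher sets (the
helper name ``is_global_writer`` is illustrative; in a real program you would
typically call ``torch.distributed.get_rank()`` after initialization):

```python
import os


def is_global_writer():
    # Gate shared-filesystem writes on the global rank (RANK), not
    # local_rank: local_rank 0 exists on *every* node, but global rank 0
    # is unique across the whole job.
    return int(os.environ.get("RANK", 0)) == 0
```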

"""

from typing_extensions import deprecated as _deprecated

from torch.distributed.run import get_args_parser, run


def parse_args(args):
    parser = get_args_parser()
    parser.add_argument(
        "--use-env",
        "--use_env",
        default=False,
        action="store_true",
        help="Use environment variable to pass "
        "'local rank'. For legacy reasons, the default value is False. "
        "If set to True, the script will not pass "
        "--local-rank as argument, and will instead set LOCAL_RANK.",
    )
    return parser.parse_args(args)


def launch(args):
    if args.no_python and not args.use_env:
        raise ValueError(
            "When using the '--no-python' flag, you must also set the '--use-env' flag."
        )
    run(args)


@_deprecated(
    "The module torch.distributed.launch is deprecated\n"
    "and will be removed in future. Use torchrun.\n"
    "Note that --use-env is set by default in torchrun.\n"
    "If your script expects `--local-rank` argument to be set, please\n"
    "change it to read from `os.environ['LOCAL_RANK']` instead. See \n"
    "https://pytorch.org/docs/stable/distributed.html#launch-utility for \n"
    "further instructions\n",
    category=FutureWarning,
)
def main(args=None):
    args = parse_args(args)
    launch(args)


if __name__ == "__main__":
    main()