| :orphan: |
|
|
| .. _usage_launch_distributed: |
|
|
| Launching Distributed Programs |
| ============================== |
|
|
| .. currentmodule:: mlx.core.distributed |
|
|
| Installing the MLX Python package provides a helper script ``mlx.launch`` that |
| can be used to run Python scripts distributed across several nodes. It can |
| launch them using either the MPI backend or the ring backend. See the |
| :doc:`distributed docs <distributed>` for the different backends. |
|
|
| Usage |
| ----- |
|
|
| The minimal usage example of ``mlx.launch`` is simply: |
|
|
| .. code:: shell |
|
|
| mlx.launch --hosts ip1,ip2 my_script.py |
|
|
| or, for testing on localhost: |
|
|
| .. code:: shell |
|
|
| mlx.launch -n 2 my_script.py |
|
|
| The ``mlx.launch`` command connects to each of the provided hosts and launches |
| the input script there. It monitors each of the launched processes and |
| terminates the rest if one of them fails unexpectedly or if ``mlx.launch`` |
| itself is terminated. It also forwards the output of each remote process to |
| stdout and stderr respectively. |
|
|
| Providing Hosts |
| ^^^^^^^^^^^^^^^^ |
|
|
| Hosts can be provided as command line arguments, as above, but the way to |
| fully define a list of hosts is a JSON hostfile. The hostfile has a simple |
| schema: it is a list of objects, each of which defines a host via a hostname |
| to ssh to and a list of IPs to use for communication. |
|
|
| .. code:: json |
|
|
| [ |
| {"ssh": "hostname1", "ips": ["123.123.1.1", "123.123.2.1"]}, |
| {"ssh": "hostname2", "ips": ["123.123.1.2", "123.123.2.2"]} |
| ] |
|
|
| You can use ``mlx.distributed_config --over ethernet`` to create a hostfile |
| with IPs corresponding to the ``en0`` interface. |
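
When the hosts are known ahead of time, a hostfile with the schema above can also be written from a short script. The following is only a sketch: the helper name ``write_hostfile`` and the output filename ``hosts.json`` are hypothetical, and the addresses are the placeholder values from the example above.

```python
import json
from pathlib import Path


def write_hostfile(path, hosts):
    """Write a JSON list of {"ssh": ..., "ips": [...]} objects to `path`.

    `hosts` is a list of (ssh_hostname, ips) pairs. This helper is
    hypothetical and not part of MLX.
    """
    entries = [{"ssh": ssh, "ips": list(ips)} for ssh, ips in hosts]
    Path(path).write_text(json.dumps(entries, indent=2) + "\n")
    return entries


if __name__ == "__main__":
    # Placeholder hostnames and addresses taken from the schema example.
    write_hostfile(
        "hosts.json",
        [
            ("hostname1", ["123.123.1.1", "123.123.2.1"]),
            ("hostname2", ["123.123.1.2", "123.123.2.2"]),
        ],
    )
```

The resulting file can then be passed to ``mlx.launch`` with ``--hostfile``.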
|
|
| Setting up Remote Hosts |
| ^^^^^^^^^^^^^^^^^^^^^^^^ |
|
|
| To launch the script on each host, we need to be able to connect to it via |
| ssh. Moreover, the input script and the Python binary need to be present on |
| each host at the same path. The following checklist is useful for debugging: |
|
|
| * ``ssh hostname`` works without asking for a password or host confirmation |
| * the Python binary is available on all hosts at the same path. You can use |
| ``mlx.launch --print-python`` to see what that path is. |
| * the script you want to run is available on all hosts at the same path |
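
The first item of the checklist can be automated. The sketch below probes each host with a non-interactive ssh call; ``BatchMode=yes`` makes ssh fail instead of prompting for a password or a host-key confirmation. The function names and the 5 second timeout are illustrative choices and are not part of ``mlx.launch``.

```python
import subprocess


def ssh_probe_command(host, timeout=5):
    """Build an ssh command that fails instead of prompting for input."""
    return [
        "ssh",
        "-o", "BatchMode=yes",              # never prompt for a password or confirmation
        "-o", f"ConnectTimeout={timeout}",  # give up on unreachable hosts
        host,
        "true",                             # no-op command run on the remote host
    ]


def unreachable_hosts(hosts, run=subprocess.run):
    """Return the hosts whose probe exits with a non-zero status."""
    failed = []
    for host in hosts:
        result = run(ssh_probe_command(host), capture_output=True)
        if result.returncode != 0:
            failed.append(host)
    return failed


# Example (hostnames are placeholders for the "ssh" entries of a hostfile):
#   unreachable_hosts(["hostname1", "hostname2"])
```

An empty result means every host accepted the connection without interaction; the remaining two checklist items still have to be verified separately.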
|
|
| .. _mpi_specifics: |
|
|
| MPI Specifics |
| ------------- |
|
|
| One can use MPI by passing ``--backend mpi`` to ``mlx.launch``. In that case, |
| ``mlx.launch`` is a thin wrapper over ``mpirun``. Moreover: |
|
|
| * The IPs in the hostfile are ignored |
| * The ssh connectivity requirement is stronger as every node needs to be able |
| to connect to every other node |
| * ``mpirun`` needs to be available on every node at the same path |
|
|
| Finally, one can pass arguments to ``mpirun`` using ``--mpi-arg``. For |
| instance, to choose a specific interface for the byte-transfer-layer of MPI, |
| we can call ``mlx.launch`` as follows: |
|
|
| .. code:: shell |
|
|
| mlx.launch --backend mpi --mpi-arg '--mca btl_tcp_if_include en0' --hostfile hosts.json my_script.py |
|
|
|
|
| .. _ring_specifics: |
|
|
| Ring Specifics |
| -------------- |
|
|
| The ring backend, which is also the default backend, can be explicitly selected |
| with the argument ``--backend ring``. The ring backend has some specific |
| requirements and arguments that differ from those of MPI: |
|
|
| * The argument ``--hosts`` only accepts IPs and not hostnames. If we need to |
| ssh to a hostname that does not correspond to the IP we want to bind to, we |
| have to provide a hostfile. |
| * ``--starting-port`` defines the port to bind to on the remote hosts. |
| Specifically, rank 0 will use this port for its first IP, and each |
| subsequent IP or rank will add 1 to this port. |
| * ``--connections-per-ip`` allows us to increase the number of connections |
| between neighboring nodes. This corresponds to ``--mca btl_tcp_links 2`` for |
| ``mpirun``. |
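
The ``--starting-port`` rule can be illustrated with a small calculation. The sketch below assumes one reading of that rule: a single counter that starts at the given port and increases by 1 for every IP of every rank, in hostfile order. It is not taken from the ``mlx.launch`` implementation, the function name ``assign_ports`` and the port 5000 are hypothetical, and the effect of ``--connections-per-ip`` is deliberately left out.

```python
def assign_ports(hosts, starting_port):
    """Return (rank, ip, port) triples under the assumed counting rule.

    `hosts` uses the hostfile schema: a list of {"ssh": ..., "ips": [...]}.
    Assumption: ports count up by 1 per IP, rank after rank.
    """
    port = starting_port
    assignment = []
    for rank, host in enumerate(hosts):
        for ip in host["ips"]:
            assignment.append((rank, ip, port))
            port += 1
    return assignment


if __name__ == "__main__":
    hosts = [
        {"ssh": "hostname1", "ips": ["123.123.1.1", "123.123.2.1"]},
        {"ssh": "hostname2", "ips": ["123.123.1.2", "123.123.2.2"]},
    ]
    for rank, ip, port in assign_ports(hosts, 5000):
        print(f"rank {rank}: {ip}:{port}")
```

Under this reading, the range of ports that must be free on the remote hosts grows with the total number of IPs listed in the hostfile.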
|
|