| Multinode Training |
| ================== |
|
|
| Last updated: 06/10/2025. |
|
|
| .. _wuxibin89: https://github.com/wuxibin89 |
|
|
| Author: `Xibin Wu <https://github.com/wuxibin89>`_, `Yusheng Su <https://yushengsu-thu.github.io/>`_. |
|
|
| Option 1: Launch Manually |
| ------------------------------ |
|
|
| Set up multinode ray cluster |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1. Start head node with ``ray start --head --dashboard-host=0.0.0.0``, there're 2 address you should care about: |
|
|
| - GCS address: ``ray start --address=<address>``, where worker node should connect to. |
| - Dashboard address: ``<address>:8265``, where you should submit job to the cluster. |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/head.png?raw=true |
|
|
| 2. Start worker node with ``ray start --address=<address>`` you get above. |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/worker.png?raw=true |
|
|
| 3. Now you should see the cluster have 2 nodes with ``ray status``. |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/status.png?raw=true |
|
|
| 4. Additionally, you can access dashboard in the browser with the address you get above. |
|
|
| *Firewall rules maybe need configure to access the dashboard, if there's any trouble, please contact your network administrator.* |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/overview.png?raw=true |
|
|
| Submit job to ray cluster |
| ~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1. Submit ray job to cluster with the dashboard address you get above. |
|
|
| .. code-block:: bash |
|
|
| ray job submit --address="http://127.0.0.1:8265" \ |
| --runtime-env=verl/trainer/runtime_env.yaml \ |
| --no-wait \ |
| -- \ |
| python3 -m verl.trainer.main_ppo \ |
| trainer.n_gpus_per_node=8 \ |
| trainer.nnodes=2 \ |
| ... |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/submit.png?raw=true |
|
|
| 2. Then you can check the job status with the following commands: |
|
|
| - ray job list: list all jobs submitted to the cluster. |
| - ray job logs <Submission ID>: query the logs of the job. |
| - ray job status <Submission ID>: query the status of the job. |
| - ray job stop <Submission ID>: request the job to be stopped. |
| - ray job list | grep submission_id | grep JobStatus | grep RUNNING | grep -oP 'raysubmit_[^'\''"]+' | head -n 1: get the latest job submission ID of the running job. |
| - ray job logs <Submission ID> --follow: added ``--follow`` parameter to ray job logs command to enable continuous log streaming. |
|
|
| 3. You can also access driver/task/actor logs in ``/tmp/ray/session_latest/logs/``, driver log is ``job-driver-raysubmit_<Submission ID>.log``. |
|
|
| 4. We strongly recommend you to view job detail from dashboard in multinode training, because it provide more structure way to view the job information. |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job.png?raw=true |
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/job_detail.png?raw=true |
|
|
| Option 2: Launch via SkyPilot on Kubernetes or clouds |
| ------------------------------------------------------ |
|
|
| .. note:: |
| Ready-to-use SkyPilot example configurations are available in the `examples/skypilot/ <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ directory: |
| |
| - ``verl-ppo.yaml`` - PPO training with GSM8K dataset |
| - ``verl-grpo.yaml`` - GRPO training with MATH dataset |
| - ``verl-multiturn-tools.yaml`` - Multi-turn tool usage training |
| |
| See the `SkyPilot examples README <https://github.com/volcengine/verl/tree/main/examples/skypilot>`_ for detailed usage instructions. |
|
|
| Step 1: Setup SkyPilot |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| SkyPilot can support different clouds, here we use GCP as example. `install skypilot <https://docs.skypilot.co/en/latest/getting-started/installation.html>`_ |
|
|
| .. code-block:: bash |
|
|
| conda create -y -n sky python=3.10 |
| conda activate sky |
| pip install "skypilot[gcp]" |
|
|
| conda install -c conda-forge google-cloud-sdk |
| gcloud init |
|
|
| # Run this if you don't have a credential file. |
| # This will generate ~/.config/gcloud/application_default_credentials.json. |
| gcloud auth application-default login |
|
|
| # Check if the GCP credential is correctly setup. |
| sky check gcp |
|
|
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/setup_skypilot.png?raw=true |
|
|
| Step 2: Prepare dataset |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| .. code-block:: bash |
|
|
| git clone https://github.com/volcengine/verl.git |
| cd examples/data_preprocess |
| python3 gsm8k.py --local_save_dir ~/data/gsm8k |
|
|
|
|
| Step 3: Submit a job with SkyPilot |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
| 1. Create a SkyPilot YAML ``verl-cluster.yml`` with the following content: |
|
|
| .. parsed-literal:: workdir: . will sync all the data in the current dir to the remote cluster. |
|
|
| .. code-block:: yaml |
|
|
| resources: |
| accelerators: L4:1 # every node has 1 L4 GPU |
| image_id: docker:verlai/verl:base-verl0.5-cu126-cudnn9.8-torch2.7.0-fa2.7.4 |
| memory: 64+ # every node has 64 GB memory |
| ports: 8265 # expose port for ray dashboard |
|
|
| num_nodes: 2 # cluster size |
|
|
| # --------------- Work Directory Synchronization (workdir) --------------- |
| # Defines the local working directory to be synchronized to the remote cluster. |
| # Here, '.' means synchronizing the directory where the sky submit command is currently run. |
| workdir: . |
|
|
| # --------------- (secrets) --------------- |
| secrets: |
| ## your wandb api key ## |
| WANDB_API_KEY: null |
|
|
| # --------------- File Mounts/Data Upload (file_mounts) --------------- |
| # If your dataset (gsm8k folder) is local, it needs to be uploaded to the remote cluster. |
| file_mounts: |
| # Remote path (relative to remote user's home directory): Local path |
| # /remote/dir1/file: /local/dir1/file |
| data/gsm8k: ~/data/gsm8k |
|
|
| # --------------- Environment Setup (setup) --------------- |
| # Commands run on each node of the remote cluster to set up the environment (e.g., install dependencies). These are run directly inside Docker. |
| setup: | |
| rm -rf verl |
| git clone https://github.com/volcengine/verl.git |
| cd verl |
| pip3 install -v -e .[vllm] |
|
|
| # --------------- Run Command (run) --------------- |
| # The actual task commands to be executed on the remote cluster. |
| # This script will first start the Ray cluster (different ray start commands are executed on Head and Worker nodes). |
| # Then, your training script will only be run on the Head node (SKYPILOT_NODE_RANK == 0). |
| run: | |
| # Get the Head node's IP and total number of nodes (environment variables injected by SkyPilot). |
| head_ip=`echo "$SKYPILOT_NODE_IPS" | head -n1` |
| num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l` # Here num_nodes should be equal to 2. |
|
|
| # login wandb |
| python3 -c "import wandb; wandb.login(relogin=True, key='$WANDB_API_KEY')" |
|
|
| # Start Ray based on node role (Head=0, Worker>0). |
| # This logic is a standard Ray cluster startup script. |
| if [ "$SKYPILOT_NODE_RANK" == "0" ]; then |
| # Head node starts Ray Head. |
| echo "Starting Ray head node..." |
| # Check if a Ray Head is already running to avoid duplicate starts. |
| ps aux | grep ray | grep 6379 &> /dev/null || ray start --head --disable-usage-stats \ |
| --port=6379 \ |
| --dashboard-host=0.0.0.0 \ |
| --dashboard-port=8265 |
|
|
| # Wait for all worker nodes to join the cluster. |
| while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do |
| echo "Waiting for all nodes to join... ($(ray nodes | grep NODE_ID | wc -l)/$num_nodes)" |
| sleep 5 |
| done |
|
|
| # Head node executes the training script. |
| echo "Executing training script on head node..." |
|
|
| python3 -m verl.trainer.main_ppo \ |
| data.train_files=data/gsm8k/train.parquet \ |
| data.val_files=data/gsm8k/test.parquet \ |
| data.train_batch_size=256 \ |
| data.max_prompt_length=512 \ |
| data.max_response_length=256 \ |
| actor_rollout_ref.model.path=Qwen/Qwen2.5-0.5B-Instruct \ |
| actor_rollout_ref.actor.optim.lr=1e-6 \ |
| actor_rollout_ref.actor.ppo_mini_batch_size=64 \ |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \ |
| actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \ |
| actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ |
| actor_rollout_ref.rollout.name=vllm \ |
| actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ |
| actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \ |
| critic.optim.lr=1e-5 \ |
| critic.model.path=Qwen/Qwen2.5-0.5B-Instruct \ |
| critic.ppo_micro_batch_size_per_gpu=4 \ |
| algorithm.kl_ctrl.kl_coef=0.001 \ |
| trainer.logger=['console','wandb'] \ |
| trainer.val_before_train=False \ |
| trainer.default_hdfs_dir=null \ |
| trainer.n_gpus_per_node=1 \ |
| trainer.nnodes=2 \ |
| trainer.save_freq=20 \ |
| trainer.test_freq=20 \ |
| trainer.total_epochs=2 \ |
| trainer.project_name=verl_examples \ |
| trainer.experiment_name=experiment_name_gsm8k |
|
|
| else |
| # Wait for Ray Head to start. |
| sleep 10 # Increase waiting time to ensure Head finishes starting. |
| # Worker node starts Ray Worker. |
| echo "Starting Ray worker node..." |
|
|
| # Check if a Ray Worker is already running to avoid duplicate starts. |
| ps aux | grep ray | grep $head_ip:6379 &> /dev/null || ray start --address $head_ip:6379 --disable-usage-stats |
|
|
| # Add sleep to after `ray start` to give ray enough time to daemonize |
| sleep 5 # Ensure Worker successfully connects to Head. |
| fi |
|
|
| # No commands are added to the Worker node here; the Worker's main task is to start Ray and wait for the Head node to assign tasks. |
| echo "Node setup and Ray start script finished for rank $SKYPILOT_NODE_RANK." |
|
|
|
|
| .. code-block:: bash |
|
|
| export WANDB_API_KEY=<your-wandb-api-key> |
| sky launch -c verl --secret WANDB_API_KEY verl-cluster.yml |
|
|
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/running_job.png?raw=true |
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/running_job_1.png?raw=true |
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/finished.png?raw=true |
|
|
| **Check the cluster on GCP** |
|
|
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/gcp_instances.png?raw=true |
|
|
| **Check Ray Dashboard** |
|
|
| We can see the cluster on the RAY Dashboard with the GCP head node: |
|
|
| ```console |
| $ sky status --endpoint 8265 verl |
| 1.2.3.4:8265 |
| ``` |
|
|
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/ray_dashboard_overview.png?raw=true |
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/ray_dashboard_jobs.png?raw=true |
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/ray_dashboard_cluster.png?raw=true |
|
|
|
|
| **Check the checkpoint of model** |
|
|
| .. code-block:: bash |
|
|
| # login the head node |
| ssh verl |
| # The global step will vary. Find the correct path from the training logs. |
| cd ~/sky_workdir/checkpoints/verl_examples/gsm8k/ |
| # Then list contents to find the checkpoint, e.g.: |
| ls -R . |
|
|
| .. image:: https://github.com/yottalabsai/open-source/blob/main/static/verl/saved_model.png?raw=true |
|
|
|
|
| Option 3: Launch via Slurm |
| ------------------------------ |
|
|
| Ray provides users with `this <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ official |
| tutorial to start a Ray cluster on top of Slurm. We have verified the :doc:`GSM8K example<../examples/gsm8k_example>` |
| on a Slurm cluster under a multi-node setting with the following steps. |
|
|
| 1. [Optional] If your cluster support `Apptainer or Singularity <https://apptainer.org/docs/user/main/>`_ and you wish |
| to use it, convert verl's Docker image to an Apptainer image. Alternatively, set up the environment with the package |
| manager available on your cluster or use other container runtimes (e.g. through `Slurm's OCI support <https://slurm.schedmd.com/containers.html>`_) available to you. |
|
|
| .. code:: bash |
|
|
| apptainer pull /your/dest/dir/vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3.sif docker://verlai/verl:vemlp-th2.4.0-cu124-vllm0.6.3-ray2.10-te1.7-v0.0.3 |
|
|
| 2. Follow :doc:`GSM8K example<../examples/gsm8k_example>` to prepare the dataset and model checkpoints. |
|
|
| 3. Modify `examples/slurm/ray_on_slurm.slurm <https://github.com/volcengine/verl/blob/main/examples/slurm/ray_on_slurm.slurm>`_ with your cluster's own information. |
|
|
| 4. Submit the job script to the Slurm cluster with `sbatch`. |
|
|
| Please note that Slurm cluster setup may vary. If you encounter any issues, please refer to Ray's |
| `Slurm user guide <https://docs.ray.io/en/latest/cluster/vms/user-guides/community/slurm.html>`_ for common caveats. |
|
|
| If you changed Slurm resource specifications, please make sure to update the environment variables in the job script if necessary. |
|
|
|
|
| Option 4: Launch via dstack |
| ------------------------------ |
|
|
| `dstackai/dstack <https://github.com/dstackai/dstack>`_ is an open-source container orchestrator that simplifies distributed training across cloud providers and on-premises environments |
| without the need to use K8S or Slurm. |
|
|
| Prerequisite |
| ~~~~~~~~~~~~ |
| Once dstack is `installed <https://dstack.ai/docs/installation>`_, initialize the directory as a repo with ``dstack init``. |
|
|
| .. code-block:: bash |
|
|
| mkdir myproject && cd myproject |
| dstack init |
|
|
| **Create a fleet** |
|
|
| Before submitting distributed training jobs, create a `dstack` `fleet <https://dstack.ai/docs/concepts/fleets>`_. |
|
|
| Run a Ray cluster task |
| ~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| Once the fleet is created, define a Ray cluster task, e.g. in ``ray-cluster.dstack.yml``: |
|
|
| .. code-block:: yaml |
|
|
| type: task |
| name: ray-verl-cluster |
|
|
| nodes: 2 |
|
|
| env: |
| - WANDB_API_KEY |
| - PYTHONUNBUFFERED=1 |
| - CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| |
| image: verlai/verl:app-verl0.6-transformers4.56.1-sglang0.5.2-mcore0.13.0-te2.2 |
| commands: |
| - git clone https://github.com/volcengine/verl |
| - cd verl |
| - pip install --no-deps -e . |
| - pip install hf_transfer hf_xet |
| - | |
| if [ $DSTACK_NODE_RANK = 0 ]; then |
| python3 examples/data_preprocess/gsm8k.py --local_save_dir ~/data/gsm8k |
| python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2.5-7B-Instruct')" |
| ray start --head --port=6379; |
| else |
| ray start --address=$DSTACK_MASTER_NODE_IP:6379 |
| fi |
|
|
| # Expose Ray dashboard port |
| ports: |
| - 8265 |
|
|
| resources: |
| gpu: 80GB:8 |
| shm_size: 128GB |
|
|
| # Save checkpoints on the instance |
| volumes: |
| - /checkpoints:/checkpoints |
|
|
| Now, if you run this task via `dstack apply`, it will automatically forward the Ray's dashboard port to `localhost:8265`. |
|
|
| .. code-block:: bash |
|
|
| dstack apply -f ray-cluster.dstack.yml |
|
|
| As long as the `dstack apply` is attached, you can use `localhost:8265` to submit Ray jobs for execution |
|
|
| Submit Ray jobs |
| ~~~~~~~~~~~~~~~ |
|
|
| Before you can submit Ray jobs, ensure to install `ray` locally: |
| |
| .. code-block:: shell |
|
|
| pip install ray |
|
|
| Now you can submit the training job to the Ray cluster which is available at ``localhost:8265``: |
| |
| .. code-block:: shell |
|
|
| $ RAY_ADDRESS=http://localhost:8265 |
| $ ray job submit \ |
| -- python3 -m verl.trainer.main_ppo \ |
| data.train_files=/root/data/gsm8k/train.parquet \ |
| data.val_files=/root/data/gsm8k/test.parquet \ |
| data.train_batch_size=256 \ |
| data.max_prompt_length=512 \ |
| data.max_response_length=256 \ |
| actor_rollout_ref.model.path=Qwen/Qwen2.5-7B-Instruct \ |
| actor_rollout_ref.actor.optim.lr=1e-6 \ |
| actor_rollout_ref.actor.ppo_mini_batch_size=64 \ |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=4 \ |
| actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=8 \ |
| actor_rollout_ref.rollout.tensor_model_parallel_size=1 \ |
| actor_rollout_ref.rollout.gpu_memory_utilization=0.4 \ |
| actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=4 \ |
| critic.optim.lr=1e-5 \ |
| critic.model.path=Qwen/Qwen2.5-7B-Instruct \ |
| critic.ppo_micro_batch_size_per_gpu=4 \ |
| algorithm.kl_ctrl.kl_coef=0.001 \ |
| trainer.project_name=ppo_training \ |
| trainer.experiment_name=qwen-2.5-7B \ |
| trainer.val_before_train=False \ |
| trainer.n_gpus_per_node=8 \ |
| trainer.nnodes=2 \ |
| trainer.default_local_dir=/checkpoints \ |
| trainer.save_freq=10 \ |
| trainer.test_freq=10 \ |
| trainer.total_epochs=15 2>&1 | tee verl_demo.log \ |
| trainer.resume_mode=disable |
|
|
|
|
| For more details on how `dstack` works, check out its `documentation <https://dstack.ai/docs>`_. |
|
|
| How to debug? |
| --------------------- |
|
|
|
|
| Ray Distributed Debugger VSCode Extension (Recommended) |
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ |
|
|
| 1. Starting with Ray 2.39, Anyscale has introduced the `Ray Distributed Debugger <https://docs.ray.io/en/latest/ray-observability/ray-distributed-debugger.html>`_ VSCode extension. Follow the extension’s installation instructions, then add your cluster using the dashboard URL you obtained earlier. |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/debugger.png?raw=true |
| :alt: Ray Distributed Debugger VSCode extension screenshot |
|
|
| 2. Prerequisites. |
|
|
| Ensure the following are installed (see the extension README for more detail): |
|
|
| - Visual Studio Code |
| - `ray[default]` >= 2.9.1 |
| - `debugpy` >= 1.8.0 |
|
|
| .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/c7098b755ff689859837773a916c857.png?raw=true |
| :alt: VSCode with Ray prerequisites |
|
|
| 3. Environment Variables. |
|
|
| To enable post‑mortem debugging, set: |
|
|
| .. code-block:: bash |
|
|
| export RAY_DEBUG_POST_MORTEM=1 |
|
|
| .. admonition:: Note |
| :class: important |
|
|
| Be sure to remove any legacy flags before starting Ray: |
|
|
| - `RAY_DEBUG=legacy` |
| - `--ray-debugger-external` |
|
|
| 4. Configuring BreakpointsSet up breakpoint() in your code, and submit job to cluster. Then the extension will show the breakpoint information. |
|
|
|
|
| 1. Insert `breakpoint()` calls into your remote functions. |
| 2. Submit your job to the cluster. |
|
|
| The extension will detect active breakpoints and display them in VSCode. |
|
|
| .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true |
| :alt: Detected breakpoint in VSCode |
|
|
| **Note:** Breakpoints are only supported inside functions decorated with `@ray.remote`. |
|
|
| 5. Launching the Debugger. |
|
|
| Run your job directly from the command line (do not use a `launch.json`): |
|
|
| .. code-block:: bash |
|
|
| python job.py |
|
|
| 6. Attaching to a Breakpoint. |
|
|
| Once the process hits the first `breakpoint()`, click the Ray Distributed Debugger icon in the VSCode sidebar to attach the debugger. |
|
|
| .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/4ddad74395c79a1402331c0ce73316f.png?raw=true |
| :alt: Attaching VSCode debugger to Ray process |
|
|
| 7. Debugging With Multiple breakpoint(). |
|
|
| For each subsequent task, first disconnect the current debugger session, then click the extension icon again to attach to the next breakpoint. |
|
|
| .. image:: https://github.com/aoshen524/verl/blob/main/docs/start/6e83c910a62c82fecb89c6619e001cd.png?raw=true |
| :alt: Disconnecting and reconnecting the debugger |
|
|
| Legacy Ray Debugger |
| ~~~~~~~~~~~~~~~~~~~ |
| 1. Ray has a builtin legacy `debugger <https://docs.ray.io/en/latest/ray-observability/user-guides/debug-apps/ray-debugging.html>`_ that allows you to debug your distributed applications. To enable debugger, start ray cluster with ``RAY_DEBUG=legacy`` and ``--ray-debugger-external``. |
|
|
| .. code-block:: bash |
|
|
| # start head node |
| RAY_DEBUG=legacy ray start --head --dashboard-host=0.0.0.0 --ray-debugger-external |
| # start worker node |
| RAY_DEBUG=legacy ray start --address='10.124.46.192:6379' --ray-debugger-external |
|
|
| 2. Set up breakpoint in your code, and submit job to cluster. Then run ``ray debug`` to wait breakpoint: |
|
|
| .. image:: https://github.com/eric-haibin-lin/verl-community/blob/main/docs/ray/legacy.png?raw=true |
|
|
|
|
| Multi-node training on AMD clusters |
| --------------------------------------------------------------------------------------- |
|
|
| If you want to run multi-node training with slurm with Docker/Podman container on AMD Cluster, you can use the following script. |
|
|
| If you encounter any issues in using AMD GPUs running verl, please contact `Yusheng Su <https://yushengsu-thu.github.io/>`_. |
|
|
| .. note:: |
| 1. You need to use ``podman`` or ``docker`` in the following script. We will release the apptainer script later. |
| 2. If you want to use ``podman``, you just replace ``docker`` with ``podman`` in the following script. |
|
|
| The script includes the following steps: |
|
|
| 1. SLURM Configuration |
| 2. Environment Setup |
| 3. Docker/Podman Container Setup |
| 4. Ray Cluster Initialization |
| 5. Data Preprocessing |
| 6. Model Setup |
| 7. Training Launch |
|
|
|
|
| slurm_script.sh |
| ~~~~~~~~~~~~~~~~~~~~ |
|
|
| .. code-block:: bash |
|
|
| #!/bin/bash |
|
|
| #SBATCH --job-name=verl-ray-on-slurm |
| #SBATCH --nodes=2 |
| #SBATCH --ntasks-per-node=2 |
| #SBATCH --mem=200G |
| #SBATCH --time=30-00:00:00 |
| #SBATCH --gpus-per-node=8 |
| #SBATCH --cpus-per-task=28 |
| #SBATCH --output=../verl_log/slurm-%j.out |
| #SBATCH --error=../verl_log/slurm-%j.err |
| #SBATCH --nodelist=gpu-[0,1] |
|
|
|
|
| # load necessary modules |
| ### Run this setup |
| # [Cluster]: Use docker |
| # docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 |
|
|
|
|
| ########################################################################## |
| ###The following setting should be set in different project and cluster### |
| ########################################################################## |
|
|
| ### Project |
| CONTAINER_NAME="multinode_verl_training" |
| IMG="verl.rocm" |
| DOCKERFILE="docker/Dockerfile.rocm" |
| # echo $PWD |
| verl_workdir="${HOME}/projects/verl_upstream" |
| export TRANSFORMERS_CACHE="${HOME}/.cache/huggingface" |
| export HF_HOME=$TRANSFORMERS_CACHE |
|
|
| ### Cluster Network Setting |
| export NCCL_DEBUG=TRACE |
| export GPU_MAX_HW_QUEUES=2 |
| export TORCH_NCCL_HIGH_PRIORITY=1 |
| export NCCL_CHECKS_DISABLE=1 |
| # export NCCL_IB_HCA=rdma0,rdma1,rdma2,rdma3,rdma4,rdma5,rdma6,rdma7 |
| export NCCL_IB_HCA=mlx5_0,mlx5_1,mlx5_2,mlx5_3,mlx5_4,mlx5_5,mlx5_8,mlx5_9 |
| export NCCL_IB_GID_INDEX=3 |
| export NCCL_CROSS_NIC=0 |
| export CUDA_DEVICE_MAX_CONNECTIONS=1 |
| export NCCL_PROTO=Simple |
| export RCCL_MSCCL_ENABLE=0 |
| export TOKENIZERS_PARALLELISM=false |
| export HSA_NO_SCRATCH_RECLAIM=1 |
| ########################################################################## |
|
|
| ### For rocm and training script |
| export HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 |
| export ROCR_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES |
| export CUDA_VISIBLE_DEVICES=$HIP_VISIBLE_DEVICES |
|
|
|
|
| # Build and launch the Docker container |
| srun bash -c " |
| # Exit on any error |
| set -e |
|
|
| # Clean up dangling images (images with <none> tag) |
| docker image prune -f |
|
|
| # Need to pull the docker first |
| docker pull docker.io/rocm/vllm:rocm6.2_mi300_ubuntu20.04_py3.9_vllm_0.6.4 |
| |
| if ! docker images --format "{{.Repository}}:{{.Tag}}" | grep -q "${IMG}"; then |
| echo \"Building ${IMG} image...\" |
| docker build -f \"${DOCKERFILE}\" -t \"${IMG}\" . |
| else |
| echo \"${IMG} image already exists, skipping build\" |
| fi |
|
|
| # Removing old container if exists |
| docker rm \"${CONTAINER_NAME}\" 2>/dev/null || true |
|
|
| # Checking network devices |
| ibdev2netdev |
|
|
| # Launch the docker |
| docker run --rm -d \ |
| -e HYDRA_FULL_ERROR=1 \ |
| -e HIP_VISIBLE_DEVICES=${HIP_VISIBLE_DEVICES} \ |
| -e ROCR_VISIBLE_DEVICES=${ROCR_VISIBLE_DEVICES} \ |
| -e CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES} \ |
| -e NCCL_DEBUG=${NCCL_DEBUG} \ |
| -e GPU_MAX_HW_QUEUES=${GPU_MAX_HW_QUEUES} \ |
| -e TORCH_NCCL_HIGH_PRIORITY=${TORCH_NCCL_HIGH_PRIORITY} \ |
| -e NCCL_CHECKS_DISABLE=${NCCL_CHECKS_DISABLE} \ |
| -e NCCL_IB_HCA=${NCCL_IB_HCA} \ |
| -e NCCL_IB_GID_INDEX=${NCCL_IB_GID_INDEX} \ |
| -e NCCL_CROSS_NIC=${NCCL_CROSS_NIC} \ |
| -e CUDA_DEVICE_MAX_CONNECTIONS=${CUDA_DEVICE_MAX_CONNECTIONS} \ |
| -e NCCL_PROTO=${NCCL_PROTO} \ |
| -e RCCL_MSCCL_ENABLE=${RCCL_MSCCL_ENABLE} \ |
| -e TOKENIZERS_PARALLELISM=${TOKENIZERS_PARALLELISM} \ |
| -e HSA_NO_SCRATCH_RECLAIM=${HSA_NO_SCRATCH_RECLAIM} \ |
| -e TRANSFORMERS_CACHE=${TRANSFORMERS_CACHE} \ |
| -e HF_HOME=${HF_HOME} \ |
| --network host \ |
| --device /dev/dri \ |
| --device /dev/kfd \ |
| --device /dev/infiniband \ |
| --group-add video \ |
| --cap-add SYS_PTRACE \ |
| --security-opt seccomp=unconfined \ |
| --privileged \ |
| -v \${HOME}:\${HOME} \ |
| -v \${HOME}/.ssh:/root/.ssh \ |
| -w "${verl_workdir}" \ |
| --shm-size 128G \ |
| --name \"${CONTAINER_NAME}\" \ |
| \"${IMG}\" \ |
| tail -f /dev/null |
|
|
| echo \"Container setup completed\" |
| " |
| # (Optional): If you do not want to root mode and require assign yuorself as the user |
| # Please add `-e HOST_UID=$(id -u)` and `-e HOST_GID=$(id -g)` into the above docker launch script. |
|
|
|
|
|
|
|
|
|
|
| ### Ray launch the nodes before training |
|
|
| # Getting the node names |
| nodes_array=($(scontrol show hostnames "$SLURM_JOB_NODELIST" | tr '\n' ' ')) |
|
|
| head_node=${nodes_array[0]} |
| head_node_ip=$(srun --nodes=1 --ntasks=1 -w "$head_node" hostname --ip-address) |
|
|
| # if we detect a space character in the head node IP, we'll |
| # convert it to an ipv4 address. This step is optional. |
| if [[ "$head_node_ip" == *" "* ]]; then |
| IFS=' ' read -ra ADDR <<<"$head_node_ip" |
| if [[ ${#ADDR[0]} -gt 16 ]]; then |
| head_node_ip=${ADDR[1]} |
| else |
| head_node_ip=${ADDR[0]} |
| fi |
| echo "IPV6 address detected. We split the IPV4 address as $head_node_ip" |
| fi |
|
|
| port=6379 |
| ip_head=$head_node_ip:$port |
| export ip_head |
| echo "IP Head: $ip_head" |
|
|
| # make sure we set environment variables before Ray initialization |
|
|
| # Print out all env variables |
| printenv |
|
|
| echo "Starting HEAD at $head_node" |
| srun --nodes=1 --ntasks=1 -w "$head_node" \ |
| docker exec "${CONTAINER_NAME}" \ |
| ray start --head --node-ip-address="$head_node_ip" --port=$port \ |
| --dashboard-port=8266 \ |
| --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block & |
| # optional, though may be useful in certain versions of Ray < 1.0. |
| sleep 10 |
|
|
| # number of nodes other than the head node |
| worker_num=$((SLURM_JOB_NUM_NODES - 1)) |
|
|
| for ((i = 1; i <= worker_num; i++)); do |
| node_i=${nodes_array[$i]} |
| echo "Debug: Starting worker on node_i = ${node_i}" |
| if [ -z "$node_i" ]; then |
| echo "Error: Empty node name for worker $i" |
| continue |
| fi |
| echo "Starting WORKER $i at $node_i" |
| srun --nodes=1 --ntasks=1 -w "$node_i" \ |
| docker exec "${CONTAINER_NAME}" \ |
| ray start --address "$ip_head" --num-cpus "${SLURM_CPUS_PER_TASK}" --num-gpus "${SLURM_GPUS_PER_NODE}" --block & |
| sleep 5 |
| done |
|
|
|
|
|
|
|
|
| # Ray initlization test (See whether any error in the above execution) |
| echo "Testing Ray initialization in the slurm nodes..." |
| docker exec "${CONTAINER_NAME}" python3 -c ' |
| import ray |
| try: |
| ray.init(address="auto") |
| print("\n=== Ray Cluster Status ===") |
| print(f"Number of nodes: {len(ray.nodes())}") |
| for node in ray.nodes(): |
| print("Node: {}, Status: {}".format(node["NodeManagerHostname"], node["Alive"])) |
| # print(f"Node: {node}") |
| ray.shutdown() |
| print("Ray initialization successful!") |
| except Exception as e: |
| print(f"Ray initialization failed: {str(e)}") |
| ' |
| echo "=== Ray test completed ===" |
| ###### |
|
|
|
|
|
|
| # Run data preprocessing |
|
|
| echo "Starting data preprocessing..." |
| docker exec "${CONTAINER_NAME}" \ |
| python3 "examples/data_preprocess/gsm8k.py" "--local_save_dir" "../data/gsm8k" |
|
|
| echo "Starting data preprocessing..." |
| docker exec "${CONTAINER_NAME}" \ |
| python3 "examples/data_preprocess/math_dataset.py" "--local_dir" "../data/math" |
|
|
| train_files="../data/gsm8k/train.parquet" |
| val_files="../data/gsm8k/test.parquet" |
|
|
| # Download and test model |
| echo "Loading model..." |
| docker exec "${CONTAINER_NAME}" \ |
| python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')" |
| MODEL_PATH="Qwen/Qwen2-7B-Instruct" |
|
|
| # Set model path after pipeline test |
| MODEL_PATH="Qwen/Qwen2.5-0.5B-Instruct" |
|
|
| echo "== Data and model loading Done ==" |
|
|
| echo "Start to train..." |
|
|
| docker exec "${CONTAINER_NAME}" \ |
| python3 -c "import transformers; transformers.pipeline('text-generation', model='Qwen/Qwen2-7B-Instruct')" |
| MODEL_PATH="Qwen/Qwen2-7B-Instruct" |
|
|
|
|
| PYTHONUNBUFFERED=1 srun --overlap --nodes=${SLURM_NNODES} --ntasks=1 -w "$head_node" \ |
| docker exec "${CONTAINER_NAME}" \ |
| python3 -m verl.trainer.main_ppo \ |
| data.train_files=$train_files \ |
| data.val_files=$val_files \ |
| data.train_batch_size=1024 \ |
| data.max_prompt_length=1024 \ |
| data.max_response_length=1024 \ |
| actor_rollout_ref.model.path=$MODEL_PATH \ |
| actor_rollout_ref.model.enable_gradient_checkpointing=False \ |
| actor_rollout_ref.actor.optim.lr=1e-6 \ |
| actor_rollout_ref.model.use_remove_padding=True \ |
| actor_rollout_ref.actor.ppo_mini_batch_size=256 \ |
| actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=8 \ |
| actor_rollout_ref.model.enable_gradient_checkpointing=True \ |
| actor_rollout_ref.actor.fsdp_config.param_offload=False \ |
| actor_rollout_ref.actor.fsdp_config.optimizer_offload=False \ |
| actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=16 \ |
| actor_rollout_ref.rollout.tensor_model_parallel_size=2 \ |
| actor_rollout_ref.rollout.name=vllm \ |
| actor_rollout_ref.rollout.gpu_memory_utilization=0.9 \ |
| actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=16 \ |
| actor_rollout_ref.ref.fsdp_config.param_offload=True \ |
| critic.optim.lr=1e-5 \ |
| critic.model.use_remove_padding=True \ |
| critic.model.path=$MODEL_PATH \ |
| critic.model.enable_gradient_checkpointing=False \ |
| critic.ppo_micro_batch_size_per_gpu=8 \ |
| critic.model.fsdp_config.param_offload=False \ |
| critic.model.fsdp_config.optimizer_offload=False \ |
| algorithm.kl_ctrl.kl_coef=0.0001 \ |
| trainer.critic_warmup=0 \ |
| trainer.logger='["console","wandb"]' \ |
| trainer.project_name='verl_example' \ |
| trainer.experiment_name='Qwen2.5-32B-Instruct_function_rm' \ |
| trainer.n_gpus_per_node=${SLURM_GPUS_PER_NODE} \ |
| trainer.val_before_train=False \ |
| trainer.nnodes=${SLURM_NNODES} \ |
| trainer.save_freq=-1 \ |
| trainer.test_freq=10 \ |
| trainer.total_epochs=15 |
|
|
|
|
| Run multi-node training with above slurm_script.sh |
| ~~~~~~~~~~~~~~~~~~~~ |
| Just sbatch your slurm_script.sh |
|
|
| .. code-block:: bash |
|
|
| sbatch slurm_script.sh |
|
|
|
|