Running benchmarks on multiple GPU nodes with Pegasus

Pegasus is an SSH-based multi-node command runner. Different models have different verbosity, and benchmarking takes vastly different amounts of time. Therefore, we want an automated piece of software that drains a queue of benchmarking jobs (one job per model) on a set of GPUs.

Setup

Install Pegasus

Pegasus needs to keep SSH connections with all the nodes in order to queue up and run jobs over SSH. So you should install and run Pegasus on a computer that you can keep awake.

If you already have Rust set up:

$ cargo install pegasus-ssh

Otherwise, you can set up Rust here, or just download Pegasus release binaries here.

Necessary setup for each node

Every node must have two things:

This repository cloned under ~/workspace/leaderboard.

If you want a different path, search and replace in spawn-containers.yaml.

Model weights under /data/leaderboard/weights.

If you want a different path, search and replace in setupspawn-containers.yaml and benchmark.yaml.

Specify node names for Pegasus

Modify hosts.yaml with nodes. See the file for an example.

hostname: List the hostnames you would use in order to ssh into the node, e.g. jaywonchung@gpunode01.
gpu: We want to create one Docker container for each GPU. List the indices of the GPUs you would like to use for the hosts.

Set up Docker containers on your nodes with Pegasus

This spawns one container per GPU (named leaderboard%d), for every node.

$ cd pegasus
$ cp spawn-containers.yaml queue.yaml
$ pegasus b

b stands for broadcast. Every command is run once on all (hostname, gpu) combinations.

Benchmark

Now use Pegasus to run benchmarks for all the models across all nodes.

$ cd pegasus
$ cp benchmark.yaml queue.yaml
$ pegasus q

q stands for queue. Each command is run once on the next available (hostname, gpu) combination.