Accelerate documentation

Launching Multi-GPU Training from a Jupyter Environment

You are viewing v0.28.0 version. A newer version v1.2.1 is available.
Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Launching Multi-GPU Training from a Jupyter Environment

This tutorial teaches you how to fine tune a computer vision model with 🤗 Accelerate from a Jupyter Notebook on a distributed system. You will also learn how to setup a few requirements needed for ensuring your environment is configured properly, your data has been prepared properly, and finally how to launch training.

This tutorial is also available as a Jupyter Notebook here

Configuring the Environment

Before any training can be performed, a 🤗 Accelerate config file must exist in the system. Usually this can be done by running the following in a terminal and answering the prompts:

accelerate config

However, if general defaults are fine and you are not running on a TPU, 🤗Accelerate has a utility to quickly write your GPU configuration into a config file via utils.write_basic_config().

The following code will restart Jupyter after writing the configuration, as CUDA code was called to perform this.

CUDA can’t be initialized more than once on a multi-GPU system. It’s fine to debug in the notebook and have calls to CUDA, but in order to finally train a full cleanup and restart will need to be performed.

import os
from accelerate.utils import write_basic_config

write_basic_config()  # Write a config file
os._exit(00)  # Restart the notebook

Preparing the Dataset and Model

Next you should prepare your dataset. As mentioned at earlier, great care should be taken when preparing the DataLoaders and model to make sure that nothing is put on any GPU.

If you do, it is recommended to put that specific code into a function and call that from within the notebook launcher interface, which will be shown later.

Make sure the dataset is downloaded based on the directions here

import os, re, torch, PIL
import numpy as np

from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import DataLoader, Dataset
from torchvision.transforms import Compose, RandomResizedCrop, Resize, ToTensor

from accelerate import Accelerator
from accelerate.utils import set_seed
from timm import create_model

First you need to create a function to extract the class name based on a filename:

import os

data_dir = "../../images"
fnames = os.listdir(data_dir)
fname = fnames[0]
print(fname)
beagle_32.jpg

In the case here, the label is beagle. Using regex you can extract the label from the filename:

import re


def extract_label(fname):
    stem = fname.split(os.path.sep)[-1]
    return re.search(r"^(.*)_\d+\.jpg$", stem).groups()[0]
extract_label(fname)

And you can see it properly returned the right name for our file:

"beagle"

Next a Dataset class should be made to handle grabbing the image and the label:

class PetsDataset(Dataset):
    def __init__(self, file_names, image_transform=None, label_to_id=None):
        self.file_names = file_names
        self.image_transform = image_transform
        self.label_to_id = label_to_id

    def __len__(self):
        return len(self.file_names)

    def __getitem__(self, idx):
        fname = self.file_names[idx]
        raw_image = PIL.Image.open(fname)
        image = raw_image.convert("RGB")
        if self.image_transform is not None:
            image = self.image_transform(image)
        label = extract_label(fname)
        if self.label_to_id is not None:
            label = self.label_to_id[label]
        return {"image": image, "label": label}

Now to build the dataset. Outside the training function you can find and declare all the filenames and labels and use them as references inside the launched function:

fnames = [os.path.join("../../images", fname) for fname in fnames if fname.endswith(".jpg")]

Next gather all the labels:

all_labels = [extract_label(fname) for fname in fnames]
id_to_label = list(set(all_labels))
id_to_label.sort()
label_to_id = {lbl: i for i, lbl in enumerate(id_to_label)}

Next, you should make a get_dataloaders function that will return your built dataloaders for you. As mentioned earlier, if data is automatically sent to the GPU or a TPU device when building your DataLoaders, they must be built using this method.

def get_dataloaders(batch_size: int = 64):
    "Builds a set of dataloaders with a batch_size"
    random_perm = np.random.permutation(len(fnames))
    cut = int(0.8 * len(fnames))
    train_split = random_perm[:cut]
    eval_split = random_perm[cut:]

    # For training a simple RandomResizedCrop will be used
    train_tfm = Compose([RandomResizedCrop((224, 224), scale=(0.5, 1.0)), ToTensor()])
    train_dataset = PetsDataset([fnames[i] for i in train_split], image_transform=train_tfm, label_to_id=label_to_id)

    # For evaluation a deterministic Resize will be used
    eval_tfm = Compose([Resize((224, 224)), ToTensor()])
    eval_dataset = PetsDataset([fnames[i] for i in eval_split], image_transform=eval_tfm, label_to_id=label_to_id)

    # Instantiate dataloaders
    train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=batch_size, num_workers=4)
    eval_dataloader = DataLoader(eval_dataset, shuffle=False, batch_size=batch_size * 2, num_workers=4)
    return train_dataloader, eval_dataloader

Finally, you should import the scheduler to be used later:

from torch.optim.lr_scheduler import CosineAnnealingLR

Writing the Training Function

Now you can build the training loop. notebook_launcher() works by passing in a function to call that will be ran across the distributed system.

Here is a basic training loop for the animal classification problem:

The code has been split up to allow for explanations on each section. A full version that can be copy and pasted will be available at the end

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    accelerator = Accelerator(mixed_precision=mixed_precision)

First you should set the seed and create an Accelerator object as early in the training loop as possible.

If training on the TPU, your training loop should take in the model as a parameter and it should be instantiated outside of the training loop function. See the TPU best practices to learn why

Next you should build your dataloaders and create your model:

    train_dataloader, eval_dataloader = get_dataloaders(batch_size)
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

You build the model here so that the seed also controls the new weight initialization

As you are performing transfer learning in this example, the encoder of the model starts out frozen so the head of the model can be trained only initially:

    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True

Normalizing the batches of images will make training a little faster:

    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None]
    std = torch.tensor(model.default_cfg["std"])[None, :, None, None]

To make these constants available on the active device, you should set it to the Accelerator’s device:

    mean = mean.to(accelerator.device)
    std = std.to(accelerator.device)

Next instantiate the rest of the PyTorch classes used for training:

    optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)
    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader))

Before passing everything to prepare().

There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the prepare method.

    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

Now train the model:

    for epoch in range(5):
        model.train()
        for batch in train_dataloader:
            inputs = (batch["image"] - mean) / std
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

The evaluation loop will look slightly different compared to the training loop. The number of elements passed as well as the overall total accuracy of each batch will be added to two constants:

        model.eval()
        accurate = 0
        num_elems = 0

Next you have the rest of your standard PyTorch loop:

        for batch in eval_dataloader:
            inputs = (batch["image"] - mean) / std
            with torch.no_grad():
                outputs = model(inputs)
            predictions = outputs.argmax(dim=-1)

Before finally the last major difference.

When performing distributed evaluation, the predictions and labels need to be passed through gather() so that all of the data is available on the current device and a properly calculated metric can be achieved:

            accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"])
            num_elems += accurate_preds.shape[0]
            accurate += accurate_preds.long().sum()

Now you just need to calculate the actual metric for this problem, and you can print it on the main process using print():

        eval_metric = accurate.item() / num_elems
        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

A full version of this training loop is available below:

def training_loop(mixed_precision="fp16", seed: int = 42, batch_size: int = 64):
    set_seed(seed)
    # Initialize accelerator
    accelerator = Accelerator(mixed_precision=mixed_precision)
    # Build dataloaders
    train_dataloader, eval_dataloader = get_dataloaders(batch_size)

    # Instantiate the model (you build the model here so that the seed also controls new weight initaliziations)
    model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

    # Freeze the base model
    for param in model.parameters():
        param.requires_grad = False
    for param in model.get_classifier().parameters():
        param.requires_grad = True

    # You can normalize the batches of images to be a bit faster
    mean = torch.tensor(model.default_cfg["mean"])[None, :, None, None]
    std = torch.tensor(model.default_cfg["std"])[None, :, None, None]

    # To make these constants available on the active device, set it to the accelerator device
    mean = mean.to(accelerator.device)
    std = std.to(accelerator.device)

    # Instantiate the optimizer
    optimizer = torch.optim.Adam(params=model.parameters(), lr=3e-2 / 25)

    # Instantiate the learning rate scheduler
    lr_scheduler = OneCycleLR(optimizer=optimizer, max_lr=3e-2, epochs=5, steps_per_epoch=len(train_dataloader))

    # Prepare everything
    # There is no specific order to remember, you just need to unpack the objects in the same order you gave them to the
    # prepare method.
    model, optimizer, train_dataloader, eval_dataloader, lr_scheduler = accelerator.prepare(
        model, optimizer, train_dataloader, eval_dataloader, lr_scheduler
    )

    # Now you train the model
    for epoch in range(5):
        model.train()
        for batch in train_dataloader:
            inputs = (batch["image"] - mean) / std
            outputs = model(inputs)
            loss = torch.nn.functional.cross_entropy(outputs, batch["label"])
            accelerator.backward(loss)
            optimizer.step()
            lr_scheduler.step()
            optimizer.zero_grad()

        model.eval()
        accurate = 0
        num_elems = 0
        for batch in eval_dataloader:
            inputs = (batch["image"] - mean) / std
            with torch.no_grad():
                outputs = model(inputs)
            predictions = outputs.argmax(dim=-1)
            accurate_preds = accelerator.gather(predictions) == accelerator.gather(batch["label"])
            num_elems += accurate_preds.shape[0]
            accurate += accurate_preds.long().sum()

        eval_metric = accurate.item() / num_elems
        # Use accelerator.print to print only on the main process.
        accelerator.print(f"epoch {epoch}: {100 * eval_metric:.2f}")

Using the notebook_launcher

All that’s left is to use the notebook_launcher().

You pass in the function, the arguments (as a tuple), and the number of processes to train on. (See the documentation for more information)

from accelerate import notebook_launcher
args = ("fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=2)

In the case of running on multiple nodes, you need to set up a Jupyter session at each node and run the launching cell at the same time.

For an environment containing 2 nodes (computers) with 8 GPUs each and the main computer with an IP address of “172.31.43.8”, it would look like so:

notebook_launcher(training_loop, args, master_addr="172.31.43.8", node_rank=0, num_nodes=2, num_processes=8)

And in the second Jupyter session on the other machine:

Notice how the node_rank has changed

notebook_launcher(training_loop, args, master_addr="172.31.43.8", node_rank=1, num_nodes=2, num_processes=8)

In the case of running on the TPU, it would look like so:

model = create_model("resnet50d", pretrained=True, num_classes=len(label_to_id))

args = (model, "fp16", 42, 64)
notebook_launcher(training_loop, args, num_processes=8)

As it’s running it will print the progress as well as state how many devices you ran on. This tutorial was ran with two GPUs:

Launching training on 2 GPUs.
epoch 0: 88.12
epoch 1: 91.73
epoch 2: 92.58
epoch 3: 93.90
epoch 4: 94.71

And that’s it!

Debugging

A common issue when running the notebook_launcher is receiving a CUDA has already been initialized issue. This usually stems from an import or prior code in the notebook that makes a call to the PyTorch torch.cuda sublibrary. To help narrow down what went wrong, you can launch the notebook_launcher with ACCELERATE_DEBUG_MODE=yes in your environment and an additional check will be made when spawning that a regular process can be created and utilize CUDA without issue. (Your CUDA code can still be ran afterwards).

Conclusion

This notebook showed how to perform distributed training from inside of a Jupyter Notebook. Some key notes to remember:

  • Make sure to save any code that use CUDA (or CUDA imports) for the function passed to notebook_launcher()
  • Set the num_processes to be the number of devices used for training (such as number of GPUs, CPUs, TPUs, etc)
  • If using the TPU, declare your model outside the training loop function