PyTorch Custom Containers GPU Template

Overview

The directory provides code to fine-tune a transformer model (BERT-base) from the Hugging Face Transformers library for a sentiment analysis task. BERT (Bidirectional Encoder Representations from Transformers) is a transformer model pre-trained on a large corpus of unlabeled text in a self-supervised fashion. In this sample, we use the IMDB sentiment classification dataset for the task. We show how to package a PyTorch training application and submit it to Vertex AI using pre-built PyTorch containers, handling Python dependencies with Vertex Training custom containers.
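At a high level, the trainer fine-tunes a pre-trained BERT checkpoint with a sequence classification head on IMDB. The sketch below illustrates that flow end to end; the checkpoint name and hyperparameters shown are illustrative assumptions rather than the sample's exact defaults, and the real code is organized into the modules described under Trainer Modules.

    # Minimal sketch of the fine-tuning flow this sample implements.
    # The checkpoint name and hyperparameters are illustrative, not the
    # sample's exact defaults; the real code lives in the trainer modules.
    from datasets import load_dataset
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    dataset = load_dataset("imdb")  # binary sentiment labels (pos/neg)
    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

    def tokenize(batch):
        return tokenizer(batch["text"], padding="max_length", truncation=True)

    encoded = dataset.map(tokenize, batched=True)

    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2)  # adds the classification head

    args = TrainingArguments(output_dir="/tmp/imdb-bert",
                             num_train_epochs=1,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args,
            train_dataset=encoded["train"],
            eval_dataset=encoded["test"]).train()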

Prerequisites

Note: These instructions are for local testing only. When you submit a training job to Vertex AI, no code is executed on your local machine.

Directory Structure

  • trainer directory: all Python modules to train the model.
  • scripts directory: command-line scripts to train the model on Vertex AI.
  • setup.py: specifies the Python dependencies required for the training job. Vertex Training uses pip to install the package on the training instances allocated for the job (a minimal sketch follows this list).
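A minimal sketch of what that setup.py can look like; the package name and the dependency list below are illustrative assumptions, not the sample's exact values.

    # Illustrative setup.py: the package name and dependencies are examples
    # only. Vertex Training installs this package with pip on each training node.
    from setuptools import find_packages, setup

    setup(
        name="trainer",                  # hypothetical distribution name
        version="0.1",
        packages=find_packages(),
        install_requires=[
            "transformers",              # BERT model and tokenizer
            "datasets",                  # IMDB dataset loading
            "google-cloud-storage",      # exporting the model to a GCS bucket
        ],
        description="Vertex AI custom training: BERT sentiment classification.",
    )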

Trainer Modules

File Name        Purpose
metadata.py      Defines metadata for the classification task, such as the pre-trained model, dataset name, and target labels.
utils.py         Includes utility functions, such as data input functions to read data and a helper to save the model to a GCS bucket.
model.py         Includes a function to create a model with a sequence classification head from a pre-trained model.
experiment.py    Runs the model training and evaluation experiment, and exports the final model.
task.py          Initializes and parses task arguments (hyperparameters), and serves as the entry point to the trainer (sketched below).
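The sketch below illustrates the task.py pattern of parsing hyperparameters and handing off to the experiment module. The argument names, defaults, and the experiment.run call are assumptions for illustration, not the sample's actual interface.

    # Illustrative task.py-style entry point. Argument names and defaults are
    # examples, and `experiment.run` is a hypothetical stand-in for the hand-off
    # to the sample's experiment module.
    import argparse

    def parse_args():
        parser = argparse.ArgumentParser(description="BERT sentiment trainer")
        parser.add_argument("--model-name", default="bert-base-cased")
        parser.add_argument("--epochs", type=int, default=2)
        parser.add_argument("--batch-size", type=int, default=16)
        parser.add_argument("--learning-rate", type=float, default=2e-5)
        parser.add_argument("--job-dir", default="/tmp/model",
                            help="Local or GCS path where the model is exported")
        return parser.parse_args()

    if __name__ == "__main__":
        args = parse_args()
        from trainer import experiment   # hypothetical module layout
        experiment.run(args)             # run training/evaluation and export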

Scripts

  • train-cloud.sh: builds your Docker image locally, pushes the image to Container Registry, and submits a custom container training job to Vertex AI.

Please read the documentation on Vertex Training with Custom Containers for more details.

How to run

Once the prerequisites are satisfied, you may:

  1. For local testing, run (refer to the notebook for instructions):
    CUSTOM_TRAIN_IMAGE_URI='gcr.io/{PROJECT_ID}/pytorch_gpu_train_{APP_NAME}'
    cd ./custom_container/ && docker build -f Dockerfile -t $CUSTOM_TRAIN_IMAGE_URI ../python_package
    docker run --gpus all -it --rm $CUSTOM_TRAIN_IMAGE_URI
    
  2. For cloud testing, run:
    source ./scripts/train-cloud.sh
    

Run on GPU

The provided trainer code runs on a GPU if one is available; this includes data loading and model creation.
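A minimal sketch of that device-selection pattern, assuming a Transformers-style model; the checkpoint and variable names are illustrative, not the sample's actual code.

    # Illustrative GPU/CPU device selection: use the GPU when PyTorch can see
    # one, otherwise fall back to the CPU. Names here are examples only.
    import torch
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
    model = AutoModelForSequenceClassification.from_pretrained(
        "bert-base-cased", num_labels=2).to(device)

    # Inputs are moved to the same device before the forward pass, so both
    # data loading and model computation use the GPU when one is available.
    batch = tokenizer("A surprisingly good movie.", return_tensors="pt").to(device)
    logits = model(**batch).logits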

To run the trainer code on a different GPU configuration or on a newer pre-built PyTorch container image, update the accelerator settings and the container image URI in the training script (scripts/train-cloud.sh).

Then, run the script to submit a custom training job to Vertex AI:

source ./scripts/train-cloud.sh

Versions

This script uses the pre-built PyTorch GPU container image for PyTorch 1.7:

  • us-docker.pkg.dev/vertex-ai/training/pytorch-gpu.1-7:latest