Train with a script
Along with the 🤗 Transformers notebooks, there are also example scripts demonstrating how to train a model for a task with PyTorch, TensorFlow, or JAX/Flax.
You will also find scripts we’ve used in our research projects and legacy examples which are mostly community contributed. These scripts are not actively maintained and require a specific version of 🤗 Transformers that will most likely be incompatible with the latest version of the library.
The example scripts are not expected to work out-of-the-box on every problem, and you may need to adapt the script to the problem you’re trying to solve. To help you with this, most of the scripts fully expose how data is preprocessed, allowing you to edit it as necessary for your use case.
For any feature you’d like to implement in an example script, please discuss it on the forum or in an issue before submitting a Pull Request. While we welcome bug fixes, it is unlikely we will merge a Pull Request that adds more functionality at the cost of readability.
This guide will show you how to run an example summarization training script in PyTorch and TensorFlow. All examples are expected to work with both frameworks unless otherwise specified.
Setup
To successfully run the latest version of the example scripts, you have to install 🤗 Transformers from source in a new virtual environment:
git clone https://github.com/huggingface/transformers
cd transformers
pip install .
For older versions of the example scripts, click on the toggle below:
Examples for older versions of 🤗 Transformers
Then switch your current clone of 🤗 Transformers to a specific version, like v3.5.1 for example:
git checkout tags/v3.5.1
After you’ve setup the correct library version, navigate to the example folder of your choice and install the example specific requirements:
pip install -r requirements.txt
Run a script
The example script downloads and preprocesses a dataset from the 🤗 Datasets library. Then the script fine-tunes a dataset with the Trainer on an architecture that supports summarization. The following example shows how to fine-tune T5-small on the CNN/DailyMail dataset. The T5 model requires an additional source_prefix
argument due to how it was trained. This prompt lets T5 know this is a summarization task.
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
The example script downloads and preprocesses a dataset from the 🤗 Datasets library. Then the script fine-tunes a dataset using Keras on an architecture that supports summarization. The following example shows how to fine-tune T5-small on the CNN/DailyMail dataset. The T5 model requires an additional source_prefix
argument due to how it was trained. This prompt lets T5 know this is a summarization task.
python examples/tensorflow/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 16 \
--num_train_epochs 3 \
--do_train \
--do_eval
Distributed training and mixed precision
The Trainer supports distributed training and mixed precision, which means you can also use it in a script. To enable both of these features:
- Add the
fp16
argument to enable mixed precision. - Set the number of GPUs to use with the
nproc_per_node
argument.
torchrun \
--nproc_per_node 8 pytorch/summarization/run_summarization.py \
--fp16 \
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
TensorFlow scripts utilize a MirroredStrategy
for distributed training, and you don’t need to add any additional arguments to the training script. The TensorFlow script will use multiple GPUs by default if they are available.
Run a script on a TPU
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. PyTorch supports TPUs with the XLA deep learning compiler (see here for more details). To use a TPU, launch the xla_spawn.py
script and use the num_cores
argument to set the number of TPU cores you want to use.
python xla_spawn.py --num_cores 8 \
summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
Tensor Processing Units (TPUs) are specifically designed to accelerate performance. TensorFlow scripts utilize a TPUStrategy
for training on TPUs. To use a TPU, pass the name of the TPU resource to the tpu
argument.
python run_summarization.py \
--tpu name_of_tpu_resource \
--model_name_or_path google-t5/t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 16 \
--num_train_epochs 3 \
--do_train \
--do_eval
Run a script with 🤗 Accelerate
🤗 Accelerate is a PyTorch-only library that offers a unified method for training a model on several types of setups (CPU-only, multiple GPUs, TPUs) while maintaining complete visibility into the PyTorch training loop. Make sure you have 🤗 Accelerate installed if you don’t already have it:
Note: As Accelerate is rapidly developing, the git version of accelerate must be installed to run the scripts
pip install git+https://github.com/huggingface/accelerate
Instead of the run_summarization.py
script, you need to use the run_summarization_no_trainer.py
script. 🤗 Accelerate supported scripts will have a task_no_trainer.py
file in the folder. Begin by running the following command to create and save a configuration file:
accelerate config
Test your setup to make sure it is configured correctly:
accelerate test
Now you are ready to launch the training:
accelerate launch run_summarization_no_trainer.py \
--model_name_or_path google-t5/t5-small \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir ~/tmp/tst-summarization
Use a custom dataset
The summarization script supports custom datasets as long as they are a CSV or JSON Line file. When you use your own dataset, you need to specify several additional arguments:
train_file
andvalidation_file
specify the path to your training and validation files.text_column
is the input text to summarize.summary_column
is the target text to output.
A summarization script using a custom dataset would look like this:
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--train_file path_to_csv_or_jsonlines_file \
--validation_file path_to_csv_or_jsonlines_file \
--text_column text_column_name \
--summary_column summary_column_name \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--overwrite_output_dir \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--predict_with_generate
Test a script
It is often a good idea to run your script on a smaller number of dataset examples to ensure everything works as expected before committing to an entire dataset which may take hours to complete. Use the following arguments to truncate the dataset to a maximum number of samples:
max_train_samples
max_eval_samples
max_predict_samples
python examples/pytorch/summarization/run_summarization.py \
--model_name_or_path google-t5/t5-small \
--max_train_samples 50 \
--max_eval_samples 50 \
--max_predict_samples 50 \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate
Not all example scripts support the max_predict_samples
argument. If you aren’t sure whether your script supports this argument, add the -h
argument to check:
examples/pytorch/summarization/run_summarization.py -h
Resume training from checkpoint
Another helpful option to enable is resuming training from a previous checkpoint. This will ensure you can pick up where you left off without starting over if your training gets interrupted. There are two methods to resume training from a checkpoint.
The first method uses the output_dir previous_output_dir
argument to resume training from the latest checkpoint stored in output_dir
. In this case, you should remove overwrite_output_dir
:
python examples/pytorch/summarization/run_summarization.py
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--output_dir previous_output_dir \
--predict_with_generate
The second method uses the resume_from_checkpoint path_to_specific_checkpoint
argument to resume training from a specific checkpoint folder.
python examples/pytorch/summarization/run_summarization.py
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--resume_from_checkpoint path_to_specific_checkpoint \
--predict_with_generate
Share your model
All scripts can upload your final model to the Model Hub. Make sure you are logged into Hugging Face before you begin:
huggingface-cli login
Then add the push_to_hub
argument to the script. This argument will create a repository with your Hugging Face username and the folder name specified in output_dir
.
To give your repository a specific name, use the push_to_hub_model_id
argument to add it. The repository will be automatically listed under your namespace.
The following example shows how to upload a model with a specific repository name:
python examples/pytorch/summarization/run_summarization.py
--model_name_or_path google-t5/t5-small \
--do_train \
--do_eval \
--dataset_name cnn_dailymail \
--dataset_config "3.0.0" \
--source_prefix "summarize: " \
--push_to_hub \
--push_to_hub_model_id finetuned-t5-cnn_dailymail \
--output_dir /tmp/tst-summarization \
--per_device_train_batch_size=4 \
--per_device_eval_batch_size=4 \
--overwrite_output_dir \
--predict_with_generate