Accelerated PyTorch Training on Mac
With the PyTorch v1.12 release, developers and researchers can take advantage of Apple silicon GPUs for significantly faster model training.
This unlocks the ability to perform machine learning workflows like prototyping and fine-tuning locally, right on Mac.
Apple's Metal Performance Shaders (MPS) backend for PyTorch enables this and can be used via the new `"mps"` device.
This maps computational graphs and primitives onto the MPS Graph framework and the tuned kernels provided by MPS.
For more information, please refer to the official documents Introducing Accelerated PyTorch Training on Mac and MPS Backend.
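For illustration, here is a minimal sketch of using the `"mps"` device directly in plain PyTorch; the tensor shapes and operations are arbitrary placeholders:

```python
import torch

# Select the Apple silicon GPU via the MPS backend (PyTorch >= 1.12)
device = torch.device("mps")

# Regular tensor work now runs on the GPU through tuned MPS kernels
x = torch.randn(64, 128, device=device)
w = torch.randn(128, 32, device=device)
y = x @ w
print(y.device)  # mps:0
```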
Benefits of Training and Inference using Apple M1 Chips
- Enables users to train larger networks or batch sizes locally
- Reduces data retrieval latency and gives the GPU direct access to the full memory store thanks to the unified memory architecture, thereby improving end-to-end performance.
- Reduces costs associated with cloud-based development or the need for additional local GPUs.
Pre-requisites: To install torch with MPS support, please follow this nice Medium article GPU-Acceleration Comes to PyTorch on M1 Macs.
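Before configuring 🤗 Accelerate, a quick sanity check along these lines can confirm that your installed build actually supports MPS:

```python
import torch

# is_built(): this PyTorch binary was compiled with MPS support
# is_available(): the current macOS version/hardware can actually use it
print(f"MPS built:     {torch.backends.mps.is_built()}")
print(f"MPS available: {torch.backends.mps.is_available()}")
```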
How it works out of the box
On your machine(s) just run:
accelerate config
and answer the questions asked, specifically choose `MPS` for the query `Which type of machine are you using?`.
This will generate a config file that will be used automatically to properly set the default options when doing `accelerate launch`, such as the one shown below:
compute_environment: LOCAL_MACHINE
deepspeed_config: {}
distributed_type: MPS
downcast_bf16: 'no'
fsdp_config: {}
machine_rank: 0
main_process_ip: null
main_process_port: null
main_training_function: main
mixed_precision: 'no'
num_machines: 1
num_processes: 1
use_cpu: false
After this configuration has been made, here is how you run the CV example (from the root of the repo) with MPS enabled:
accelerate launch examples/cv_example.py --data_dir images
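For reference, the training script itself needs nothing MPS-specific; a minimal sketch (hypothetical, not taken from `cv_example.py`) shows that the `Accelerator` object picks the device up from the generated config:

```python
from accelerate import Accelerator

accelerator = Accelerator()

# When run via `accelerate launch` with the config above, this prints "mps"
print(accelerator.device)

# Models, optimizers and dataloaders are then placed on that device as usual:
# model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
```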
A few caveats to be aware of
- For `nlp_example.py` the metrics are far worse than with CPU-only training. This means certain operations in the BERT model go wrong when using the `mps` device, and this needs to be fixed by PyTorch.
- Distributed setups `gloo` and `nccl` are not working with the `mps` device. This means that currently only a single GPU of `mps` device type can be used.
Finally, please remember that 🤗 Accelerate only integrates the MPS backend; if you have any problems or questions about MPS backend usage, please file an issue with PyTorch GitHub.