---
title: "Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale"
format:
revealjs:
theme: moon
fig-format: png
auto-play-media: true
---
## Who am I?
- Zachary Mueller
- Technical Lead for the 🤗 Accelerate project
- Maintain the `transformers` Trainer
- API design geek
## What is 🤗 Accelerate?
* A training framework
* An inference framework
* A command-line interface
## A Training Framework
* Powered by PyTorch
* Change a few lines of code, gain device *and* hardware-agnostic capabilities
* Low-code, with minimal magic aimed at easy hackability and use without high-level abstractions
* We handle the intricacies so you don't have to
## A Training Framework
::: {style="font-size: 70%;"}
* Support for any hardware-accelerator on the market:
* CPU, GPU, TPU, XPU, NPU, MLU
* Automatic mixed-precision training *safely* in whatever fashion you may choose:
* FP16, BF16, FP8 (through either `TransformerEngine` or `MS-AMP`)
* Automatic and efficient gradient accumulation (sketched, along with mixed precision, on the next slide)
* Support for quantization through `bitsandbytes`
* Support for your favorite experiment trackers (`aim`, `clearml`, `comet_ml`, `dvclive`, `mlflow`, `tensorboard`, `wandb`)
* Easy-to-configure plugin- or YAML-level APIs for setting up advanced frameworks like `FSDP`, `DeepSpeed`, and `Megatron-LM`
:::
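## A Training Framework
::: {style="font-size: 70%;"}
A minimal sketch of what mixed precision and gradient accumulation look like in user code; the tiny model, optimizer, and dataloader below are placeholders, and the specific values (`bf16`, 4 accumulation steps) are purely illustrative:
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from accelerate import Accelerator

# Mixed precision and gradient accumulation are configured on the Accelerator
accelerator = Accelerator(mixed_precision="bf16", gradient_accumulation_steps=4)

# Placeholder objects; swap in your own model/optimizer/dataloader
model = torch.nn.Linear(64, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(
    [(torch.randn(64), torch.tensor(0)) for _ in range(32)], batch_size=8
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for source, targets in dataloader:
    # accumulate() only syncs gradients and steps every 4 batches
    with accelerator.accumulate(model):
        output = model(source)
        loss = torch.nn.functional.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
```
:::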
## Low-Code
::: {style="font-size: 70%;"}
* The biggest friction point with "wrapper" libraries is giving up control of your code
* By staying minimally intrusive, Accelerate lets your code just "work" while you keep complete control
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
import torch
import torch.nn.functional as F
from datasets import load_dataset
+ from accelerate import Accelerator
+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device
model = torch.nn.Transformer().to(device)
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
model.train()
for epoch in range(10):
    for source, targets in dataloader:
        source, targets = source.to(device), targets.to(device)
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
-       loss.backward()
+       accelerator.backward(loss)
        optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
1. Create an `Accelerator`
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
import torch
import torch.nn.functional as F
from datasets import load_dataset
+ from accelerate import Accelerator
+ accelerator = Accelerator()
device = 'cpu'
model = torch.nn.Transformer().to(device)
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
model.train()
for epoch in range(10):
    for source, targets in dataloader:
        source, targets = source.to(device), targets.to(device)
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()
        optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
2. Wrap your PyTorch objects with `accelerator.prepare` and remove device-placements
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
import torch
import torch.nn.functional as F
from datasets import load_dataset
from accelerate import Accelerator
accelerator = Accelerator()
- device = 'cpu'
- model = torch.nn.Transformer().to(device)
+ model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
model.train()
for epoch in range(10):
    for source, targets in dataloader:
-       source, targets = source.to(device), targets.to(device)
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
        loss.backward()
        optimizer.step()
```
:::
## Easy to integrate
::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
3. Use `accelerator.backward` for the backward pass
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
import torch
import torch.nn.functional as F
from datasets import load_dataset
from accelerate import Accelerator
accelerator = Accelerator()
model = torch.nn.Transformer()
optimizer = torch.optim.Adam(model.parameters())
dataset = load_dataset('my_dataset')
dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)
model.train()
for epoch in range(10):
    for source, targets in dataloader:
        optimizer.zero_grad()
        output = model(source)
        loss = F.cross_entropy(output, targets)
-       loss.backward()
+       accelerator.backward(loss)
        optimizer.step()
```
:::
## But what about inference?
* 🤗 Accelerate is not just for training: it has helped the GPU-poor take control of the narrative
* Using tools like Big Model Inference, users with *tiny* compute can run large models locally
* It started with the Stable Diffusion boom, and has since scaled to running huge LLMs locally on a single graphics card
## How does it work?
* PyTorch introduced `device="meta"`
* 🤗 Accelerate introduced `device_map="auto"` (sketched on the next slide)
::: {style="padding-left:15%;padding-right:20%"}
{{< video big_model_visualization.mp4 width="800" height="400" >}}
:::
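## How does it work?
::: {style="font-size: 70%;"}
A minimal sketch of those two primitives in action, assuming a `transformers` checkpoint already on disk; the model name and path below are placeholders:
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# 1. `device="meta"`: build the model skeleton without allocating any weights
config = AutoConfig.from_pretrained("bigscience/bloom-3b")  # example checkpoint
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2. `device_map="auto"`: load the real weights and split them across
#    the available GPU(s), CPU RAM, and disk
model = load_checkpoint_and_dispatch(
    model,
    "path/to/bloom-3b-checkpoint",  # placeholder path to the downloaded weights
    device_map="auto",
    dtype=torch.float16,
)
```
:::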
## A CLI Interface
* `accelerate config`
* Configure the environment
* `accelerate launch`
* How to run your script
## Launching distributed training is hard
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
python script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
torchrun --nnodes=1 --nproc_per_node=2 script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>
::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash
deepspeed --num_gpus=2 script.py
```
<br>
:::
How can we make this better?
## `accelerate launch`
::: {style="padding-top:0%;padding-left:5%;padding-right:10%;padding-bottom:0%"}
```bash
accelerate launch script.py
```
<br>
```bash
accelerate launch --multi_gpu --num_processes 2 script.py
```
<br>
```bash
accelerate launch \
--multi_gpu \
--use_deepspeed \
--num_processes 2 \
script.py
```
:::
## `accelerate config`
* Relies on `config.yaml` files
* Either run `accelerate config` interactively or write your own:
:::: {.columns style="font-size: 60%;padding-left:5%;padding-right:5%"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
fsdp_backward_prefetch: BACKWARD_PRE
fsdp_cpu_ram_efficient_loading: true
fsdp_forward_prefetch: false
fsdp_offload_params: false
fsdp_sharding_strategy: FULL_SHARD
fsdp_state_dict_type: SHARDED_STATE_DICT
fsdp_sync_module_states: true
fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::
# Now that you're up to speed, what's new?
# We've had a busy last year, and so has the ML Community!
## New training techniques
- Quantization has taken the field by storm
- New ideas such as FSDP + QLoRA let us train huge models on tiny compute (sketched on the next slide)!
- New precision backends as we train natively in lower precision
- Further optimization of how much we can push on a single machine through efficient RAM and timing techniques
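## New training techniques
::: {style="font-size: 70%;"}
A rough sketch of the QLoRA recipe with `bitsandbytes` and `peft`; the checkpoint name and hyperparameters are illustrative, and FSDP itself would come from your `accelerate` config as shown earlier:
:::
::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# Load the base weights quantized to 4-bit (NF4) with bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # example checkpoint
    quantization_config=bnb_config,
)

# Train only small LoRA adapters on top of the frozen 4-bit weights
lora_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```
:::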
## Larger compute landscape
- As we search for alternatives to NVIDIA, new accelerators (and their software stacks) rise:
- XPU (Intel)
- NPU (Huawei Ascend)
- MLU (Cambricon)

All of which are supported by 🤗 Accelerate
## Lower abstractions
* While the `Accelerator` is great, we needed lower-level abstractions focused on controlling specific behaviors
* Introduced the `PartialState`
::: {style="padding-left:10%;padding-top:0%;padding-right:15%"}
```python
from accelerate import PartialState

if PartialState().is_main_process:
    # Run on only one process
    ...

with PartialState().main_process_first():
    # Useful for dataset processing: the main process runs first, the others wait
    ...

# Device-agnostic without the bulk of the `Accelerator`
device = PartialState().device
```
:::
## Faster and better inference alternatives
::: {style="font-size:70%"}
- `PiPPy` gives us efficient pipeline parallelism in distributed environments, increasing throughput while keeping a simple, PyTorch-native API
- Rather than GPUs sitting idle waiting on each other, every GPU can stay busy in parallel
- This will be critical as larger LLMs take hold and more than one machine is needed
:::
::: {style="font-size:60%;padding-left:19%;padding-top:0%;padding-right:24%;"}
```python
import torch
from transformers import AutoModelForSequenceClassification
from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)

model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)
```
:::
# Adoption: Accelerate in the ecosystem
## Accelerate in the Ecosystem
* Many of the frameworks you use daily already rely on 🤗 Accelerate!
* Nearly all of 🤗
* `axolotl`
* `fastai`
* `FastChat`
* `lucidrains`
* `kornia`
## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Started as a way to isolate the distributed code needed for TPUs and `DistributedDataParallel`
:::
::: {style="padding-left: 30%"}
![](sylvain_tweet.JPG){width="70%"}
:::
## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Now is the backbone of some of the largest PyTorch training frameworks in the ecosystem
:::
::: {style="padding-left: 30%;"}
![](hf_trainer.JPG){width="70%"}
:::
# What's next?
# Elevating the community
* Now that more advanced training techniques (FSDP, DeepSpeed, etc.) are within reach, we need to focus on educating the community on how to use them best
* This goes beyond how to use the `Trainer` or `Accelerator`: it's knowing *what* to use *where*
* Keep Accelerate a tool the community can pick up and play with as new techniques come out, so new ideas can be pushed to scale quickly
# 1.0.0: Soon!
* Tried and battle-tested by over 7M users/month | 110M+ total downloads
* As we've been stable for over a year now, we're near ready to release 1.0.0
# Thanks for joining!
::: {style="font-size: 70%;"}
- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)
:::