---
title: "Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale"

format: 
    revealjs:
        theme: moon
        fig-format: png
        auto-play-media: true
---

## Who am I?

- Zachary Mueller
- Technical Lead for the 🤗 Accelerate project
- Maintain the `transformers` Trainer
- API design geek

## What is 🤗 Accelerate?

* A training framework
* An inference framework
* A command-line interface

## A Training Framework

* Powered by PyTorch
* Change a few lines of code, gain device *and* hardware-agnostic capabilities
* Low-code, with minimal magic aimed at easy hackability and use without high-level abstractions
* We handle the intricacies so you don't have to

## A Training Framework

::: {style="font-size: 70%;"}

* Support for any hardware accelerator on the market:
  * CPU, GPU, TPU, XPU, NPU, MLU
* Automatic mixed-precision training *safely* in whatever fashion you may choose:
  * FP16, BF16, FP8 (through either `TransformerEngine` or `MS-AMP`)
* Automatic and efficient gradient accumulation (sketched on the next slide)
* Support for quantization through `bitsandbytes`
* Support for your favorite experiment trackers (`aim`, `clearml`, `comet_ml`, `dvclive`, `mlflow`, `tensorboard`, `wandb`)
* Easy-to-configure plugin or YAML-level API for setting up advanced frameworks like `FSDP`, `DeepSpeed`, and `Megatron-LM`
:::
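
## A Training Framework

::: {style="font-size: 70%;"}
A minimal sketch of how a few of those features surface in the API (the flag values, project name, and tiny random-data model here are illustrative only):
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from accelerate import Accelerator

# Mixed precision, gradient accumulation, and tracking are all opt-in flags
accelerator = Accelerator(
    mixed_precision="bf16",           # or "fp16", "fp8", "no"
    gradient_accumulation_steps=4,
    log_with="tensorboard",
)
accelerator.init_trackers("my_project")  # placeholder project name

# A tiny stand-in model and dataset, just to keep the sketch runnable
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(
    [(torch.randn(8), torch.tensor(0)) for _ in range(64)], batch_size=8
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for source, targets in dataloader:
    # `accumulate` skips gradient sync and the optimizer step on accumulation steps
    with accelerator.accumulate(model):
        output = model(source)
        loss = torch.nn.functional.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    accelerator.log({"loss": loss.item()})

accelerator.end_training()
```
:::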

## Low-Code 

::: {style="font-size: 70%;"}
* The biggest friction with "wrapper" libraries is giving up control of your code
* By being minimally intrusive, your code just "works" while still giving you complete control
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  1. Create an `Accelerator`
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
  device = 'cpu'

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  2. Wrap your PyTorch objects with `accelerator.prepare` and remove device-placements
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  3. Use `accelerator.backward` for the backward pass
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
  device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::


## But what about inference?

* 🤗 Accelerate is not just for training; it has helped the GPU-poor take control of the narrative
* Using tools like Big Model Inference, users with *tiny* compute can run large models locally
* It started with the Stable Diffusion boom and has since scaled to running huge LLMs locally on a single graphics card

## How does it work?

* PyTorch introduced `device="meta"`
* 🤗 Accelerate introduced `device_map="auto"`

::: {style="padding-left:15%;padding-right:20%"}
{{< video big_model_visualization.mp4 width="800" height="400" >}}
:::
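
## How does it work?

::: {style="font-size: 70%;"}
A rough sketch of what happens under the hood (the model id and checkpoint path are placeholders): the model skeleton is built on the `meta` device so no memory is allocated, then the real weights are dispatched across GPU(s), CPU, and disk.
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# 1. Create the model "empty" on the meta device: no RAM/VRAM is used yet
config = AutoConfig.from_pretrained("some-large-llm")  # placeholder model id
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2. Load the real weights and split them across the available devices
model = load_checkpoint_and_dispatch(
    model,
    "path/to/downloaded/checkpoint",  # placeholder path to the weights on disk
    device_map="auto",
)
```
:::

::: {style="font-size: 70%;"}
In `transformers`, `AutoModelForCausalLM.from_pretrained(..., device_map="auto")` does the same thing for you.
:::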

## A CLI Interface
* `accelerate config`
  * Configure the environment
* `accelerate launch`
  * How to run your script

## Launching distributed training is hard

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
python script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
torchrun --nnodes=1 --nproc_per_node=2 script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
deepspeed --num_gpus=2 script.py
```
<br>
:::
How can we make this better?


## `accelerate launch`

::: {style="padding-top:0%;padding-left:5%;padding-right:10%;padding-bottom:0%"}
```bash
accelerate launch script.py
```

<br>

```bash
accelerate launch --multi_gpu --num_processes 2 script.py
```

<br>

```bash
accelerate launch \
  --multi_gpu \ 
  --use_deepspeed \
  --num_processes 2 \
  script.py
```
:::

## `accelerate config`
* Relies on `config.yaml` files
* Either run `accelerate config` interactively or write your own:

:::: {.columns style="font-size: 60%;padding-left:5%;padding-right:5%"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::
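
::: {style="font-size: 70%;"}
* Point a run at either file with `accelerate launch --config_file fsdp_config.yaml script.py`; without the flag, the default config written by `accelerate config` is used
:::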

# Now that you're up to speed, what's new?

# We've had a busy last year, and so has the ML Community!

## New training techniques
- Quantization has taken the field by storm
- New ideas such as FSDP + QLoRA to train huge models on tiny compute!
- New precision backends as we train natively in lower precision
- Optimizing further how much we can push on a single machine through efficient RAM and timing techniques

## Larger compute landscape
- As we search for alternatives to NVIDIA, new hardware accelerators rise:
  - XPU (Intel)
  - NPU (Huawei Ascend)
  - MLU (Cambricon)

All of which are supported by 🤗 Accelerate


## Lower abstractions
* While the `Accelerator` was great, we needed better abstractions focused on controlling specific behaviors
* Introduced the `PartialState`

::: {style="padding-left:10%;padding-top:0%;padding-right:15%"}
```python
from accelerate import PartialState

state = PartialState()

if state.is_main_process:
    ...  # Run on only one process (e.g. logging, saving)

with state.main_process_first():
    ...  # Useful for dataset processing: the main process runs first, the rest wait

# Device-agnostic without the bulk of the `Accelerator`
device = state.device
```
:::

## Faster and better inference alternatives
::: {style="font-size:70%"}
- `PiPPy` gives us efficient pipeline-parallelism in distributed environments to increase throughput while keeping a simple torch-bound API
- Rather than having to wait for each GPU, every GPU can be busy in parallel
- Will be critical as larger LLMs take hold and more than one machine is needed
:::
::: {style="font-size:60%;padding-left:19%;padding-top:0%;padding-right:24%;"}
```python
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)

model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)
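
# The forward output only materializes on the last pipeline stage,
# so inspect/gather it there (this follow-up mirrors the upstream example)
if PartialState().is_last_process:
    print(output)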
```
:::


# Adoption: Accelerate in the ecosystem

## Accelerate in the Ecosystem
* Many of the frameworks you use daily already rely on 🤗 Accelerate!
  * Nearly all of 🤗
  * `axolotl`
  * `fastai`
  * `FastChat`
  * `lucidrains`
  * `kornia`


## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Started as a way to isolate the distributed code needed for TPUs and `DistributedDataParallel`
:::

::: {style="padding-left: 30%"}
![](sylvain_tweet.JPG){width="70%"}
:::

## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Now it is the backbone of some of the largest PyTorch training frameworks in the ecosystem
:::
::: {style="padding-left: 30%;"}
![](hf_trainer.JPG){width="70%"}
:::

# What's next?

# Elevating the community

* Now that more advanced training techniques (FSDP, DeepSpeed, etc.) are within reach, we need to focus on educating the community on how to use them best
* This goes beyond how to use the `Trainer` or `Accelerator`: it's knowing *what* to use *where*
* Keep Accelerate a tool the community can reach for and experiment with as new techniques come out, so new ideas can be pushed to scale quickly

# 1.0.0: Soon!

* Tried and battle-tested by over 7M users/month | 110M+ total downloads
* As we've been stable for over a year now, we're near ready to release 1.0.0

# Thanks for joining!

::: {style="font-size: 70%;"}

- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)
:::