---
title: "Hugging Face Accelerate: Making device-agnostic ML training and inference easy at scale"

format: 
    revealjs:
        theme: moon
        fig-format: png
        auto-play-media: true
---

## Who am I?

- Zachary Mueller
- Technical Lead for the 🤗 Accelerate project
- Maintain the `transformers` Trainer
- API design geek

## What is 🤗 Accelerate?

* A training framework
* An inference framework
* A command-line interface

## A Training Framework

* Powered by PyTorch
* Change a few lines of code, gain device *and* hardware-agnostic capabilities
* Low-code, with minimal magic aimed at easy hackability and use without high-level abstractions
* We handle the intricacies so you don't have to

## A Training Framework

::: {style="font-size: 70%;"}

* Support for any hardware accelerator on the market:
  * CPU, GPU, TPU, XPU, NPU, MLU
* Automatic mixed-precision training *safely* in whatever fashion you may choose:
  * FP16, BF16, FP8 (through either `TransformerEngine` or `MS-AMP`)
* Automatic and efficient gradient accumulation (sketched on the next slide)
* Support for quantization through `bitsandbytes`
* Support for your favorite experiment trackers (`aim`, `clearml`, `comet_ml`, `dvclive`, `mlflow`, `tensorboard`, `wandb`)
* Easy-to-configure plugin or YAML-level API for setting up advanced frameworks like `FSDP`, `DeepSpeed`, and `Megatron-LM`
:::
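
## A Training Framework

::: {style="font-size: 70%;"}
A minimal sketch of how a few of those features surface in the API (the flag values, project name, and tiny random-data model here are illustrative only):
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
import torch
from accelerate import Accelerator

# Mixed precision, gradient accumulation, and tracking are all opt-in flags
accelerator = Accelerator(
    mixed_precision="bf16",           # or "fp16", "fp8", "no"
    gradient_accumulation_steps=4,
    log_with="tensorboard",
)
accelerator.init_trackers("my_project")  # placeholder project name

# A tiny stand-in model and dataset, just to keep the sketch runnable
model = torch.nn.Linear(8, 2)
optimizer = torch.optim.Adam(model.parameters())
dataloader = torch.utils.data.DataLoader(
    [(torch.randn(8), torch.tensor(0)) for _ in range(64)], batch_size=8
)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for source, targets in dataloader:
    # `accumulate` skips gradient sync and the optimizer step on accumulation steps
    with accelerator.accumulate(model):
        output = model(source)
        loss = torch.nn.functional.cross_entropy(output, targets)
        accelerator.backward(loss)
        optimizer.step()
        optimizer.zero_grad()
    accelerator.log({"loss": loss.item()})

accelerator.end_training()
```
:::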

## Low-Code 

::: {style="font-size: 70%;"}
* The biggest friction with "wrapper" libraries is giving up control of your code
* By being minimally intrusive, your code just "works" while still giving you complete control
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  1. Create an `Accelerator`
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
+ from accelerate import Accelerator

+ accelerator = Accelerator()
  device = 'cpu'

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  2. Wrap your PyTorch objects with `accelerator.prepare` and remove device-placements
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
- device = 'cpu'
+ device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

+ model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
          loss.backward()
          optimizer.step()
```
:::

## Easy to integrate

::: {style="font-size: 70%;"}
* Due to the low-code nature, it's trivial to integrate into existing PyTorch frameworks:
  3. Use `accelerator.backward` for the backward pass
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```diff
  import torch
  import torch.nn.functional as F
  from datasets import load_dataset
  from accelerate import Accelerator

  accelerator = Accelerator()
  device = accelerator.device

  model = torch.nn.Transformer().to(device)
  optimizer = torch.optim.Adam(model.parameters())
  dataset = load_dataset('my_dataset')
  dataloader = torch.utils.data.DataLoader(dataset, shuffle=True)

  model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

  model.train()
  for epoch in range(10):
      for source, targets in dataloader:
          source, targets = source.to(device), targets.to(device)
          optimizer.zero_grad()
          output = model(source)
          loss = F.cross_entropy(output, targets)
-         loss.backward()
+         accelerator.backward(loss)
          optimizer.step()
```
:::


## But what about inference?

* 🤗 Accelerate is not just for training; it has helped the GPU-poor take control of the narrative
* Using tools like Big Model Inference, users with *tiny* compute can run large models locally
* It started with the Stable Diffusion boom and has since scaled to running huge LLMs locally on a single graphics card

## How does it work?

* PyTorch introduced `device="meta"`
* 🤗 Accelerate introduced `device_map="auto"`

::: {style="padding-left:15%;padding-right:20%"}
{{< video big_model_visualization.mp4 width="800" height="400" >}}
:::
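
## How does it work?

::: {style="font-size: 70%;"}
A rough sketch of what happens under the hood (the model id and checkpoint path are placeholders): the model skeleton is built on the `meta` device so no memory is allocated, then the real weights are dispatched across GPU(s), CPU, and disk.
:::

::: {style="font-size: 60%;padding-left:15%;padding-top:0%;padding-right:20%"}
```python
from accelerate import init_empty_weights, load_checkpoint_and_dispatch
from transformers import AutoConfig, AutoModelForCausalLM

# 1. Create the model "empty" on the meta device: no RAM/VRAM is used yet
config = AutoConfig.from_pretrained("some-large-llm")  # placeholder model id
with init_empty_weights():
    model = AutoModelForCausalLM.from_config(config)

# 2. Load the real weights and split them across the available devices
model = load_checkpoint_and_dispatch(
    model,
    "path/to/downloaded/checkpoint",  # placeholder path to the weights on disk
    device_map="auto",
)
```
:::

::: {style="font-size: 70%;"}
In `transformers`, `AutoModelForCausalLM.from_pretrained(..., device_map="auto")` does the same thing for you.
:::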

## A CLI Interface
* `accelerate config`
  * Configure the environment
* `accelerate launch`
  * How to run your script

## Launching distributed training is hard

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
python script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
torchrun --nnodes=1 --nproc_per_node=2 script.py
```
:::
::: {style="padding-left:50%;padding-bottom:0%;padding-top:0%;"}
vs.
:::
<br>

::: {style="padding-top:0%;padding-left:10%;padding-right:15%;padding-bottom:0%"}
```bash 
deepspeed --num_gpus=2 script.py
```
<br>
:::
How can we make this better?


## `accelerate launch`

::: {style="padding-top:0%;padding-left:5%;padding-right:10%;padding-bottom:0%"}
```bash
accelerate launch script.py
```

<br>

```bash
accelerate launch --multi_gpu --num_processes 2 script.py
```

<br>

```bash
accelerate launch \
  --multi_gpu \ 
  --use_deepspeed \
  --num_processes 2 \
  script.py
```
:::

## `accelerate config`
* Relies on `config.yaml` files
* Either run `accelerate config` interactively or write your own:

:::: {.columns style="font-size: 60%;padding-left:5%;padding-right:5%"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::
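
::: {style="font-size: 70%;"}
* Point a run at either file with `accelerate launch --config_file fsdp_config.yaml script.py`; without the flag, the default config written by `accelerate config` is used
:::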

# Now that you're up to speed, what's new?

# We've had a busy last year, and so has the ML Community!

## New training techniques
- Quantization has taken the field by storm
- New ideas such as FSDP + QLoRA to train huge models on tiny compute!
- New precision backends as we train natively in lower precision
- Optimizing further how much we can push on a single machine through efficient RAM and timing techniques

## Larger compute landscape
- As we search for alternatives to NVIDIA, new hardware accelerators rise:
  - XPU (Intel)
  - NPU (Huawei Ascend)
  - MLU (Cambricon)

All of which are supported by 🤗 Accelerate


## Lower abstractions
* While the `Accelerator` was great, we needed better abstractions focused on controlling specific behaviors
* Introduced the `PartialState`

::: {style="padding-left:10%;padding-top:0%;padding-right:15%"}
```python
from accelerate import PartialState

state = PartialState()

if state.is_main_process:
    ...  # Run on only one process (e.g. logging, saving)

with state.main_process_first():
    ...  # Useful for dataset processing: the main process runs first, the rest wait

# Device-agnostic without the bulk of the `Accelerator`
device = state.device
```
:::

## Faster and better inference alternatives
::: {style="font-size:70%"}
- `PiPPy` gives us efficient pipeline-parallelism in distributed environments to increase throughput while keeping a simple torch-bound API
- Rather than having to wait for each GPU, every GPU can be busy in parallel
- Will be critical as larger LLMs take hold and more than one machine is needed
:::
::: {style="font-size:60%;padding-left:19%;padding-top:0%;padding-right:24%;"}
```python
import torch
from transformers import AutoModelForSequenceClassification

from accelerate import PartialState, prepare_pippy

model = AutoModelForSequenceClassification.from_pretrained("gpt2")
model.eval()

input = torch.randint(
    low=0,
    high=model.config.vocab_size,
    size=(2, 1024),  # bs x seq_len
    device="cpu",
)

model = prepare_pippy(model, split_points="auto", example_args=(input,))

with torch.no_grad():
    output = model(input)
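
# The forward output only materializes on the last pipeline stage,
# so inspect/gather it there (this follow-up mirrors the upstream example)
if PartialState().is_last_process:
    print(output)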
```
:::


# Adoption: Accelerate in the ecosystem

## Accelerate in the Ecosystem
* Many of the frameworks you use daily already rely on 🤗 Accelerate!
  * Nearly all of 🤗
  * `axolotl`
  * `fastai`
  * `FastChat`
  * `lucidrains`
  * `kornia`


## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Started as a way to isolate the distributed code needed for TPUs and `DistributedDataParallel`
:::

::: {style="padding-left: 30%"}
![](sylvain_tweet.JPG){width="70%"}
:::

## Accelerate in the Ecosystem
::: {style="font-size: 70%;"}
- Now it is the backbone of some of the largest PyTorch training frameworks in the ecosystem
:::
::: {style="padding-left: 30%;"}
![](hf_trainer.JPG){width="70%"}
:::

# What's next?

# Elevating the community

* Now that more advanced training techniques (FSDP, DeepSpeed, etc.) are within reach, we need to focus on educating the community on how to use them best
* This goes beyond how to use the `Trainer` or `Accelerator`: it's knowing *what* to use *where*
* Keep Accelerate a tool the community can reach for and experiment with as new techniques come out, so new ideas can be pushed to scale quickly

# 1.0.0: Soon!

* Tried and battle-tested by over 7M users/month | 110M+ total downloads
* As we've been stable for over a year now, we're near ready to release 1.0.0

# Thanks for joining!

::: {style="font-size: 70%;"}

- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)
:::