|
--- |
|
title: "Scaling Model Training with More Compute, How Do They Do It?" |
|
|
|
format: |
|
revealjs: |
|
theme: moon |
|
fig-format: png |
|
--- |
|
|
|
|
|
|
|
- Zachary Mueller |
|
- Technical Lead for the 🤗 Accelerate project
|
- API design geek |
|
|
|
|
|
|
|
- We can roughly estimate the memory usage of vanilla full fine-tuning of models
|
- Requires certain assumptions (that I'll be covering): |
|
- Adam optimizer |
|
- Batch size of 1 |
|
|
|
|
|
|
|
General estimate (`bert-base-cased`, 108M params): |
|
|
|
- Each parameter is 4 bytes |
|
- Backward ~= 2x the model size |
|
- The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer): |
|
|
|
::: {style="font-size: 50%;"} |
|
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest | |
|
|---------|:-----|:------:|:------:|:------:|:------:| |
|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB | |
|
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB | |
|
|
|
*All estimations are based on the [Model Estimator Tool](https://huggingface.co/spaces/hf-accelerate/model-memory-usage)
|
::: |
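
As a back-of-the-envelope sketch of where those numbers come from (my own arithmetic rather than the estimator tool; activations ignored):

```python
# Rough memory math for full fine-tuning with Adam at batch size 1.
def estimate_memory_bytes(num_params: float, bytes_per_param: int = 4) -> dict:
    model = num_params * bytes_per_param   # the weights themselves
    gradients = model                      # one gradient per parameter
    backward = model + gradients           # ~2x the model size
    optimizer_step = 4 * model             # 1x model + 1x grads + 2x Adam states
    return {"model": model, "gradients": gradients,
            "backward": backward, "optimizer_step": optimizer_step}

MiB = 1024 ** 2
for name, value in estimate_memory_bytes(108e6).items():  # bert-base-cased
    print(f"{name}: {value / MiB:,.0f} MiB")
# model/gradients ~412 MiB, backward ~824 MiB, optimizer_step ~1,648 MiB (~1.61 GB),
# which lines up with the float32 row above
```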
|
|
|
|
|
|
|
This works fine for small models; we have cards with anywhere from 12-24 GB of GPU memory (on the GPU-poor side).
|
|
|
But what happens as we scale? |
|
|
|
Here's `llama-3-8B` (8.03B parameters) |
|
|
|
::: {style="font-size: 50%;"} |
|
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest | |
|
|---------|:-----|:------:|:------:|:------:|:------:| |
|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB | |
|
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB | |
|
::: |
|
Well, *I* don't have 56GB of GPU memory in a single card, let alone 112GB. |
|
|
|
What can we do? |
|
|
|
|
|
|
|
|
|
|
|
* Single GPU: |
|
* No distributed techniques at play |
|
* Distributed Data Parallelism (DDP): |
|
* A full copy of the model exists on each device, but data is chunked between each GPU |
|
* Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS): |
|
* Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs |
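
As a very rough sketch of how DDP and FSDP differ at the PyTorch level (🤗 Accelerate sets all of this up for you; assumes a `torchrun --nproc_per_node=2` style launch so the distributed environment variables exist):

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# DDP: every rank keeps a full replica of the weights; only the data is split
ddp_model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])

# FSDP: parameters, gradients, and optimizer states are sharded across ranks
fsdp_model = FSDP(torch.nn.Linear(1024, 1024).cuda())
```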
|
|
|
|
|
|
|
|
|
|
|
|
|
![](fsdp.png) |
|
|
|
:::{.notes} |
|
* Take the model and split it across `n` GPUs |
|
* During the backward pass, each GPU computes the gradients for its own shard

* The gradients are then synchronized across GPUs to assemble the full model gradient

* The optimizer step can then be performed
|
::: |
|
|
|
|
|
|
|
* Different parameters dictate how much GPU memory is needed in total when training across multiple GPUs

* These include how model weights are sharded, how gradients are handled, and more

* I'll cover the important ones I needed when doing a full fine-tune of Llama-3-8B *without PEFT* on 2x 4090s
|
|
|
|
|
|
|
* Dictates how much of the training state gets divvied up (sharded) across the GPUs
|
* `FULL_SHARD`: Includes optimizer states, gradients, and parameters |
|
* `SHARD_GRAD_OP`: Includes optimizer states and gradients |
|
* `NO_SHARD`: Normal DDP |
|
* `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters but each node has the full model |
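
A minimal sketch of picking a strategy through 🤗 Accelerate's FSDP plugin in code (assuming an Accelerate version that exposes `sharding_strategy` on the plugin; normally you'd just set `fsdp_sharding_strategy` in the config file):

```python
from accelerate import Accelerator, FullyShardedDataParallelPlugin
from torch.distributed.fsdp import ShardingStrategy

# FULL_SHARD splits parameters, gradients, and optimizer states across GPUs;
# swap in SHARD_GRAD_OP, NO_SHARD, or HYBRID_SHARD to change the trade-off.
fsdp_plugin = FullyShardedDataParallelPlugin(
    sharding_strategy=ShardingStrategy.FULL_SHARD,
)
# Run via `accelerate launch` so the distributed environment exists
accelerator = Accelerator(fsdp_plugin=fsdp_plugin)
```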
|
|
|
:::{.notes} |
|
FULL_SHARD: |
|
Parameters, Gradients, Optimizer States: All are sharded. |
|
Parameters Handling: Unshard before forward pass, reshard after forward pass, unshard before backward pass, reshard after backward pass. |
|
Gradients Handling: Synchronize and shard after backward pass. |
|
Optimizer States: Updated locally per rank. |
|
|
|
SHARD_GRAD_OP: |
|
Gradients and Optimizer States: Sharded during computation. |
|
Parameters: Unshard before forward pass, remain unsharded during forward pass, reshard after backward pass. |
|
Inside no_sync(): Parameters are not resharded after backward computation. |
|
Optimizer States: Updated locally per rank. |
|
|
|
NO_SHARD: |
|
Parameters, Gradients, Optimizer States: Not sharded, replicated across ranks. |
|
Gradients Handling: Synchronized via all-reduce after backward pass. |
|
Optimizer States: Updated locally per rank. |
|
|
|
HYBRID_SHARD: |
|
Parameters, Gradients, Optimizer States: Combines FULL_SHARD within a node and replicates parameters across nodes. |
|
Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models. |
|
::: |
|
|
|
|
|
|
|
* How the model should be split |
|
* Can be either `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP` |
|
* `TRANSFORMER`/`fsdp_transformer_layer_cls_to_wrap`:
|
* Need to declare the layer |
|
* Generally `transformers` has good defaults |
|
* `SIZE`/`fsdp_min_num_params`:
|
  * Minimum number of parameters a module needs before it's wrapped into its own shard
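
Under the hood these map roughly onto PyTorch's FSDP wrap policies; a sketch for illustration only (Accelerate builds the policy for you from the config, and `LlamaDecoderLayer` is just an example layer class):

```python
import functools
from torch.distributed.fsdp.wrap import (
    size_based_auto_wrap_policy,
    transformer_auto_wrap_policy,
)
from transformers.models.llama.modeling_llama import LlamaDecoderLayer

# TRANSFORMER_BASED_WRAP: wrap each block of the declared layer class
transformer_policy = functools.partial(
    transformer_auto_wrap_policy,
    transformer_layer_cls={LlamaDecoderLayer},
)

# SIZE_BASED_WRAP: wrap any submodule whose parameter count exceeds the minimum
size_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=100_000_000,
)
```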
|
|
|
|
|
* Offloads the parameters and gradients to the CPU if they can't fit into memory |
|
* Allows you to train much larger models locally, but will be much slower |
|
|
|
> Case: a full fine-tune of Llama-3-8B with `fsdp_offload_params` on 2x 4090 GPUs took ~72 hrs, vs. roughly an hour or two on 1x H100
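
In raw PyTorch FSDP terms, this corresponds to `CPUOffload` (a sketch; with 🤗 Accelerate you only flip `fsdp_offload_params` in the config):

```python
import torch
from torch.distributed.fsdp import CPUOffload, FullyShardedDataParallel as FSDP

def wrap_with_offload(model: torch.nn.Module) -> FSDP:
    # Parameters (and their gradients) live in CPU RAM and are streamed to the
    # GPU as needed: much bigger models fit, at a significant speed cost.
    return FSDP(model, cpu_offload=CPUOffload(offload_params=True))
```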
|
|
|
|
|
|
|
* Uses the idea behind big model inference/the `meta` device to load the model onto the GPU in a low-RAM scenario
|
* Rather than needing `model_size` * `n_gpus` RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via `sync_module_states` |
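
A sketch of the underlying `meta`-device idea using Accelerate's big-model utilities (the FSDP integration does the equivalent for you when this option is enabled; `bert-base-cased` is just a small stand-in here):

```python
from accelerate import init_empty_weights
from transformers import AutoConfig, AutoModel

config = AutoConfig.from_pretrained("bert-base-cased")

# The skeleton is built on the `meta` device, so no memory is allocated for the
# weights yet. Only one rank then loads the real weights, and
# `sync_module_states` broadcasts them out to the other shards.
with init_empty_weights():
    model = AutoModel.from_config(config)
```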
|
|
|
|
|
|
|
|
|
|
|
* So far we've covered the theory, but how do we put it into practice?
|
* By using a library that's at the heart of the entire open-source ecosystem |
|
|
|
::: {style="font-size: 60%;padding-left:10%;padding-top:0%;"} |
|
* Nearly all of 🤗
|
* `axolotl` |
|
* `fastai` |
|
* `FastChat` |
|
* `lucidrains` |
|
* `kornia` |
|
::: |
|
|
|
Are you already using it without even knowing?
|
|
|
|
|
|
|
```{mermaid} |
|
%%| fig-height: 6 |
|
graph LR |
|
A(("π€ Accelerate#32;")) |
|
A --> B["CLI Interface#32;"] |
|
A --> C["Training Library#32;"] |
|
A --> D["Big Model<br>Inference#32;"] |
|
``` |
|
|
|
## A CLI Interface |
|
|
|
* `accelerate config` |
|
* Configure the environment |
|
* `accelerate estimate-memory` |
|
* How to guess vRAM requirements |
|
* `accelerate launch` |
|
* How to run your script |
|
|
|
## Launching distributed training is hard |
|
|
|
- ```bash |
|
python script.py |
|
``` |
|
|
|
- ```bash |
|
torchrun --nnodes=1 --nproc_per_node=2 script.py |
|
``` |
|
|
|
- ```bash |
|
deepspeed --num_gpus=2 script.py |
|
``` |
|
|
|
How can we make this better? |
|
|
|
## `accelerate launch` |
|
```bash |
|
accelerate launch script.py |
|
``` |
|
|
|
## `accelerate config` |
|
|
|
* Rely on `config.yaml` files |
|
* Either run `accelerate config` or write your own:
|
|
|
:::: {.columns style="font-size: 50%;padding-left:10%;"} |
|
::: {.column width="40%"} |
|
```{.yaml filename=ddp_config.yaml} |
|
compute_environment: LOCAL_MACHINE |
|
distributed_type: MULTI_GPU |
|
main_training_function: main |
|
mixed_precision: bf16 |
|
num_machines: 1 |
|
num_processes: 8 |
|
``` |
|
::: |
|
|
|
::: {.column width="40%"} |
|
```{.yaml filename=fsdp_config.yaml} |
|
compute_environment: LOCAL_MACHINE |
|
distributed_type: FSDP |
|
fsdp_config: |
|
fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP |
|
fsdp_backward_prefetch: BACKWARD_PRE |
|
fsdp_cpu_ram_efficient_loading: true |
|
fsdp_forward_prefetch: false |
|
fsdp_offload_params: false |
|
fsdp_sharding_strategy: FULL_SHARD |
|
fsdp_state_dict_type: SHARDED_STATE_DICT |
|
fsdp_sync_module_states: true |
|
fsdp_use_orig_params: false |
|
main_training_function: main |
|
mixed_precision: bf16 |
|
num_machines: 1 |
|
num_processes: 8 |
|
``` |
|
::: |
|
:::: |
|
|
|
# A Training Library |
|
|
|
## A Training Library: The Code |
|
|
|
:::: {.columns style="font-size: 50%;"} |
|
::: {.column} |
|
<br><br><br> |
|
```{.python code-line-numbers="5-6,9"} |
|
# For alignment purposes |
|
for batch in dataloader: |
|
optimizer.zero_grad() |
|
inputs, targets = batch |
|
inputs = inputs.to(device) |
|
targets = targets.to(device) |
|
outputs = model(inputs) |
|
loss = loss_function(outputs, targets) |
|
loss.backward() |
|
optimizer.step() |
|
scheduler.step() |
|
``` |
|
::: |
|
::: {.column} |
|
```{.python code-line-numbers="1-7,12-13,16"} |
|
from accelerate import Accelerator |
|
accelerator = Accelerator() |
|
dataloader, model, optimizer, scheduler = (
|
accelerator.prepare( |
|
dataloader, model, optimizer, scheduler |
|
) |
|
) |
|
|
|
for batch in dataloader: |
|
optimizer.zero_grad() |
|
inputs, targets = batch |
|
# inputs = inputs.to(device) |
|
# targets = targets.to(device) |
|
outputs = model(inputs) |
|
loss = loss_function(outputs, targets) |
|
accelerator.backward(loss) # loss.backward() |
|
optimizer.step() |
|
scheduler.step() |
|
``` |
|
::: |
|
|
|
:::: |
|
|
|
## A Training Library: How Scaling Works |
|
|
|
* Accelerate's DataLoaders and schedulers work off of a sharding mindset |
|
* Rather than repeating the same data across `n` nodes, we instead split it |
|
* Speeds up training linearly |
|
* Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2 |
|
* This also means the scheduler will be stepped `n` times (once per GPU) for every "global step"
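
A quick illustration of that arithmetic:

```python
# Keeping the *global* batch size constant when scaling from 1 GPU to 8.
single_gpu_batch_size = 16
num_gpus = 8

per_gpu_batch_size = single_gpu_batch_size // num_gpus  # -> 2
global_batch_size = per_gpu_batch_size * num_gpus        # -> 16, unchanged

# Each GPU sees 1/8th of the data every step, and the scheduler advances
# `num_gpus` steps per "global step".
```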
|
|
|
## A Training Library: Mixed Precision |
|
|
|
* This may be a bit different than your "normal" idea of mixed precision. |
|
* We do **not** convert the model weights to BF16/FP16 |
|
* Instead we **wrap the forward pass** with `autocast` to convert the gradients automatically |
|
* This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on. |
|
* **If you convert the weights themselves to bf16 (e.g. with `.bfloat16()`), you are STUCK in bf16 permanently**
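
A small sketch of the difference in plain PyTorch (bf16 and a CUDA device assumed for illustration):

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()  # weights stay in float32
x = torch.randn(8, 1024, device="cuda")

# Mixed precision: only the forward computation runs in bf16; the master
# weights (and the optimizer step) remain in float32.
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
    out = model(x)

# By contrast, this casts the weights themselves: there is no float32 copy
# left to recover, which is what "stuck in bf16" means.
model_bf16 = model.to(torch.bfloat16)
```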
|
|
|
## A Training Library: Mixed Precision |
|
|
|
* Let's tie that back to the model estimator with neat tools like NVIDIA's TransformerEngine and MS-AMP
|
|
|
::: {style="font-size: 60%;"} |
|
| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States | |
|
| -- | -- | -- | -- | -- | -- | -- | |
|
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | |
|
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | |
|
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 | |
|
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 | |
|
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 | |
|
::: |
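
With 🤗 Accelerate, opting in is (roughly) a one-flag change, assuming an FP8-capable GPU and TransformerEngine or MS-AMP installed:

```python
from accelerate import Accelerator

# Compatible layers (e.g. Linear) run their matmuls in FP8 while the general
# forward pass stays wrapped in bf16, so most of the savings are on gradients.
accelerator = Accelerator(mixed_precision="fp8")
```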
|
|
|
:::{.notes} |
|
|
|
What is actually happening: |
|
* Linear Layers and other certain compatible layers are wrapped in a special version that allows for FP8 computation |
|
* The general forward pass is wrapped around BF16 |
|
* This means most of the memory savings come from the gradients, *not* the model weights themselves.

* With tools like `MS-AMP` we can convert more chunks into lower precision, but as before, training stays stable when the model's weights are kept in full precision and backprop happens in full precision too.
|
|
|
::: |
|
|
|
|
|
|
|
* Extremely similar; they mostly differ in naming conventions and in slight implementation tweaks
|
|
|
::: {style="font-size: 50%;"} |
|
Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) |
|
--|--|--|--|--|-- |
|
FSDP | bf16 | default (none) | bf16 | bf16 | bf16 |
|
FSDP | bf16 | bf16 | fp32 | bf16 | fp32 |
|
DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 |
|
::: |
|
|
|
To learn more, check out the [documentation](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) or join my office hours |
|
|
|
## Key Takeaways: |
|
|
|
* You can scale out training with `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models |
|
* Techniques like `FP8` can help speed up training some and reduce computational overhead |
|
  * Comes at a cost to end precision and can lock the model weights for further fine-tunes if you're not careful
|
|
|
## Some Handy Resources |
|
|
|
- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)

- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)

- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)

- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)

- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)

- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)

- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
|
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) |