---
title: "Scaling Model Training with More Compute, How Do They Do It?"

format: 
    revealjs:
        theme: moon
        fig-format: png
---

## Who am I?

- Zachary Mueller
- Technical Lead for the 🤗 Accelerate project
- API design geek

## Understanding GPU Usage

- We can somewhat estimate the memory usage in vanilla full-fine-tuning of models
- Requires certain assumptions (that I'll be covering):
    - Adam optimizer
    - Batch size of 1

## Understanding GPU Usage

General estimate (`bert-base-cased`, 108M params):

- Each parameter is 4 bytes
- Backward ~= 2x the model size
- The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer):

::: {style="font-size: 50%;"}
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---------|:-----|:------:|:------:|:------:|:------:|
| float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB |
| float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB |

*All estimations were based off the [Model Estimator Tool](https://huggingface.co/spaces/hf-accelerate/model-memory-usage)
:::

## Understanding GPU Usage

This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side).

But what happens as we scale?

Here's `llama-3-8B` (8.03B parameters)

::: {style="font-size: 50%;"}
| dtype | Model | Gradients | Backward pass | Optimizer step | Highest |
|---------|:-----|:------:|:------:|:------:|:------:|
| float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB |
| float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB |
:::
Well, *I* don't have 56GB of GPU memory in a single card, let alone 112GB.

What can we do?

# Distributed Training

## Kinds of Training

* Single GPU:
  * No distributed techniques at play
* Distributed Data Parallelism (DDP): 
  * A full copy of the model exists on each device, but data is chunked between each GPU
* Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS):
  * Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs


# Fully Sharded Data Parallelism

## Fully Sharded Data Parallelism

![](fsdp.png)

:::{.notes}
* Take the model and split it across `n` GPUs
* Each GPU computes the shard's gradients
* At the end, all gradients are synchronized and the final full model gradient is calculated
* The backward pass can then be performed
:::

## FSDP: Getting parameter specific

* Different parameters can dicatate how much memory is needed for total GPU training across multiple GPUs
* These include how model weights are sharded, gradients, and more.
* I'll cover some important ones I needed when doing a Full-Fine-Tune of Llama-3-8B *without PEFT* on 2x4090's

## `sharding_strategy`

* Dictates the level of divving resources to perform
  * `FULL_SHARD`: Includes optimizer states, gradients, and parameters
  * `SHARD_GRAD_OP`: Includes optimizer states and gradients
  * `NO_SHARD`: Normal DDP
  * `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters but each node has the full model

  :::{.notes}
      FULL_SHARD:
        Parameters, Gradients, Optimizer States: All are sharded.
        Parameters Handling: Unshard before forward pass, reshard after forward pass, unshard before backward pass, reshard after backward pass.
        Gradients Handling: Synchronize and shard after backward pass.
        Optimizer States: Updated locally per rank.

    SHARD_GRAD_OP:
        Gradients and Optimizer States: Sharded during computation.
        Parameters: Unshard before forward pass, remain unsharded during forward pass, reshard after backward pass.
        Inside no_sync(): Parameters are not resharded after backward computation.
        Optimizer States: Updated locally per rank.

    NO_SHARD:
        Parameters, Gradients, Optimizer States: Not sharded, replicated across ranks.
        Gradients Handling: Synchronized via all-reduce after backward pass.
        Optimizer States: Updated locally per rank.

    HYBRID_SHARD:
        Parameters, Gradients, Optimizer States: Combines FULL_SHARD within a node and replicates parameters across nodes.
        Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models.
:::

## `auto_wrap_policy`:

* How the model should be split
* Can be either `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP`
* `TRANSFORMER`/`fsdp_transformers_layer_cls_to_wrap`:
    * Need to declare the layer
    * Generally `transformers` has good defaults
* `SIZE`/`fsdp_min_num_param`:
    * Number of total parameters in a shard

## `offload_params`:
* Offloads the parameters and gradients to the CPU if they can't fit into memory
* Allows you to train much larger models locally, but will be much slower

> Case: FFT of Llama-3-8B with `fsdp_offload_params` on 2x4090 GPUs was 72hrs, vs ~an hour or two when using 1xH100

## `cpu_ram_efficient_loading` and `sync_module_states`

* Uses the idea behind big model inference/the `meta` device to load in the model to the GPU in a low-ram scenario
* Rather than needing `model_size` * `n_gpus` RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via `sync_module_states`

# Tying this to 🤗 Accelerate

## Tying this to 🤗 Accelerate

* So far we've covered the theory, but how do we put it into practice
* By using a library that's at the heart of the entire open-source ecosystem

::: {style="font-size: 60%;padding-left:10%;padding-top:0%;"}
* Nearly all of 🤗
* `axolotl`
* `fastai`
* `FastChat`
* `lucidrains`
* `kornia`
:::

Are you using it and you don't even know?

## What is 🤗 Accelerate?

```{mermaid}
%%| fig-height: 6
graph LR
    A(("🤗 Accelerate#32;"))
    A --> B["CLI Interface#32;"]
    A --> C["Training Library#32;"]
    A --> D["Big Model<br>Inference#32;"]
```

## A CLI Interface

* `accelerate config`
  * Configure the environment
* `accelerate estimate-memory`
  * How to guess vRAM requirements
* `accelerate launch`
  * How to run your script

## Launching distributed training is hard

- ```bash 
python script.py
```

- ```bash 
torchrun --nnodes=1 --nproc_per_node=2 script.py
```

- ```bash 
deepspeed --num_gpus=2 script.py
```

How can we make this better?

## `accelerate launch`
```bash
accelerate launch script.py
```

## `accelerate config`

* Rely on `config.yaml` files
* Choose to either running `accelerate config` or write your own:

:::: {.columns style="font-size: 50%;padding-left:10%;"}
::: {.column width="40%"}
```{.yaml filename=ddp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: MULTI_GPU
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::

::: {.column width="40%"}
```{.yaml filename=fsdp_config.yaml}
compute_environment: LOCAL_MACHINE
distributed_type: FSDP
fsdp_config:
  fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP
  fsdp_backward_prefetch: BACKWARD_PRE
  fsdp_cpu_ram_efficient_loading: true
  fsdp_forward_prefetch: false
  fsdp_offload_params: false
  fsdp_sharding_strategy: FULL_SHARD
  fsdp_state_dict_type: SHARDED_STATE_DICT
  fsdp_sync_module_states: true
  fsdp_use_orig_params: false
main_training_function: main
mixed_precision: bf16
num_machines: 1
num_processes: 8
```
:::
::::

# A Training Library

## A Training Library: The Code

:::: {.columns style="font-size: 50%;"}
::: {.column}
<br><br><br>
```{.python code-line-numbers="5-6,9"}
# For alignment purposes
for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    inputs = inputs.to(device)
    targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    loss.backward()
    optimizer.step()
    scheduler.step()
```
:::
::: {.column}
```{.python code-line-numbers="1-7,12-13,16"}
from accelerate import Accelerator
accelerator = Accelerator()
dataloader, model, optimizer scheduler = (
    accelerator.prepare(
        dataloader, model, optimizer, scheduler
    )
)

for batch in dataloader:
    optimizer.zero_grad()
    inputs, targets = batch
    # inputs = inputs.to(device)
    # targets = targets.to(device)
    outputs = model(inputs)
    loss = loss_function(outputs, targets)
    accelerator.backward(loss) # loss.backward()
    optimizer.step()
    scheduler.step()
```
:::

::::

## A Training Library: How Scaling Works

* Accelerate's DataLoaders and schedulers work off of a sharding mindset
* Rather than repeating the same data across `n` nodes, we instead split it
* Speeds up training linearly
* Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2
* This also means the scheduler will be stepped `n` GPUs at a time per "global step"

## A Training Library: Mixed Precision

* This may be a bit different than your "normal" idea of mixed precision.
* We do **not** convert the model weights to BF16/FP16
* Instead we **wrap the forward pass** with `autocast` to convert the gradients automatically
* This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on.
* **If you use `.bf16()` weights, you are STUCK in bf16 perminantly**

## A Training Library: Mixed Precision

* Let's tie that back up to the model estimator with neat tools like NVIDIA's TransformerEngine

::: {style="font-size: 60%;"}
| Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States |
| -- | -- | -- | -- | -- | -- | -- |
| FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 |
| MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 |
| MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 |
| MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 |
:::

:::{.notes}

What is actually happening:
* Linear Layers and other certain compatible layers are wrapped in a special version that allows for FP8 computation
* The general forward pass is wrapped around BF16
* This means that the most memory saved is done during the gradients of the model, *not* the model itself. 
* With tools like `MS-AMP` we can convert more chunks into lower precision, but again like before stable training occurs when the models weights are in full precision and the backprop happens in full precision too. 

:::

## DeepSpeed vs Fully Sharded Data Parallelism

* Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation

::: {style="font-size: 50%;"}
Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local)
--|--|--|--|--|--
FSDP | bf16 | default (none) | bf16 | bf16 | bf16
FSDP | bf16 | bf16 | fp32 | bf16 | fp32
DeepSpeed   | bf16 | bf16 | fp32 | bf16 | fp32
:::

To learn more, check out the [documentation](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) or join my office hours

## Key Takeaways:

* You can scale out training with `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models
* Techniques like `FP8` can help speed up training some and reduce computational overhead
* Comes at a cost of end-precision and locking model weights for futher fine-tunes if not careful

## Some Handy Resources

- [🤗 Accelerate documentation](https://hf.co/docs/accelerate)
- [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch)
- [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook)
- [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration)
- [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling)
- [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed)
- [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp)
- [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)