--- title: "Scaling Model Training with More Compute, How Do They Do It?" format: revealjs: theme: moon fig-format: png --- ## Who am I? - Zachary Mueller - Technical Lead for the 🤗 Accelerate project - API design geek ## Understanding GPU Usage - We can somewhat estimate the memory usage in vanilla full-fine-tuning of models - Requires certain assumptions (that I'll be covering): - Adam optimizer - Batch size of 1 ## Understanding GPU Usage General estimate (`bert-base-cased`, 108M params): - Each parameter is 4 bytes - Backward ~= 2x the model size - The optimizer step ~= 4x the model size (1x model, 1x gradients, 2x optimizer): ::: {style="font-size: 50%;"} | dtype | Model | Gradients | Backward pass | Optimizer step | Highest | |---------|:-----|:------:|:------:|:------:|:------:| | float32 | 413.18 MB | 413.18 MB | 826.36 MB | 1.61 GB | 1.61 GB | | float16 | 413.18 MB* | 619.77 MB | 826.36 MB | 826.36 MB | 826.36 MB | *All estimations were based off the [Model Estimator Tool](https://huggingface.co/spaces/hf-accelerate/model-memory-usage) ::: ## Understanding GPU Usage This works fine for small models, we have cards with anywhere from 12-24GB of GPU memory (on the GPU-poor side). But what happens as we scale? Here's `llama-3-8B` (8.03B parameters) ::: {style="font-size: 50%;"} | dtype | Model | Gradients | Backward pass | Optimizer step | Highest | |---------|:-----|:------:|:------:|:------:|:------:| | float32 | 28.21 GB | 28.21 GB | 56.43 GB | 112.84 GB | 112.84 GB | | float16 | 28.21 GB* | 42.32 GB | 56.43 GB | 56.43 GB | 56.43 GB | ::: Well, *I* don't have 56GB of GPU memory in a single card, let alone 112GB. What can we do? # Distributed Training ## Kinds of Training * Single GPU: * No distributed techniques at play * Distributed Data Parallelism (DDP): * A full copy of the model exists on each device, but data is chunked between each GPU * Fully Sharded Data Parallelism (FSDP) & DeepSpeed (DS): * Split chunks of the model and optimizer states across GPUs, allowing for training bigger models on smaller (multiple) GPUs # Fully Sharded Data Parallelism ## Fully Sharded Data Parallelism ![](fsdp.png) :::{.notes} * Take the model and split it across `n` GPUs * Each GPU computes the shard's gradients * At the end, all gradients are synchronized and the final full model gradient is calculated * The backward pass can then be performed ::: ## FSDP: Getting parameter specific * Different parameters can dicatate how much memory is needed for total GPU training across multiple GPUs * These include how model weights are sharded, gradients, and more. * I'll cover some important ones I needed when doing a Full-Fine-Tune of Llama-3-8B *without PEFT* on 2x4090's ## `sharding_strategy` * Dictates the level of divving resources to perform * `FULL_SHARD`: Includes optimizer states, gradients, and parameters * `SHARD_GRAD_OP`: Includes optimizer states and gradients * `NO_SHARD`: Normal DDP * `HYBRID_SHARD`: Includes optimizer states, gradients, and parameters but each node has the full model :::{.notes} FULL_SHARD: Parameters, Gradients, Optimizer States: All are sharded. Parameters Handling: Unshard before forward pass, reshard after forward pass, unshard before backward pass, reshard after backward pass. Gradients Handling: Synchronize and shard after backward pass. Optimizer States: Updated locally per rank. SHARD_GRAD_OP: Gradients and Optimizer States: Sharded during computation. Parameters: Unshard before forward pass, remain unsharded during forward pass, reshard after backward pass. Inside no_sync(): Parameters are not resharded after backward computation. Optimizer States: Updated locally per rank. NO_SHARD: Parameters, Gradients, Optimizer States: Not sharded, replicated across ranks. Gradients Handling: Synchronized via all-reduce after backward pass. Optimizer States: Updated locally per rank. HYBRID_SHARD: Parameters, Gradients, Optimizer States: Combines FULL_SHARD within a node and replicates parameters across nodes. Communication: Expensive operations like all-gathers and reduce-scatters are limited to within a node, enhancing performance for medium-sized models. ::: ## `auto_wrap_policy`: * How the model should be split * Can be either `TRANSFORMER_BASED_WRAP` or `SIZE_BASED_WRAP` * `TRANSFORMER`/`fsdp_transformers_layer_cls_to_wrap`: * Need to declare the layer * Generally `transformers` has good defaults * `SIZE`/`fsdp_min_num_param`: * Number of total parameters in a shard ## `offload_params`: * Offloads the parameters and gradients to the CPU if they can't fit into memory * Allows you to train much larger models locally, but will be much slower > Case: FFT of Llama-3-8B with `fsdp_offload_params` on 2x4090 GPUs was 72hrs, vs ~an hour or two when using 1xH100 ## `cpu_ram_efficient_loading` and `sync_module_states` * Uses the idea behind big model inference/the `meta` device to load in the model to the GPU in a low-ram scenario * Rather than needing `model_size` * `n_gpus` RAM, we can load the model on a single node and then send the weights directly to each shard when the time is right via `sync_module_states` # Tying this to 🤗 Accelerate ## Tying this to 🤗 Accelerate * So far we've covered the theory, but how do we put it into practice * By using a library that's at the heart of the entire open-source ecosystem ::: {style="font-size: 60%;padding-left:10%;padding-top:0%;"} * Nearly all of 🤗 * `axolotl` * `fastai` * `FastChat` * `lucidrains` * `kornia` ::: Are you using it and you don't even know? ## What is 🤗 Accelerate? ```{mermaid} %%| fig-height: 6 graph LR A(("🤗 Accelerate#32;")) A --> B["CLI Interface#32;"] A --> C["Training Library#32;"] A --> D["Big Model
Inference#32;"] ``` ## A CLI Interface * `accelerate config` * Configure the environment * `accelerate estimate-memory` * How to guess vRAM requirements * `accelerate launch` * How to run your script ## Launching distributed training is hard - ```bash python script.py ``` - ```bash torchrun --nnodes=1 --nproc_per_node=2 script.py ``` - ```bash deepspeed --num_gpus=2 script.py ``` How can we make this better? ## `accelerate launch` ```bash accelerate launch script.py ``` ## `accelerate config` * Rely on `config.yaml` files * Choose to either running `accelerate config` or write your own: :::: {.columns style="font-size: 50%;padding-left:10%;"} ::: {.column width="40%"} ```{.yaml filename=ddp_config.yaml} compute_environment: LOCAL_MACHINE distributed_type: MULTI_GPU main_training_function: main mixed_precision: bf16 num_machines: 1 num_processes: 8 ``` ::: ::: {.column width="40%"} ```{.yaml filename=fsdp_config.yaml} compute_environment: LOCAL_MACHINE distributed_type: FSDP fsdp_config: fsdp_auto_wrap_policy: TRANSFORMER_BASED_WRAP fsdp_backward_prefetch: BACKWARD_PRE fsdp_cpu_ram_efficient_loading: true fsdp_forward_prefetch: false fsdp_offload_params: false fsdp_sharding_strategy: FULL_SHARD fsdp_state_dict_type: SHARDED_STATE_DICT fsdp_sync_module_states: true fsdp_use_orig_params: false main_training_function: main mixed_precision: bf16 num_machines: 1 num_processes: 8 ``` ::: :::: # A Training Library ## A Training Library: The Code :::: {.columns style="font-size: 50%;"} ::: {.column}


```{.python code-line-numbers="5-6,9"} # For alignment purposes for batch in dataloader: optimizer.zero_grad() inputs, targets = batch inputs = inputs.to(device) targets = targets.to(device) outputs = model(inputs) loss = loss_function(outputs, targets) loss.backward() optimizer.step() scheduler.step() ``` ::: ::: {.column} ```{.python code-line-numbers="1-7,12-13,16"} from accelerate import Accelerator accelerator = Accelerator() dataloader, model, optimizer scheduler = ( accelerator.prepare( dataloader, model, optimizer, scheduler ) ) for batch in dataloader: optimizer.zero_grad() inputs, targets = batch # inputs = inputs.to(device) # targets = targets.to(device) outputs = model(inputs) loss = loss_function(outputs, targets) accelerator.backward(loss) # loss.backward() optimizer.step() scheduler.step() ``` ::: :::: ## A Training Library: How Scaling Works * Accelerate's DataLoaders and schedulers work off of a sharding mindset * Rather than repeating the same data across `n` nodes, we instead split it * Speeds up training linearly * Given a batch size of 16 on a single GPU, to recreate this across 8 GPUs you would use a batch size of 2 * This also means the scheduler will be stepped `n` GPUs at a time per "global step" ## A Training Library: Mixed Precision * This may be a bit different than your "normal" idea of mixed precision. * We do **not** convert the model weights to BF16/FP16 * Instead we **wrap the forward pass** with `autocast` to convert the gradients automatically * This preserves the original precision of the weights, which leads to stable training and better fine-tuning later on. * **If you use `.bf16()` weights, you are STUCK in bf16 perminantly** ## A Training Library: Mixed Precision * Let's tie that back up to the model estimator with neat tools like NVIDIA's TransformerEngine ::: {style="font-size: 60%;"} | Optimization Level | Computation (GEMM) | Comm | Weight | Master Weight | Weight Gradient | Optimizer States | | -- | -- | -- | -- | -- | -- | -- | | FP16 AMP | FP16 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | | Nvidia TE | FP8 | FP32 | FP32 | N/A | FP32 | FP32+FP32 | | MS-AMP O1 | FP8 | FP8 | FP16 | N/A | FP8 | FP32+FP32 | | MS-AMP O2 | FP8 | FP8 | FP16 | N/A | FP8 | FP8+FP16 | | MS-AMP O3 | FP8 | FP8 | FP8 | FP16 | FP8 | FP8+FP16 | ::: :::{.notes} What is actually happening: * Linear Layers and other certain compatible layers are wrapped in a special version that allows for FP8 computation * The general forward pass is wrapped around BF16 * This means that the most memory saved is done during the gradients of the model, *not* the model itself. * With tools like `MS-AMP` we can convert more chunks into lower precision, but again like before stable training occurs when the models weights are in full precision and the backprop happens in full precision too. ::: ## DeepSpeed vs Fully Sharded Data Parallelism * Extremely similar, however mostly used different naming conventions for items and slight tweaks in the implementation ::: {style="font-size: 50%;"} Framework | Model Loading (`torch_dtype`) | Mixed Precision | Preparation (Local) | Training | Optimizer (Local) --|--|--|--|--|-- FSDP | bf16 | default (none) | bf16 | bf16 | bf16 FSDP | bf16 | bf16 | fp32 | bf16 | fp32 DeepSpeed | bf16 | bf16 | fp32 | bf16 | fp32 ::: To learn more, check out the [documentation](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed) or join my office hours ## Key Takeaways: * You can scale out training with `accelerate`, FSDP, and DeepSpeed across multiple GPUs to train bigger models * Techniques like `FP8` can help speed up training some and reduce computational overhead * Comes at a cost of end-precision and locking model weights for futher fine-tunes if not careful ## Some Handy Resources - [🤗 Accelerate documentation](https://hf.co/docs/accelerate) - [Launching distributed code](https://huggingface.co/docs/accelerate/basic_tutorials/launch) - [Distributed code and Jupyter Notebooks](https://huggingface.co/docs/accelerate/basic_tutorials/notebook) - [Migrating to 🤗 Accelerate easily](https://huggingface.co/docs/accelerate/basic_tutorials/migration) - [Big Model Inference tutorial](https://huggingface.co/docs/accelerate/usage_guides/big_modeling) - [DeepSpeed and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/deepspeed) - [Fully Sharded Data Parallelism and 🤗 Accelerate](https://huggingface.co/docs/accelerate/usage_guides/fsdp) - [FSDP vs DeepSpeed In-Depth](https://huggingface.co/docs/accelerate/concept_guides/fsdp_and_deepspeed)