lvwerra (HF staff) committed on
Commit
44ecf71
·
verified ·
1 Parent(s): 36cded8
Files changed (2)
  1. dist/index.html +26 -35
  2. src/index.html +26 -35
dist/index.html CHANGED
@@ -2405,53 +2405,45 @@
2405
  <h2>Conclusion</h2>
2406
 
2407
 
2408
- <p>Congratulations! You've completed quite a journey - from understanding how to train a simple model on a single GPU, all the way to mastering the complex techniques used to efficiently train massive language models like Llama-405B and DeepSeek-V3. By now, you should feel confident interpreting advanced parallelism diagrams like the one below, which would have seemed daunting when you first started.</p>
2409
 
2410
  <p><img alt="image.png" src="/assets/images/conclusion_llama3_parallelism.png" /></p>
2411
 
2412
- <p>In distributed training, many concepts sound easy enough when you first hear them, like “Pipeline parallelism just distributes layers on different GPUs”, but we also worked through all the challenging details when implementing those methods. </p>
2413
 
2414
- <p>However, not only did you learn something in the process, but we also want to share some insights we gained along the way, as well as give you ideas on what to work on next if you want to gain more experience in distributed training.</p>
2415
-
2416
- <p>Let’s start with a brief recap of all the things we covered in these past hours and days!</p>
2417
-
2418
- <h3>What you learned</h3>
2419
-
2420
- <p>Working through this whole blog post you mastered a ranged of concepts:</p>
2421
-
2422
- <ul>
2423
- <li>Basic principle of model training</li>
2424
- <li>Collective communication primitives </li>
2425
- <li>Memory anatomy of a LLM</li>
2426
- <li>Distributed training with DP and ZeRO </li>
2427
- <li>Model parallelism with TP, SP, CP and PP</li>
2428
- <li>Fast kernels and mixed precision training</li>
2429
- <li>Overlapping communication and computation</li>
2430
- <li>Profiling distributed training</li>
2431
- </ul>
2432
 
2433
- <p>Furthermore, you saw code implementations of most methods and how to benchmark a distributed training. But it hasn’t been only a learning experience for you, also we learned a thing or two!</p>
2434
 
2435
  <h3>What we learned</h3>
2436
 
2437
- <p>Running benchmarks on a cluster turned out to be much more challenging than we initially expected! What seemed like straightforward tasks often became complex debugging sessions:
 
 
 
 
 
2438
  </p>
2439
 
2440
  <ul>
2441
  <li>PyTorch processes would sometimes fail to clean up properly</li>
2442
  <li>Slurm job manager would forcefully terminate jobs, leading to node failures </li>
2443
  <li>Simple benchmarks that should take minutes would stretch into hours</li>
2444
- <li>We had to spend significant time:</li>
2445
- <ul>
2446
- <li>Minimizing cluster restart times and optimize idle time</li>
2447
- <li>Analyzing detailed NCCL debug logs</li>
2448
- <li>Understand memory usage patterns and CUDA memory allocator behaviors</li>
2449
- <li>Improving pipeline parallelism performance on multi-node</li>
2450
- </ul>
 
 
 
2451
  </ul>
2452
 
2453
  <p>These challenges deserve their own story, but they taught us valuable lessons about the complexities of distributed training infrastructure. What looks simple in theory often requires careful attention to many moving parts in practice.</p>
2454
 
 
2455
  <p>Let's analyze the results of our benchmarks and understand how different configurations affect each other. All benchmarks were run with a sequence length of 4096 and a global batch size of 1M tokens. We'll look at two key visualizations that help illustrate our findings.
2456
  </p>
2457
 
@@ -2462,7 +2454,6 @@
2462
 
2463
  <p>To complement this, let's look at the relationships between different parameters:</p>
2464
 
2465
- <!-- <p><img alt="image.png" src="/assets/images/what_we_learnt_parallel_coordinates.html" /></p> -->
2466
  <iframe id="plotFrame" src="/assets/images/what_we_learnt_parallel_coordinates.html" height="540" width="1000" scrolling="no" frameborder="0"></iframe>
2467
 
2468
  <p>Parallel coordinates plot showing the relationship between different model parallelism configurations (Data Parallel degree, Tensor Parallel degree, Pipeline Parallel degree), training hyperparameters (gradient accumulation steps, micro batch size), ZeRO stage and the resulting Model FLOPs Utilization (MFU). Each line represents a different training configuration, with colors indicating the MFU value - warmer colors show higher efficiency.</p>
@@ -2476,19 +2467,19 @@
2476
  <li>Larger models present a different challenge. As model size increases, memory requirements grow substantially. This creates two scenarios with fewer nodes: either the model doesn't fit at all, or it barely fits but runs inefficiently due to operating near the GPU memory limits.</li>
2477
  <li>Our benchmarks demonstrate how performance heavily depends on implementation quality. When we first implemented both parallelism strategies, Tensor Parallelism (TP) outperformed Pipeline Parallelism (PP). After optimizing our PP code, it became the faster option. Now that we're improving the communication overlap in our TP implementation, we expect it to regain the performance lead.</li>
2478
  </ol>
 
 
2479
 
2480
- <p>These findings highlight the challenges of reproducing theoretical results in practice, especially given the limited availability of production training code. Through open-source projects like picotron and nanotron, we hope to make these distributed training techniques more accessible and foster collaboration on simpler, more efficient codebases that help researchers and practitioners make the most of their hardware resources.</p>
2481
-
2482
- <h3>What’s next?</h3>
2483
 
2484
- <p>You should have a good overview of all the distributed training concepts but there are still things to learn and details we couldn’t cover. To get deeper in the field we recommend doing some of the following steps:</p>
2485
  <ul>
2486
  <li>Carefully read some of the landmark or very recent papers. You can find a list of some of the most impactful papers in <a target="_self" href="#references" class="">References</a>.</li>
2487
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2488
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2489
  </ul>
2490
 
2491
- <p>We hope this blog helps you get started in distributed training or helps you to better understand methods that you may already be applying by using some distributed training frameworks.</p>
2492
 
2493
  <h2>References</h2>
2494
 
 
2405
  <h2>Conclusion</h2>
2406
 
2407
 
2408
+ <p>Congratulations, dear reader, you made it to the end! We've completed quite a journey: we started with understanding how to train a simple model on a single GPU and worked our way up to the intricate techniques used to efficiently train massive language models like Llama-405B and DeepSeek-V3 on thousands of GPUs. By now, you can read a diagram like Llama-3's 4D parallel setup with ease:</p>
2409
 
2410
  <p><img alt="image.png" src="/assets/images/conclusion_llama3_parallelism.png" /></p>
2411
 
2412
+ <p>Orchestrating large clusters of GPUs to train LLMs efficiently is no easy feat. We learned how to optimize computation and communication between GPUs so that they run at maximum utilization at all times. This involves choosing the right parallelization strategy for a given model and cluster size, overlapping communication and computation wherever possible, and writing custom kernels that take the hardware layout into account to perform each operation as fast as possible on the GPU.</p>
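+ <p>As a concrete reminder of what "overlapping communication and computation" looks like in code, here is a minimal PyTorch sketch (illustrative only, not the picotron or nanotron implementation; <code>buckets</code> and <code>compute_next</code> are made-up names): each gradient bucket's all-reduce is launched asynchronously so that NCCL moves data while the GPU keeps computing, and we only wait on the handles at the end.</p>
+ <pre><code>
+ import torch.distributed as dist
+ 
+ def backward_with_overlap(buckets):
+     """Toy overlap of gradient all-reduce with remaining backward compute.
+ 
+     `buckets` is a list of (grad_tensor, compute_next) pairs, where
+     `compute_next` stands in for the backward work of the next layer.
+     """
+     handles = []
+     for grad, compute_next in buckets:
+         # async_op=True returns immediately with a handle...
+         handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
+         # ...so this compute runs while NCCL transfers the gradients.
+         compute_next()
+     # Block only once there is no more compute left to overlap with.
+     for handle in handles:
+         handle.wait()
+ </code></pre>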
2413
 
2414
+ <p>You might still believe that this knowledge is a bit niche and only concerns the small set of people who pretrain LLMs. Historically that may have been true, but as models grow rapidly, even people who only want to fine-tune them increasingly need distributed training setups. So diving deeper into all things distributed might prove very timely.</p>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2415
 
2416
+ <p>This has been a long learning journey, but not just for you! Running thousands of benchmarks on a GPU cluster was more challenging than we anticipated, and we want to share a few highlights of that experience.</p>
2417
 
2418
  <h3>What we learned</h3>
2419
 
2420
+ <p>Our goal for this blog post was not only to discuss theory and implementations but also to provide actual data points. So the plan was simple: let's run every possible distributed configuration for every model across a range of cluster sizes (namely 1-64 nodes of 8xH100s). Even after excluding impossible configurations, we still needed to run thousands of experiments.</p>
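+ <p>To give a sense of the size of that search space, here is a back-of-the-envelope sketch of the kind of enumeration involved (the ranges and constraints below are illustrative, not our exact sweep script): we combine parallelism degrees and per-GPU settings, then keep only the layouts that actually tile the available GPUs.</p>
+ <pre><code>
+ from itertools import product
+ 
+ def candidate_configs(num_gpus, gpus_per_node=8):
+     """Illustrative sweep: keep only parallelism layouts that tile the cluster."""
+     degrees = [1, 2, 4, 8, 16, 32, 64]
+     configs = []
+     for dp, tp, pp, zero, mbs in product(degrees, degrees, degrees, [0, 1, 3], [1, 2, 4, 8]):
+         if dp * tp * pp != num_gpus:
+             continue  # the three degrees must multiply to the total GPU count
+         if gpus_per_node % tp != 0:
+             continue  # keep tensor parallelism within a node (NVLink domain)
+         configs.append(dict(dp=dp, tp=tp, pp=pp, zero=zero, mbs=mbs))
+     return configs
+ 
+ # e.g. 8 nodes of 8xH100 = 64 GPUs
+ print(len(candidate_configs(64)))
+ </code></pre>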
2421
+
2422
+ <aside>We want to take this opportunity to apologize to our co-workers for blocking most of the science cluster and, in turn, to forgive any threats that may have been whispered.</aside>
2423
+
2424
+ <p>
2425
+ On paper this sounds simple enough: we can easily launch big arrays of jobs on our cluster. However, the troubles began as soon as we launched the first batches:
2426
  </p>
2427
 
2428
  <ul>
2429
  <li>PyTorch processes would sometimes fail to clean up properly</li>
2430
  <li>Slurm job manager would forcefully terminate jobs, leading to node failures </li>
2431
  <li>Simple benchmarks that should take minutes would stretch into hours</li>
2432
+ <li>Some jobs would hang indefinitely (one mitigation is sketched after this list)</li>
2433
+ </ul>
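+ <p>One example of the guard rails we ended up adding (a generic sketch under our own naming, not our actual launcher): giving the process group an explicit timeout turns a silent hang into a loud failure that the job scheduler can detect and requeue.</p>
+ <pre><code>
+ import os
+ from datetime import timedelta
+ 
+ import torch.distributed as dist
+ 
+ def init_distributed_with_timeout(minutes=10):
+     """Fail fast instead of hanging forever on a stuck collective."""
+     # Surface asynchronous NCCL errors so the timeout can actually fire
+     # (the exact env var name differs across PyTorch versions).
+     os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
+     dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))
+ </code></pre>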
2434
+
2435
+ <p>Running all the experiments in a finite amount of time therefore required some additional engineering. In particular, we spent a significant amount of time on the following:</p>
2436
+
2437
+ <ul>
2438
+ <li>Minimizing cluster restart times and optimizing idle time</li>
2439
+ <li>Analyzing detailed NCCL debug logs</li>
2440
+ <li>Understanding memory usage patterns and CUDA memory allocator behavior (see the sketch after this list)</li>
2441
+ <li>Improving pipeline parallelism performance in multi-node setups</li>
2442
  </ul>
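+ <p>On the memory side, much of the insight came from PyTorch's built-in allocator statistics, combined with NCCL's own logs (<code>NCCL_DEBUG=INFO</code>) for the communication side. A minimal sketch of the kind of probe we mean (function and tag names are ours):</p>
+ <pre><code>
+ import torch
+ 
+ def log_memory(tag, device=0):
+     """Print allocator statistics to spot peaks and fragmentation."""
+     gib = 1024 ** 3
+     allocated = torch.cuda.memory_allocated(device) / gib  # live tensors
+     reserved = torch.cuda.memory_reserved(device) / gib    # held by the caching allocator
+     peak = torch.cuda.max_memory_allocated(device) / gib   # high-water mark
+     print(f"[{tag}] allocated={allocated:.2f} GiB "
+           f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
+     torch.cuda.reset_peak_memory_stats(device)
+ </code></pre>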
2443
 
2444
  <p>These challenges deserve their own story, but they taught us valuable lessons about the complexities of distributed training infrastructure. What looks simple in theory often requires careful attention to many moving parts in practice.</p>
2445
 
2446
+ <!--
2447
  <p>Let's analyze the results of our benchmarks and understand how different configurations affect each other. All benchmarks were run with a sequence length of 4096 and a global batch size of 1M tokens. We'll look at two key visualizations that help illustrate our findings.
2448
  </p>
2449
 
 
2454
 
2455
  <p>To complement this, let's look at the relationships between different parameters:</p>
2456
 
 
2457
  <iframe id="plotFrame" src="/assets/images/what_we_learnt_parallel_coordinates.html" height="540" width="1000" scrolling="no" frameborder="0"></iframe>
2458
 
2459
  <p>Parallel coordinates plot showing the relationship between different model parallelism configurations (Data Parallel degree, Tensor Parallel degree, Pipeline Parallel degree), training hyperparameters (gradient accumulation steps, micro batch size), ZeRO stage and the resulting Model FLOPs Utilization (MFU). Each line represents a different training configuration, with colors indicating the MFU value - warmer colors show higher efficiency.</p>
 
2467
  <li>Larger models present a different challenge. As model size increases, memory requirements grow substantially. This creates two scenarios with fewer nodes: either the model doesn't fit at all, or it barely fits but runs inefficiently due to operating near the GPU memory limits.</li>
2468
  <li>Our benchmarks demonstrate how performance heavily depends on implementation quality. When we first implemented both parallelism strategies, Tensor Parallelism (TP) outperformed Pipeline Parallelism (PP). After optimizing our PP code, it became the faster option. Now that we're improving the communication overlap in our TP implementation, we expect it to regain the performance lead.</li>
2469
  </ol>
2470
+ -->
2471
+ <p>Reproducing theoretical results in practice is challenging, especially given the limited availability of production training code. Through open-source projects like picotron and nanotron, we hope to make these distributed training techniques more accessible and foster collaboration on simpler, more efficient codebases that help researchers and practitioners make the most of their hardware resources.</p>
2472
 
2473
+ <h3>So, what’s next?</h3>
 
 
2474
 
2475
+ <p>You now have a good overview of the main distributed training concepts, but at the same time we have only scratched the surface of some of these topics. There are many ways to dive deeper into the field; here are some steps we recommend:</p>
2476
  <ul>
2477
  <li>Carefully read some of the landmark or very recent papers. You can find a list of some of the most impactful papers in <a target="_self" href="#references" class="">References</a>.</li>
2478
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2479
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2480
  </ul>
2481
 
2482
+ <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2483
 
2484
  <h2>References</h2>
2485
 
src/index.html CHANGED
@@ -2405,53 +2405,45 @@
2405
  <h2>Conclusion</h2>
2406
 
2407
 
2408
- <p>Congratulations! You've completed quite a journey - from understanding how to train a simple model on a single GPU, all the way to mastering the complex techniques used to efficiently train massive language models like Llama-405B and DeepSeek-V3. By now, you should feel confident interpreting advanced parallelism diagrams like the one below, which would have seemed daunting when you first started.</p>
2409
 
2410
  <p><img alt="image.png" src="/assets/images/conclusion_llama3_parallelism.png" /></p>
2411
 
2412
- <p>In distributed training, many concepts sound easy enough when you first hear them, like “Pipeline parallelism just distributes layers on different GPUs”, but we also worked through all the challenging details when implementing those methods. </p>
2413
 
2414
- <p>However, not only did you learn something in the process, but we also want to share some insights we gained along the way, as well as give you ideas on what to work on next if you want to gain more experience in distributed training.</p>
2415
-
2416
- <p>Let’s start with a brief recap of all the things we covered in these past hours and days!</p>
2417
-
2418
- <h3>What you learned</h3>
2419
-
2420
- <p>Working through this whole blog post you mastered a ranged of concepts:</p>
2421
-
2422
- <ul>
2423
- <li>Basic principle of model training</li>
2424
- <li>Collective communication primitives </li>
2425
- <li>Memory anatomy of a LLM</li>
2426
- <li>Distributed training with DP and ZeRO </li>
2427
- <li>Model parallelism with TP, SP, CP and PP</li>
2428
- <li>Fast kernels and mixed precision training</li>
2429
- <li>Overlapping communication and computation</li>
2430
- <li>Profiling distributed training</li>
2431
- </ul>
2432
 
2433
- <p>Furthermore, you saw code implementations of most methods and how to benchmark a distributed training. But it hasn’t been only a learning experience for you, also we learned a thing or two!</p>
2434
 
2435
  <h3>What we learned</h3>
2436
 
2437
- <p>Running benchmarks on a cluster turned out to be much more challenging than we initially expected! What seemed like straightforward tasks often became complex debugging sessions:
 
 
 
 
 
2438
  </p>
2439
 
2440
  <ul>
2441
  <li>PyTorch processes would sometimes fail to clean up properly</li>
2442
  <li>Slurm job manager would forcefully terminate jobs, leading to node failures </li>
2443
  <li>Simple benchmarks that should take minutes would stretch into hours</li>
2444
- <li>We had to spend significant time:</li>
2445
- <ul>
2446
- <li>Minimizing cluster restart times and optimize idle time</li>
2447
- <li>Analyzing detailed NCCL debug logs</li>
2448
- <li>Understand memory usage patterns and CUDA memory allocator behaviors</li>
2449
- <li>Improving pipeline parallelism performance on multi-node</li>
2450
- </ul>
 
 
 
2451
  </ul>
2452
 
2453
  <p>These challenges deserve their own story, but they taught us valuable lessons about the complexities of distributed training infrastructure. What looks simple in theory often requires careful attention to many moving parts in practice.</p>
2454
 
 
2455
  <p>Let's analyze the results of our benchmarks and understand how different configurations affect each other. All benchmarks were run with a sequence length of 4096 and a global batch size of 1M tokens. We'll look at two key visualizations that help illustrate our findings.
2456
  </p>
2457
 
@@ -2462,7 +2454,6 @@
2462
 
2463
  <p>To complement this, let's look at the relationships between different parameters:</p>
2464
 
2465
- <!-- <p><img alt="image.png" src="/assets/images/what_we_learnt_parallel_coordinates.html" /></p> -->
2466
  <iframe id="plotFrame" src="/assets/images/what_we_learnt_parallel_coordinates.html" height="540" width="1000" scrolling="no" frameborder="0"></iframe>
2467
 
2468
  <p>Parallel coordinates plot showing the relationship between different model parallelism configurations (Data Parallel degree, Tensor Parallel degree, Pipeline Parallel degree), training hyperparameters (gradient accumulation steps, micro batch size), ZeRO stage and the resulting Model FLOPs Utilization (MFU). Each line represents a different training configuration, with colors indicating the MFU value - warmer colors show higher efficiency.</p>
@@ -2476,19 +2467,19 @@
2476
  <li>Larger models present a different challenge. As model size increases, memory requirements grow substantially. This creates two scenarios with fewer nodes: either the model doesn't fit at all, or it barely fits but runs inefficiently due to operating near the GPU memory limits.</li>
2477
  <li>Our benchmarks demonstrate how performance heavily depends on implementation quality. When we first implemented both parallelism strategies, Tensor Parallelism (TP) outperformed Pipeline Parallelism (PP). After optimizing our PP code, it became the faster option. Now that we're improving the communication overlap in our TP implementation, we expect it to regain the performance lead.</li>
2478
  </ol>
 
 
2479
 
2480
- <p>These findings highlight the challenges of reproducing theoretical results in practice, especially given the limited availability of production training code. Through open-source projects like picotron and nanotron, we hope to make these distributed training techniques more accessible and foster collaboration on simpler, more efficient codebases that help researchers and practitioners make the most of their hardware resources.</p>
2481
-
2482
- <h3>What’s next?</h3>
2483
 
2484
- <p>You should have a good overview of all the distributed training concepts but there are still things to learn and details we couldn’t cover. To get deeper in the field we recommend doing some of the following steps:</p>
2485
  <ul>
2486
  <li>Carefully read some of the landmark or very recent papers. You can find a list of some of the most impactful papers in <a target="_self" href="#references" class="">References</a>.</li>
2487
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2488
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2489
  </ul>
2490
 
2491
- <p>We hope this blog helps you get started in distributed training or helps you to better understand methods that you may already be applying by using some distributed training frameworks.</p>
2492
 
2493
  <h2>References</h2>
2494
 
 
2405
  <h2>Conclusion</h2>
2406
 
2407
 
2408
+ <p>Congratulations, dear reader, you made it to the end! We've completed quite a journey: we started with understanding how to train a simple model on a single GPU and worked our way up to the intricate techniques used to efficiently train massive language models like Llama-405B and DeepSeek-V3 on thousands of GPUs. By now, you can read a diagram like Llama-3's 4D parallel setup with ease:</p>
2409
 
2410
  <p><img alt="image.png" src="/assets/images/conclusion_llama3_parallelism.png" /></p>
2411
 
2412
+ <p>Orchestrating large clusters of GPUs to train LLMs efficiently is no easy feat. We learned how to optimize computation and communication between GPUs so that they run at maximum utilization at all times. This involves choosing the right parallelization strategy for a given model and cluster size, overlapping communication and computation wherever possible, and writing custom kernels that take the hardware layout into account to perform each operation as fast as possible on the GPU.</p>
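+ <p>As a concrete reminder of what "overlapping communication and computation" looks like in code, here is a minimal PyTorch sketch (illustrative only, not the picotron or nanotron implementation; <code>buckets</code> and <code>compute_next</code> are made-up names): each gradient bucket's all-reduce is launched asynchronously so that NCCL moves data while the GPU keeps computing, and we only wait on the handles at the end.</p>
+ <pre><code>
+ import torch.distributed as dist
+ 
+ def backward_with_overlap(buckets):
+     """Toy overlap of gradient all-reduce with remaining backward compute.
+ 
+     `buckets` is a list of (grad_tensor, compute_next) pairs, where
+     `compute_next` stands in for the backward work of the next layer.
+     """
+     handles = []
+     for grad, compute_next in buckets:
+         # async_op=True returns immediately with a handle...
+         handles.append(dist.all_reduce(grad, op=dist.ReduceOp.SUM, async_op=True))
+         # ...so this compute runs while NCCL transfers the gradients.
+         compute_next()
+     # Block only once there is no more compute left to overlap with.
+     for handle in handles:
+         handle.wait()
+ </code></pre>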
2413
 
2414
+ <p>You might still believe that this knowledge is a bit niche and only concerns the small set of people who pretrain LLMs. Historically that may have been true, but as models grow rapidly, even people who only want to fine-tune them increasingly need distributed training setups. So diving deeper into all things distributed might prove very timely.</p>
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
2415
 
2416
+ <p>This has been a long learning journey, but not just for you! Running thousands of benchmarks on a GPU cluster was more challenging than we anticipated, and we want to share a few highlights of that experience.</p>
2417
 
2418
  <h3>What we learned</h3>
2419
 
2420
+ <p>Our goal for this blog post was not only to discuss theory and implementations but also to provide actual data points. So the plan was simple: let's run every possible distributed configuration for every model across a range of cluster sizes (namely 1-64 nodes of 8xH100s). Even after excluding impossible configurations, we still needed to run thousands of experiments.</p>
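+ <p>To give a sense of the size of that search space, here is a back-of-the-envelope sketch of the kind of enumeration involved (the ranges and constraints below are illustrative, not our exact sweep script): we combine parallelism degrees and per-GPU settings, then keep only the layouts that actually tile the available GPUs.</p>
+ <pre><code>
+ from itertools import product
+ 
+ def candidate_configs(num_gpus, gpus_per_node=8):
+     """Illustrative sweep: keep only parallelism layouts that tile the cluster."""
+     degrees = [1, 2, 4, 8, 16, 32, 64]
+     configs = []
+     for dp, tp, pp, zero, mbs in product(degrees, degrees, degrees, [0, 1, 3], [1, 2, 4, 8]):
+         if dp * tp * pp != num_gpus:
+             continue  # the three degrees must multiply to the total GPU count
+         if gpus_per_node % tp != 0:
+             continue  # keep tensor parallelism within a node (NVLink domain)
+         configs.append(dict(dp=dp, tp=tp, pp=pp, zero=zero, mbs=mbs))
+     return configs
+ 
+ # e.g. 8 nodes of 8xH100 = 64 GPUs
+ print(len(candidate_configs(64)))
+ </code></pre>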
2421
+
2422
+ <aside>We want to take this opportunity to apologize to our co-workers for blocking most of the science cluster and, in turn, to forgive any threats that may have been whispered.</aside>
2423
+
2424
+ <p>
2425
+ On paper this sounds simple enough: we can easily launch big arrays of jobs on our cluster. However, the troubles began as soon as we launched the first batches:
2426
  </p>
2427
 
2428
  <ul>
2429
  <li>PyTorch processes would sometimes fail to clean up properly</li>
2430
  <li>Slurm job manager would forcefully terminate jobs, leading to node failures </li>
2431
  <li>Simple benchmarks that should take minutes would stretch into hours</li>
2432
+ <li>Some jobs would hang indefinitely (one mitigation is sketched after this list)</li>
2433
+ </ul>
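+ <p>One example of the guard rails we ended up adding (a generic sketch under our own naming, not our actual launcher): giving the process group an explicit timeout turns a silent hang into a loud failure that the job scheduler can detect and requeue.</p>
+ <pre><code>
+ import os
+ from datetime import timedelta
+ 
+ import torch.distributed as dist
+ 
+ def init_distributed_with_timeout(minutes=10):
+     """Fail fast instead of hanging forever on a stuck collective."""
+     # Surface asynchronous NCCL errors so the timeout can actually fire
+     # (the exact env var name differs across PyTorch versions).
+     os.environ.setdefault("TORCH_NCCL_ASYNC_ERROR_HANDLING", "1")
+     dist.init_process_group(backend="nccl", timeout=timedelta(minutes=minutes))
+ </code></pre>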
2434
+
2435
+ <p>Running all the experiments in a finite amount of time therefore required some additional engineering. In particular, we spent a significant amount of time on the following:</p>
2436
+
2437
+ <ul>
2438
+ <li>Minimizing cluster restart times and optimizing idle time</li>
2439
+ <li>Analyzing detailed NCCL debug logs</li>
2440
+ <li>Understanding memory usage patterns and CUDA memory allocator behavior (see the sketch after this list)</li>
2441
+ <li>Improving pipeline parallelism performance in multi-node setups</li>
2442
  </ul>
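+ <p>On the memory side, much of the insight came from PyTorch's built-in allocator statistics, combined with NCCL's own logs (<code>NCCL_DEBUG=INFO</code>) for the communication side. A minimal sketch of the kind of probe we mean (function and tag names are ours):</p>
+ <pre><code>
+ import torch
+ 
+ def log_memory(tag, device=0):
+     """Print allocator statistics to spot peaks and fragmentation."""
+     gib = 1024 ** 3
+     allocated = torch.cuda.memory_allocated(device) / gib  # live tensors
+     reserved = torch.cuda.memory_reserved(device) / gib    # held by the caching allocator
+     peak = torch.cuda.max_memory_allocated(device) / gib   # high-water mark
+     print(f"[{tag}] allocated={allocated:.2f} GiB "
+           f"reserved={reserved:.2f} GiB peak={peak:.2f} GiB")
+     torch.cuda.reset_peak_memory_stats(device)
+ </code></pre>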
2443
 
2444
  <p>These challenges deserve their own story, but they taught us valuable lessons about the complexities of distributed training infrastructure. What looks simple in theory often requires careful attention to many moving parts in practice.</p>
2445
 
2446
+ <!--
2447
  <p>Let's analyze the results of our benchmarks and understand how different configurations affect each other. All benchmarks were run with a sequence length of 4096 and a global batch size of 1M tokens. We'll look at two key visualizations that help illustrate our findings.
2448
  </p>
2449
 
 
2454
 
2455
  <p>To complement this, let's look at the relationships between different parameters:</p>
2456
 
 
2457
  <iframe id="plotFrame" src="/assets/images/what_we_learnt_parallel_coordinates.html" height="540" width="1000" scrolling="no" frameborder="0"></iframe>
2458
 
2459
  <p>Parallel coordinates plot showing the relationship between different model parallelism configurations (Data Parallel degree, Tensor Parallel degree, Pipeline Parallel degree), training hyperparameters (gradient accumulation steps, micro batch size), ZeRO stage and the resulting Model FLOPs Utilization (MFU). Each line represents a different training configuration, with colors indicating the MFU value - warmer colors show higher efficiency.</p>
 
2467
  <li>Larger models present a different challenge. As model size increases, memory requirements grow substantially. This creates two scenarios with fewer nodes: either the model doesn't fit at all, or it barely fits but runs inefficiently due to operating near the GPU memory limits.</li>
2468
  <li>Our benchmarks demonstrate how performance heavily depends on implementation quality. When we first implemented both parallelism strategies, Tensor Parallelism (TP) outperformed Pipeline Parallelism (PP). After optimizing our PP code, it became the faster option. Now that we're improving the communication overlap in our TP implementation, we expect it to regain the performance lead.</li>
2469
  </ol>
2470
+ -->
2471
+ <p>Reproducing theoretical results in practice is challenging, especially given the limited availability of production training code. Through open-source projects like picotron and nanotron, we hope to make these distributed training techniques more accessible and foster collaboration on simpler, more efficient codebases that help researchers and practitioners make the most of their hardware resources.</p>
2472
 
2473
+ <h3>So, what’s next?</h3>
 
 
2474
 
2475
+ <p>You now have a good overview of the main distributed training concepts, but at the same time we have only scratched the surface of some of these topics. There are many ways to dive deeper into the field; here are some steps we recommend:</p>
2476
  <ul>
2477
  <li>Carefully read some of the landmark or very recent papers. You can find a list of some of the most impactful papers in <a target="_self" href="#references" class="">References</a>.</li>
2478
  <li>Start from scratch and implement an algorithm yourself. Often a method only fully “clicks” if you implemented it yourself.</li>
2479
  <li>Dive into one of the widely used frameworks and start contributing: fix bugs, answer issues, or implement a new feature. That’s the best way to get in any ML field!</li>
2480
  </ul>
2481
 
2482
+ <p>We hope this book helps you get started in distributed training and that you will train the next generation of awesome models to the hum of your GPU cluster!</p>
2483
 
2484
  <h2>References</h2>
2485