continuing fixes thom

#58
by thomwolf HF staff - opened
assets/images/memorycoalescing.png CHANGED

Git LFS Details

  • SHA256: 96ed02089819123c2ec48b178d4b673cc4f628f4c903f02ad98c5588cf3e1931
  • Pointer size: 131 Bytes
  • Size of remote file: 155 kB

Git LFS Details

  • SHA256: 801f674da0891f89cfa79a5a6dca52c686bdb1159b965a0ab60e0ea2b936a9e5
  • Pointer size: 131 Bytes
  • Size of remote file: 255 kB
dist/assets/images/memorycoalescing.png CHANGED

Git LFS Details

  • SHA256: d23334cf322cb02cac6269f02d0e1f6d1922a6bef27ba241d71733b375158921
  • Pointer size: 130 Bytes
  • Size of remote file: 49.6 kB

Git LFS Details

  • SHA256: 1094fe9aeb953c743791445ee6d7e73a5a89fa85fe60f4312266d1265e7c591a
  • Pointer size: 130 Bytes
  • Size of remote file: 94.1 kB
dist/index.html CHANGED
@@ -1681,7 +1681,7 @@
1681
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1682
 
1683
  <div class="large-image-background">
1684
- <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1685
  </div>
1686
 
1687
 
@@ -1828,7 +1828,7 @@
1828
  </tbody>
1829
  </table>
1830
 
1831
- <p>Clearly, none of these techniques is a silver bullet for magical scaling we'll often have to combine them in one way or another. Can we actually come up with a few rules that help finding a good starting point to select and combine them? This will be the topic of our next section.</p>
1832
 
1833
  <h2>How to Find the Best Training Configuration</h2>
1834
 
@@ -1838,9 +1838,9 @@
1838
 
1839
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1840
 
1841
- <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single instance for now - even though we may use DP for ZeRO). There are two general cases.</p>
1842
-
1843
- <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1844
  <ul>
1845
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1846
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
@@ -1852,20 +1852,22 @@
1852
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to becomes inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
1853
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1854
  </ul>
 
 
1855
 
1856
- <p>Special considerations:</p>
1857
  <ul>
1858
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1859
  <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1860
  </ul>
1861
-
1862
- <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
1863
  <ul>
1864
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1865
  <li>You can increase gradient accumulation to process larger batches with limited memory.
1866
  </li>
1867
  </ul>
1868
-
1869
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1870
 
1871
  <h3>Step 2: Achieving Target Global Batch Size </h3>
@@ -1903,7 +1905,17 @@
1903
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1904
  -->
1905
 
1906
- <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. It involves a better understanding of how GPU operate at a low level and is among the necessary knowledge to be able to take maximal advantage of large GPU clusters.</p>
1907
 
1908
  <p>Time to turn the lights off and activate CUDA mode! </p>
1909
 
@@ -2148,12 +2160,12 @@
2148
  </div>
2149
 
2150
 
2151
- <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
2152
 
2153
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
2154
 
2155
 
2156
- <p>To improve our kernel we can change the way the coordinates x and y are calculated like the following : </p>
2157
 
2158
  <d-code block language="clike">
2159
  const int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
@@ -2172,11 +2184,12 @@
2172
 
2173
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2174
 
2175
- <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2176
-
 
2177
 
2178
- <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>!</p>
2179
- <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2180
 
2181
 
2182
  <h4>Tiling</h4>
@@ -2186,12 +2199,14 @@
2186
 
2187
  <p>In matrix multiplication for example, each thread in a block may need elements from two matrices, say A and B. If each thread independently loads the row and column it needs from global memory, we end up with many redundant loads, as multiple threads in a block will access overlapping data. Instead, we can use tiling to load a block (or tile) of A and B into shared memory just once, allowing all threads in that block to reuse the same shared data.</p>
2188
 
2189
- <p>In the tiling approach, each iteration involves all threads within a block cooperatively loading two tiles—one from matrix A and another from matrix B —into shared memory. Specifically, threads load a tile of matrix A (of size <code>BLOCK_SIZE_M</code> by <code>BLOCK_SIZE_K</code>) and a tile of matrix B (of size <code>BLOCK_SIZE_K</code> by <code>BLOCK_SIZE_N</code>). Once the tiles are in shared memory, the threads perform matrix multiplication on these tiles, enabling efficient computation since all necessary data is quickly accessible. The results of the tile multiplication are stored in an accumulation matrix that holds intermediate results. After each iteration, the results from the current tile multiplication are added to this accumulation matrix, continuing until all tiles from both matrices have been processed.</p>
2190
 
2191
  <p><img alt="image.png" src="/assets/images/tiling.png" /></p>
2192
- <p>From https://cnugteren.github.io/tutorial/pages/page4.html</p>
 
 
2193
 
2194
- <p>The important parts to understand the implementation are below (for simplicity we consider a square shaped tile) : </p>
2195
 
2196
  <d-code block language="clike">
2197
  // Set pointers to the starting elements
@@ -2221,6 +2236,7 @@
2221
  }
2222
  C[localRow * N + localCol] = sum;
2223
  </d-code>
 
2224
 
2225
  <p>Each thread begins by loading one element from both <strong>Matrix A</strong> and <strong>Matrix B</strong> into shared memory. In this scenario, achieving coalesced memory access is straightforward, by assigning <code>threadIdx.x</code> as the <strong>local column index (localCol)</strong>, threads within the same warp will access adjacent elements of both matrices. After each thread in the block completes loading its elements into shared memory (ensured by calling <code>__syncthreads()</code>), they proceed to compute the dot product of the two tiles. Once the threads have iterated through all the tiles—horizontally for <strong>Matrix A</strong> and vertically for <strong>Matrix B</strong>—the resulting sum is stored in the corresponding location of <strong>Matrix C</strong>.</p>
2226
 
@@ -2236,11 +2252,13 @@
2236
  <p><img alt="image.png" src="/assets/images/threadcoarsening.png" /></p>
2237
 
2238
 
2239
- <p>The meaning of the states can be found in the <a href="https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference">Profiling Guide</a>, specifically in the <strong>Warp Stall Reasons</strong> section. There we can read that:</p>
2240
 
2241
  <p><em><code>"smsp__pcsamp_warps_issue_stalled_mio_throttle</code>: Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory accesses, trying to use fewer but wider loads can reduce pipeline pressure."</em></p>
2242
 
2243
- <p>So it seems warps are stalling waiting for shared memory accesses to return ! To resolve this issue we can apply the <strong>Thread Coarsening</strong> technique by merging several threads into a single coarsened thread, we can significantly reduce shared memory accesses because each coarsened thread can handle multiple output elements which would increase the arithmetic intensity of the kernel.</p>
 
 
2244
 
2245
  <h4>Minimizing Control Divergence</h4>
2246
 
@@ -2248,6 +2266,8 @@
2248
 
2249
  <p>Control divergence occurs when threads within the same warp take different execution paths. For instance, if a conditional statement (like an <code>if</code> statement) leads to some threads executing one block of code while others execute a different block, the warp must serialize these executions, resulting in idle threads waiting for others to complete. To minimize control divergence, we need to design kernels to ensure that threads within the same warp follow the same execution path. This can be achieved by restructuring code to reduce branching, using data structures that ensure all threads follow similar execution paths, or employing techniques such as predication.</p>
2250
 
 
 
2251
  <p>We have covered some of the main considerations when writing custom kernels and improving the performance and memory footprint of GPU operations. But there’s one more important concept before moving to a real example which is “fusing kernels”.</p>
2252
 
2253
  <h3>Fused Kernels</h3>
 
1681
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1682
 
1683
  <div class="large-image-background">
1684
+ <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1685
  </div>
1686
 
1687
 
 
1828
  </tbody>
1829
  </table>
1830
 
1831
+ <p>Clearly, none of these techniques is a silver bullet for magical scaling and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point for choosing and combining them? This will be the topic of our next section.</p>
1832
 
1833
  <h2>How to Find the Best Training Configuration</h2>
1834
 
 
1838
 
1839
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1840
 
1841
+ <p>First, we need to figure out how we can fit a full model instance on our GPUs. There are two general cases.</p>
1842
+
1843
+ <p><strong>GPU-rich case 🤑</strong> - when you have plenty of GPUs available:
1844
  <ul>
1845
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1846
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
 
1852
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to becomes inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
1853
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1854
  </ul>
1855
+ <aside>We focus on fitting a single instance for now. Even though we may use DP for ZeRO to achieve this goal, we're only interested here in the model-parameter memory savings it provides when used with ZeRO-3.</aside>
1856
+ </p>
1857
 
1858
+ <p>Special considerations:
1859
  <ul>
1860
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1861
  <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1862
  </ul>
1863
+ </p>
1864
+ <p><strong>GPU-poor case 😭</strong> - when you might be low on GPU resources:
1865
  <ul>
1866
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1867
  <li>You can increase gradient accumulation to process larger batches with limited memory.
1868
  </li>
1869
  </ul>
1870
+ </p>
1871
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1872
 
1873
  <h3>Step 2: Achieving Target Global Batch Size </h3>
 
1905
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1906
  -->
1907
 
1908
+ <!-- <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. It involves a better understanding of how GPU operate at a low level and is among the necessary knowledge to be able to take maximal advantage of large GPU clusters.</p> -->
1909
+
1910
+
1911
+ <hr>
1912
+
1913
+ <p>This concludes our very deep dive into the distribution methods of 5D parallelism. </p>
1914
+
1915
+
1916
+ <p>Taking a step back, our discussion so far has often relied on a critical assumption: that computation and communication can be efficiently overlapped on GPUs without any impact on the computation throughput. The reality is more nuanced. When using common communication primitives like NCCL send/recv, we face hidden contention between computation and communication resources, as communication kernels usually run on the same GPU streaming multiprocessors (SMs) as the computation, which reduces throughput when the two are overlapped. To truly optimize our distributed training, we need to dive deeper into the GPU architecture itself.</p>
1917
+ <aside>Additionally, the synchronization patterns involved in overlapping computation and communication may not always be optimal for our parallel strategies. You can find an example in <a target="_blank" href="https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487">this blog post</a> by the PyTorch team.</aside>
1918
+
1919
 
1920
  <p>Time to turn the lights off and activate CUDA mode! </p>
1921
 
 
2160
  </div>
2161
 
2162
 
2163
+ <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration <code>i = 0</code> thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math> and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, preventing memory accesses from being coalesced.</p>
2164
 
2165
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
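+ <p>To make the access pattern concrete, here is a rough sketch of the kind of naive coordinate mapping described above (hypothetical code, not necessarily the exact kernel from earlier in the post), with comments showing which addresses threads <code>(0, 0)</code> and <code>(1, 0)</code> end up reading:</p>
+ <d-code block language="clike">
+ // Hypothetical naive mapping: consecutive threadIdx.x values map to
+ // consecutive *rows* of A, so threads of the same warp read addresses
+ // that are K floats apart instead of adjacent ones.
+ const int x = blockIdx.x * blockDim.x + threadIdx.x;  // row index in A
+ const int y = blockIdx.y * blockDim.y + threadIdx.y;  // column index in B
+ float sum = 0.0f;
+ for (int i = 0; i < K; ++i)
+     // thread (0, 0) reads A[0 * K + i], thread (1, 0) reads A[1 * K + i]:
+     // a stride of K elements, so the loads cannot be coalesced.
+     sum += A[x * K + i] * B[i * N + y];
+ C[x * N + y] = sum;
+ </d-code>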
2166
 
2167
 
2168
+ <p>To improve the performance of our kernel, we can change the way the coordinates <code>x</code> and <code>y</code> are calculated as follows:</p>
2169
 
2170
  <d-code block language="clike">
2171
  const int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
 
2184
 
2185
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2186
 
2187
+ <div class="large-image-background">
2188
+ <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2189
+ </div>
2190
 
2191
+ <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2192
+ <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
2193
 
2194
 
2195
  <h4>Tiling</h4>
 
2199
 
2200
  <p>In matrix multiplication for example, each thread in a block may need elements from two matrices, say A and B. If each thread independently loads the row and column it needs from global memory, we end up with many redundant loads, as multiple threads in a block will access overlapping data. Instead, we can use tiling to load a block (or tile) of A and B into shared memory just once, allowing all threads in that block to reuse the same shared data.</p>
2201
 
2202
+ <p>In the tiling approach, each iteration involves all threads within a block cooperatively loading two tiles—one from matrix A and another from matrix B—into shared memory. Specifically, threads load a tile of matrix A (of size <code>BLOCK_SIZE_M</code> by <code>BLOCK_SIZE_K</code>) and a tile of matrix B (of size <code>BLOCK_SIZE_K</code> by <code>BLOCK_SIZE_N</code>). Once the tiles are in shared memory, the threads perform matrix multiplication on these tiles, enabling efficient computation since all necessary data is quickly accessible. The results of the tile multiplication are stored in an accumulation matrix that holds intermediate results. After each iteration, the results from the current tile multiplication are added to this accumulation matrix, continuing until all tiles from both matrices have been processed.</p>
2203
 
2204
  <p><img alt="image.png" src="/assets/images/tiling.png" /></p>
2205
+ <div class="figure-legend">
2206
+ <p>From <a target="_blank" href="https://cnugteren.github.io/tutorial/pages/page4.html">https://cnugteren.github.io/tutorial/pages/page4.html</a></p>
2207
+ </div>
2208
 
2209
+ <p>Let's take a look at the important parts of the implementation that you need to understand:</p>
2210
 
2211
  <d-code block language="clike">
2212
  // Set pointers to the starting elements
 
2236
  }
2237
  C[localRow * N + localCol] = sum;
2238
  </d-code>
2239
+ <aside>For simplicity, we consider a square-shaped tile.</aside>
2240
 
2241
  <p>Each thread begins by loading one element from both <strong>Matrix A</strong> and <strong>Matrix B</strong> into shared memory. In this scenario, achieving coalesced memory access is straightforward, by assigning <code>threadIdx.x</code> as the <strong>local column index (localCol)</strong>, threads within the same warp will access adjacent elements of both matrices. After each thread in the block completes loading its elements into shared memory (ensured by calling <code>__syncthreads()</code>), they proceed to compute the dot product of the two tiles. Once the threads have iterated through all the tiles—horizontally for <strong>Matrix A</strong> and vertically for <strong>Matrix B</strong>—the resulting sum is stored in the corresponding location of <strong>Matrix C</strong>.</p>
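+ <p>For reference, since the excerpt above omits most of the kernel body, a self-contained square-tile version along these lines could look as follows. This is only a sketch assuming square matrices with <code>N</code> divisible by <code>TILE</code> and a <code>TILE</code> x <code>TILE</code> block, not the exact code used for the measurements:</p>
+ <d-code block language="clike">
+ #define TILE 16
+ 
+ __global__ void tiled_matmul(const float *A, const float *B, float *C, int N) {
+     __shared__ float As[TILE][TILE];
+     __shared__ float Bs[TILE][TILE];
+ 
+     const int row = blockIdx.y * TILE + threadIdx.y;  // output row handled by this thread
+     const int col = blockIdx.x * TILE + threadIdx.x;  // output column handled by this thread
+     float sum = 0.0f;
+ 
+     for (int t = 0; t < N / TILE; ++t) {
+         // All threads of the block cooperatively load one tile of A and one of B;
+         // threadIdx.x indexes the column, so loads within a warp are coalesced.
+         As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
+         Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
+         __syncthreads();  // wait until both tiles are fully loaded
+ 
+         for (int k = 0; k < TILE; ++k)
+             sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
+         __syncthreads();  // make sure no thread overwrites a tile still in use
+     }
+     C[row * N + col] = sum;
+ }
+ </d-code>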
2242
 
 
2252
  <p><img alt="image.png" src="/assets/images/threadcoarsening.png" /></p>
2253
 
2254
 
2255
+ <p>The meaning of these cryptic state names can be found in <a href="https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference">NVIDIA's Profiling Guide</a>, specifically in the <strong>Warp Stall Reasons</strong> section. There we can read that:</p>
2256
 
2257
  <p><em><code>"smsp__pcsamp_warps_issue_stalled_mio_throttle</code>: Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory accesses, trying to use fewer but wider loads can reduce pipeline pressure."</em></p>
2258
 
2259
+ <p>So it seems warps are stalling waiting for shared memory accesses to return! To solve this issue, we can apply a technique called <strong>Thread Coarsening</strong>, which involves merging several threads into a single coarsened thread. This significantly reduces shared memory accesses, as each coarsened thread can handle multiple output elements, which increases the arithmetic intensity of the kernel.</p>
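+ <p>As a purely illustrative sketch (not the kernel profiled above; the name <code>COARSEN</code> and the indexing are made up here), each thread can be made to compute several consecutive output columns so that every value it loads is reused multiple times:</p>
+ <d-code block language="clike">
+ #define COARSEN 4
+ 
+ __global__ void coarsened_matmul(const float *A, const float *B, float *C,
+                                  int M, int N, int K) {
+     const int row = blockIdx.y * blockDim.y + threadIdx.y;
+     const int colStart = (blockIdx.x * blockDim.x + threadIdx.x) * COARSEN;
+     if (row >= M) return;
+ 
+     float acc[COARSEN] = {0.0f};
+     for (int k = 0; k < K; ++k) {
+         const float a = A[row * K + k];        // loaded once...
+         for (int c = 0; c < COARSEN; ++c) {    // ...reused for COARSEN outputs
+             const int col = colStart + c;
+             if (col < N) acc[c] += a * B[k * N + col];
+         }
+     }
+     for (int c = 0; c < COARSEN; ++c) {
+         const int col = colStart + c;
+         if (col < N) C[row * N + col] = acc[c];
+     }
+ }
+ </d-code>
+ <p>Applied to the tiled kernel above, the same idea means a coarsened thread reads a value from shared memory once and uses it for several output elements, which reduces the pressure on the MIO pipeline mentioned in the profiler warning.</p>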
2260
+
2261
+ <p>Let's briefly mention one last important consideration when writing or improving custom kernels: <strong>Minimizing Control Divergence</strong>.</p>
2262
 
2263
  <h4>Minimizing Control Divergence</h4>
2264
 
 
2266
 
2267
  <p>Control divergence occurs when threads within the same warp take different execution paths. For instance, if a conditional statement (like an <code>if</code> statement) leads to some threads executing one block of code while others execute a different block, the warp must serialize these executions, resulting in idle threads waiting for others to complete. To minimize control divergence, we need to design kernels to ensure that threads within the same warp follow the same execution path. This can be achieved by restructuring code to reduce branching, using data structures that ensure all threads follow similar execution paths, or employing techniques such as predication.</p>
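+ <p>As a tiny illustration (hypothetical code, not taken from the kernels above), the first version below forces the even and odd lanes of a warp onto different paths, while the second computes both candidate results and selects one, so every thread follows the same path and the compiler can rely on predication:</p>
+ <d-code block language="clike">
+ // Divergent: even and odd lanes of a warp take different branches,
+ // so the two paths are executed one after the other.
+ if (threadIdx.x % 2 == 0) out[i] = a[i] * 2.0f;
+ else                      out[i] = a[i] + 1.0f;
+ 
+ // Branch-free alternative: compute both candidates, then select.
+ const float doubled = a[i] * 2.0f;
+ const float shifted = a[i] + 1.0f;
+ out[i] = (threadIdx.x % 2 == 0) ? doubled : shifted;
+ </d-code>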
2268
 
2269
+ <hr>
2270
+
2271
  <p>We have covered some of the main considerations when writing custom kernels and improving the performance and memory footprint of GPU operations. But there’s one more important concept before moving to a real example which is “fusing kernels”.</p>
2272
 
2273
  <h3>Fused Kernels</h3>
dist/style.css CHANGED
@@ -436,3 +436,19 @@ d-code {
436
  display: flex;
437
  justify-content: center; /* This will center your image */
438
  }
439
+
440
+ d-article li {
441
+ margin-bottom: 0.0em;
442
+ }
443
+
444
+ d-article ul ul {
445
+ margin-bottom: 0.0em;
446
+ }
447
+
448
+ d-article ol ol {
449
+ margin-bottom: 0.0em;
450
+ }
451
+
452
+ d-article hr {
453
+ grid-column: text;
454
+ }
src/index.html CHANGED
@@ -1681,7 +1681,7 @@
1681
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1682
 
1683
  <div class="large-image-background">
1684
- <img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1685
  </div>
1686
 
1687
 
@@ -1828,7 +1828,7 @@
1828
  </tbody>
1829
  </table>
1830
 
1831
- <p>Clearly, none of these techniques is a silver bullet for magical scaling we'll often have to combine them in one way or another. Can we actually come up with a few rules that help finding a good starting point to select and combine them? This will be the topic of our next section.</p>
1832
 
1833
  <h2>How to Find the Best Training Configuration</h2>
1834
 
@@ -1838,9 +1838,9 @@
1838
 
1839
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1840
 
1841
- <p>First, we need to figure out how we can fit a full model instance on our GPUs (we focus on a single instance for now - even though we may use DP for ZeRO). There are two general cases.</p>
1842
-
1843
- <p>GPU-rich case 🤑 - when you have plenty of GPUs available:</p>
1844
  <ul>
1845
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1846
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
@@ -1852,20 +1852,22 @@
1852
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to becomes inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
1853
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1854
  </ul>
 
 
1855
 
1856
- <p>Special considerations:</p>
1857
  <ul>
1858
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1859
  <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1860
  </ul>
1861
-
1862
- <p>GPU-poor case 😭 - when you might be low on GPU resources:</p>
1863
  <ul>
1864
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1865
  <li>You can increase gradient accumulation to process larger batches with limited memory.
1866
  </li>
1867
  </ul>
1868
-
1869
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1870
 
1871
  <h3>Step 2: Achieving Target Global Batch Size </h3>
@@ -1903,7 +1905,17 @@
1903
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1904
  -->
1905
 
1906
- <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. It involves a better understanding of how GPU operate at a low level and is among the necessary knowledge to be able to take maximal advantage of large GPU clusters.</p>
1907
 
1908
  <p>Time to turn the lights off and activate CUDA mode! </p>
1909
 
@@ -2148,12 +2160,12 @@
2148
  </div>
2149
 
2150
 
2151
- <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration with <code>i = 0</code>, thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math>, and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment repeats across all iterations along the shared dimension, preventing memory accesses from being coalesced.</p>
2152
 
2153
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
2154
 
2155
 
2156
- <p>To improve our kernel we can change the way the coordinates x and y are calculated like the following : </p>
2157
 
2158
  <d-code block language="clike">
2159
  const int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
@@ -2172,11 +2184,12 @@
2172
 
2173
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2174
 
2175
- <p><img alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2176
-
 
2177
 
2178
- <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>!</p>
2179
- <p>Let’s cover another technique you will often see mentioned in the litterature: tiling.</p>
2180
 
2181
 
2182
  <h4>Tiling</h4>
@@ -2186,12 +2199,14 @@
2186
 
2187
  <p>In matrix multiplication for example, each thread in a block may need elements from two matrices, say A and B. If each thread independently loads the row and column it needs from global memory, we end up with many redundant loads, as multiple threads in a block will access overlapping data. Instead, we can use tiling to load a block (or tile) of A and B into shared memory just once, allowing all threads in that block to reuse the same shared data.</p>
2188
 
2189
- <p>In the tiling approach, each iteration involves all threads within a block cooperatively loading two tiles—one from matrix A and another from matrix B —into shared memory. Specifically, threads load a tile of matrix A (of size <code>BLOCK_SIZE_M</code> by <code>BLOCK_SIZE_K</code>) and a tile of matrix B (of size <code>BLOCK_SIZE_K</code> by <code>BLOCK_SIZE_N</code>). Once the tiles are in shared memory, the threads perform matrix multiplication on these tiles, enabling efficient computation since all necessary data is quickly accessible. The results of the tile multiplication are stored in an accumulation matrix that holds intermediate results. After each iteration, the results from the current tile multiplication are added to this accumulation matrix, continuing until all tiles from both matrices have been processed.</p>
2190
 
2191
  <p><img alt="image.png" src="/assets/images/tiling.png" /></p>
2192
- <p>From https://cnugteren.github.io/tutorial/pages/page4.html</p>
 
 
2193
 
2194
- <p>The important parts to understand the implementation are below (for simplicity we consider a square shaped tile) : </p>
2195
 
2196
  <d-code block language="clike">
2197
  // Set pointers to the starting elements
@@ -2221,6 +2236,7 @@
2221
  }
2222
  C[localRow * N + localCol] = sum;
2223
  </d-code>
 
2224
 
2225
  <p>Each thread begins by loading one element from both <strong>Matrix A</strong> and <strong>Matrix B</strong> into shared memory. In this scenario, achieving coalesced memory access is straightforward, by assigning <code>threadIdx.x</code> as the <strong>local column index (localCol)</strong>, threads within the same warp will access adjacent elements of both matrices. After each thread in the block completes loading its elements into shared memory (ensured by calling <code>__syncthreads()</code>), they proceed to compute the dot product of the two tiles. Once the threads have iterated through all the tiles—horizontally for <strong>Matrix A</strong> and vertically for <strong>Matrix B</strong>—the resulting sum is stored in the corresponding location of <strong>Matrix C</strong>.</p>
2226
 
@@ -2236,11 +2252,13 @@
2236
  <p><img alt="image.png" src="/assets/images/threadcoarsening.png" /></p>
2237
 
2238
 
2239
- <p>The meaning of the states can be found in the <a href="https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference">Profiling Guide</a>, specifically in the <strong>Warp Stall Reasons</strong> section. There we can read that:</p>
2240
 
2241
  <p><em><code>"smsp__pcsamp_warps_issue_stalled_mio_throttle</code>: Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory accesses, trying to use fewer but wider loads can reduce pipeline pressure."</em></p>
2242
 
2243
- <p>So it seems warps are stalling waiting for shared memory accesses to return ! To resolve this issue we can apply the <strong>Thread Coarsening</strong> technique by merging several threads into a single coarsened thread, we can significantly reduce shared memory accesses because each coarsened thread can handle multiple output elements which would increase the arithmetic intensity of the kernel.</p>
 
 
2244
 
2245
  <h4>Minimizing Control Divergence</h4>
2246
 
@@ -2248,6 +2266,8 @@
2248
 
2249
  <p>Control divergence occurs when threads within the same warp take different execution paths. For instance, if a conditional statement (like an <code>if</code> statement) leads to some threads executing one block of code while others execute a different block, the warp must serialize these executions, resulting in idle threads waiting for others to complete. To minimize control divergence, we need to design kernels to ensure that threads within the same warp follow the same execution path. This can be achieved by restructuring code to reduce branching, using data structures that ensure all threads follow similar execution paths, or employing techniques such as predication.</p>
2250
 
 
 
2251
  <p>We have covered some of the main considerations when writing custom kernels and improving the performance and memory footprint of GPU operations. But there’s one more important concept before moving to a real example which is “fusing kernels”.</p>
2252
 
2253
  <h3>Fused Kernels</h3>
 
1681
  <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and can be combined with both Pipeline Parallelism and ZeRO-3 as it relies on the distributive property of matrix multiplications which allows weights and activations to be sharded and computed independently before being combined.</p>
1682
 
1683
  <div class="large-image-background">
1684
+ <img alt="TP & SP diagram" src="/assets/images/5d_nutshell_tp_sp.svg" style="width: 1200px; max-width: none;" />
1685
  </div>
1686
 
1687
 
 
1828
  </tbody>
1829
  </table>
1830
 
1831
+ <p>Clearly, none of these techniques is a silver bullet for magical scaling and we'll often have to combine them in one way or another. Can we actually come up with a few rules that would help us find a good starting point for choosing and combining them? This will be the topic of our next section.</p>
1832
 
1833
  <h2>How to Find the Best Training Configuration</h2>
1834
 
 
1838
 
1839
  <h3>Step 1: Fitting a Training Step in Memory</h3>
1840
 
1841
+ <p>First, we need to figure out how we can fit a full model instance on our GPUs. There are two general cases.</p>
1842
+
1843
+ <p><strong>GPU-rich case 🤑</strong> - when you have plenty of GPUs available:
1844
  <ul>
1845
  <li>For models under 10B parameters, you can use a single parallelism technique, e.g. Tensor Parallelism or ZeRO-3/DP with Full Recompute across 8 GPUs</li>
1846
  <li>For models between 10B-100B parameters requiring more than 8 GPUs, you have several options:</li>
 
1852
  <li>At 512+ GPU scale, pure Data Parallelism/ZeRO-3 will start to becomes inefficient due to communication cost - it can be better to then combine DP with either Tensor or Pipeline Parallelism</li>
1853
  <li>At 1024+ GPU scale, a recommended setup can be Tensor Parallelism TP=8 with Data Parallelism (ZeRO-2) and Pipeline Parallelism</li>
1854
  </ul>
1855
+ <aside>We focus on fitting a single instance for now. Even though we may use DP for ZeRO to achieve this goal, we're only interested here in the model-parameter memory savings it provides when used with ZeRO-3.</aside>
1856
+ </p>
1857
 
1858
+ <p>Special considerations:
1859
  <ul>
1860
  <li>For very long sequences, you will probably want to add Context Parallelism (CP) across nodes.</li>
1861
  <li>For Mixture of Experts architectures, you will advantageously use Expert Parallelism (EP) across nodes.</li>
1862
  </ul>
1863
+ </p>
1864
+ <p><strong>GPU-poor case 😭</strong> - when you might be low on GPU resources:
1865
  <ul>
1866
  <li>You can enable full activation recomputation to trade some compute for memory (and train a bit slower).</li>
1867
  <li>You can increase gradient accumulation to process larger batches with limited memory.
1868
  </li>
1869
  </ul>
1870
+ </p>
1871
  <p>Now that we have a first model instance training, we need to make sure we have the right batch size.</p>
1872
 
1873
  <h3>Step 2: Achieving Target Global Batch Size </h3>
 
1905
  <p><img alt="image.png" src="/assets/images/placeholder.png" /></p>
1906
  -->
1907
 
1908
+ <!-- <p>This concludes our very deep dive into 5D parallelism. However, besides scaling our model efficiently across GPUs there is another way to improve model throughput and memory management. It involves a better understanding of how GPU operate at a low level and is among the necessary knowledge to be able to take maximal advantage of large GPU clusters.</p> -->
1909
+
1910
+
1911
+ <hr>
1912
+
1913
+ <p>This concludes our very deep dive into the distribution methods of 5D parallelism. </p>
1914
+
1915
+
1916
+ <p>Taking a step back, our discussion so far has often relied on a critical assumption: that computation and communication can be efficiently overlapped on GPUs without any impact on the computation throughput. The reality is more nuanced. When using common communication primitives like NCCL send/recv, we face hidden contention between computation and communication resources, as communication kernels usually run on the same GPU streaming multiprocessors (SMs) as the computation, which reduces throughput when the two are overlapped. To truly optimize our distributed training, we need to dive deeper into the GPU architecture itself.</p>
1917
+ <aside>Additionally, the synchronization patterns involved in overlapping computation and communication may not always be optimal for our parallel strategies. You can find an example in <a target="_blank" href="https://discuss.pytorch.org/t/distributed-w-torchtitan-introducing-async-tensor-parallelism-in-pytorch/209487">this blog post</a> by the PyTorch team.</aside>
1918
+
1919
 
1920
  <p>Time to turn the lights off and activate CUDA mode! </p>
1921
 
 
2160
  </div>
2161
 
2162
 
2163
+ <p>The reason for this is that in this kernel, two threads in the same block with Thread IDs <code>(0, 0)</code> and <code>(1, 0)</code> (which will end up in the same warp) will both load from the same column of matrix <code>B</code> but different rows of matrix <code>A</code>. Since matrix elements are stored in row-major order (meaning each row's elements are in consecutive memory addresses, as shown in the figure below), in the first iteration <code>i = 0</code> thread <code>(0, 0)</code> will load <d-math>A_{0,0}</d-math> and thread <code>(1, 0)</code> will load <d-math>A_{1,0}</d-math>. These elements are not stored close to each other in memory, and this misalignment will be present at each iteration, preventing memory accesses from being coalesced.</p>
2164
 
2165
  <p><img alt="image.png" src="/assets/images/memorycoalescing4.png" /></p>
2166
 
2167
 
2168
+ <p>To improve the performance of our kernel, we can change the way the coordinates <code>x</code> and <code>y</code> are calculated as follows:</p>
2169
 
2170
  <d-code block language="clike">
2171
  const int x = blockIdx.x * BLOCKSIZE + (threadIdx.x / BLOCKSIZE);
 
2184
 
2185
  <p>When we profile our new kernel, we notice that the warning about uncoalesced memory accesses has disappeared, and <strong>the GPU's memory throughput has increased by approximately 10 times</strong>.</p>
2186
 
2187
+ <div class="large-image-background">
2188
+ <p><img width="1200px" alt="image.png" src="/assets/images/memorycoalescing5.png" /></p>
2189
+ </div>
2190
 
2191
+ <p>We also notice that the execution time of the kernel <strong>decreases by 10x</strong>! Amazing.</p>
2192
+ <p>Now let's cover another technique you will often see mentioned in the literature: <strong>tiling</strong>.</p>
2193
 
2194
 
2195
  <h4>Tiling</h4>
 
2199
 
2200
  <p>In matrix multiplication for example, each thread in a block may need elements from two matrices, say A and B. If each thread independently loads the row and column it needs from global memory, we end up with many redundant loads, as multiple threads in a block will access overlapping data. Instead, we can use tiling to load a block (or tile) of A and B into shared memory just once, allowing all threads in that block to reuse the same shared data.</p>
2201
 
2202
+ <p>In the tiling approach, each iteration involves all threads within a block cooperatively loading two tiles—one from matrix A and another from matrix B—into shared memory. Specifically, threads load a tile of matrix A (of size <code>BLOCK_SIZE_M</code> by <code>BLOCK_SIZE_K</code>) and a tile of matrix B (of size <code>BLOCK_SIZE_K</code> by <code>BLOCK_SIZE_N</code>). Once the tiles are in shared memory, the threads perform matrix multiplication on these tiles, enabling efficient computation since all necessary data is quickly accessible. The results of the tile multiplication are stored in an accumulation matrix that holds intermediate results. After each iteration, the results from the current tile multiplication are added to this accumulation matrix, continuing until all tiles from both matrices have been processed.</p>
2203
 
2204
  <p><img alt="image.png" src="/assets/images/tiling.png" /></p>
2205
+ <div class="figure-legend">
2206
+ <p>From <a target="_blank" href="https://cnugteren.github.io/tutorial/pages/page4.html">https://cnugteren.github.io/tutorial/pages/page4.html</a></p>
2207
+ </div>
2208
 
2209
+ <p>Let's take a look at the important parts of the implementation that you need to understand:</p>
2210
 
2211
  <d-code block language="clike">
2212
  // Set pointers to the starting elements
 
2236
  }
2237
  C[localRow * N + localCol] = sum;
2238
  </d-code>
2239
+ <aside>For simplicity, we consider a square-shaped tile.</aside>
2240
 
2241
  <p>Each thread begins by loading one element from both <strong>Matrix A</strong> and <strong>Matrix B</strong> into shared memory. In this scenario, achieving coalesced memory access is straightforward, by assigning <code>threadIdx.x</code> as the <strong>local column index (localCol)</strong>, threads within the same warp will access adjacent elements of both matrices. After each thread in the block completes loading its elements into shared memory (ensured by calling <code>__syncthreads()</code>), they proceed to compute the dot product of the two tiles. Once the threads have iterated through all the tiles—horizontally for <strong>Matrix A</strong> and vertically for <strong>Matrix B</strong>—the resulting sum is stored in the corresponding location of <strong>Matrix C</strong>.</p>
2242
 
 
2252
  <p><img alt="image.png" src="/assets/images/threadcoarsening.png" /></p>
2253
 
2254
 
2255
+ <p>The meaning of these cryptic state names can be found in <a href="https://docs.nvidia.com/nsight-compute/ProfilingGuide/index.html#metrics-reference">NVIDIA's Profiling Guide</a>, specifically in the <strong>Warp Stall Reasons</strong> section. There we can read that:</p>
2256
 
2257
  <p><em><code>"smsp__pcsamp_warps_issue_stalled_mio_throttle</code>: Warp was stalled waiting for the MIO (memory input/output) instruction queue to be not full. This stall reason is high in cases of extreme utilization of the MIO pipelines, which include special math instructions, dynamic branches, as well as shared memory instructions. When caused by shared memory accesses, trying to use fewer but wider loads can reduce pipeline pressure."</em></p>
2258
 
2259
+ <p>So it seems warps are stalling waiting for shared memory accesses to return! To solve this issue, we can apply a technique called <strong>Thread Coarsening</strong>, which involves merging several threads into a single coarsened thread. This significantly reduces shared memory accesses, as each coarsened thread can handle multiple output elements, which increases the arithmetic intensity of the kernel.</p>
2260
+
2261
+ <p>Let's briefly mention one last important consideration when writing or improving custom kernels: <strong>Minimizing Control Divergence</strong>.</p>
2262
 
2263
  <h4>Minimizing Control Divergence</h4>
2264
 
 
2266
 
2267
  <p>Control divergence occurs when threads within the same warp take different execution paths. For instance, if a conditional statement (like an <code>if</code> statement) leads to some threads executing one block of code while others execute a different block, the warp must serialize these executions, resulting in idle threads waiting for others to complete. To minimize control divergence, we need to design kernels to ensure that threads within the same warp follow the same execution path. This can be achieved by restructuring code to reduce branching, using data structures that ensure all threads follow similar execution paths, or employing techniques such as predication.</p>
2268
 
2269
+ <hr>
2270
+
2271
  <p>We have covered some of the main considerations when writing custom kernels and improving the performance and memory footprint of GPU operations. But there’s one more important concept before moving to a real example which is “fusing kernels”.</p>
2272
 
2273
  <h3>Fused Kernels</h3>
src/style.css CHANGED
@@ -436,3 +436,19 @@ d-code {
436
  display: flex;
437
  justify-content: center; /* This will center your image */
438
  }
439
+
440
+ d-article li {
441
+ margin-bottom: 0.0em;
442
+ }
443
+
444
+ d-article ul ul {
445
+ margin-bottom: 0.0em;
446
+ }
447
+
448
+ d-article ol ol {
449
+ margin-bottom: 0.0em;
450
+ }
451
+
452
+ d-article hr {
453
+ grid-column: text;
454
+ }