more edits (#27)
update (5dfedb3447d2658f8b02374ed7e18850652afd38)

Files changed:
- assets/images/5D_nutshell_tp_sp.svg +3 -0
- assets/images/5d_nutshell_cp.svg +3 -0
- assets/images/5d_nutshell_ep.svg +3 -0
- dist/assets/images/5D_nutshell_tp_sp.svg +3 -0
- dist/assets/images/5d_nutshell_cp.svg +3 -0
- dist/assets/images/5d_nutshell_ep.svg +3 -0
- dist/index.html +29 -12
- dist/style.css +34 -6
- src/index.html +29 -12
- src/style.css +34 -6
assets/images/5D_nutshell_tp_sp.svg        ADDED
assets/images/5d_nutshell_cp.svg           ADDED
assets/images/5d_nutshell_ep.svg           ADDED
dist/assets/images/5D_nutshell_tp_sp.svg   ADDED
dist/assets/images/5d_nutshell_cp.svg      ADDED
dist/assets/images/5d_nutshell_ep.svg      ADDED
dist/index.html
CHANGED
@@ -1510,7 +1510,8 @@
 
 <p><strong>Tensor Parallelism</strong> (with Sequence Parallelism) is naturally complementary and interoperable with both Pipeline Parallelism and ZeRO-3, because it relies on the distributive property of matrix multiplication that allows weights and activations to be sharded and computed independently before being combined. However, TP has two important limitations: First, since its communication operations are part of the critical path of computation, it doesn't scale well beyond a certain point as communication overhead begins to dominate. Second, unlike ZeRO and PP which are model-agnostic, TP requires careful handling of activation sharding - sometimes along the hidden dimension (in the TP region) and sometimes along the sequence dimension (in the SP region) - making it more complex to implement correctly and requiring model-specific knowledge to ensure proper sharding patterns throughout.</p>
 
-<
+<div class="l-page"><img alt="TP & SP diagram" src="/assets/images/5D_nutshell_tp_sp.svg" /></div>
+<!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p>When combining parallelism strategies, TP will typically be kept for high-speed intra-node communications while ZeRO-3 or PP can use parallelism groups spanning lower speed inter-node communications, since their communication patterns are more amenable to scaling. The main consideration is organizing the GPU groups efficiently for each parallelism dimension to maximize throughput and minimize communication overhead, while being mindful of TP's scaling limitations.</p>
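The tensor-parallel paragraph kept as context above rests on the fact that a matrix product can be computed shard-by-shard and recombined. Below is a minimal single-process sketch of that property in plain PyTorch; it is illustrative only (the shapes and the tp degree are made up, and real TP implementations use torch.distributed collectives rather than a Python list of shards).

import torch

torch.manual_seed(0)
batch, d_in, d_out, tp = 4, 8, 16, 2

x = torch.randn(batch, d_in)
w = torch.randn(d_out, d_in)              # full weight of a Linear(d_in, d_out)

reference = x @ w.T                       # unsharded output

shards = w.chunk(tp, dim=0)               # each "rank" keeps d_out / tp output rows
partials = [x @ s.T for s in shards]      # computed independently, no communication
combined = torch.cat(partials, dim=-1)    # the recombination step (an all-gather in practice)

print(torch.allclose(reference, combined))  # True: sharded compute matches the full matmul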
@@ -1520,7 +1521,9 @@
 
 <p><strong>Context Parallelism (CP)</strong> specifically targets the challenge of training with very long sequences by sharding activations along the sequence dimension across GPUs. While most operations like MLPs and LayerNorm can process these sharded sequences independently, attention layers require communication since each token needs access to keys/values from the full sequence. This is handled efficiently through ring attention patterns that overlap computation and communication. CP is particularly valuable when scaling to extreme sequence lengths (128k+ tokens) where even with full activation recomputation the memory requirements for attention would be prohibitive on a single GPU.</p>
 
-<
+<div class="l-page"><img alt="CP diagram" src="/assets/images/5d_nutshell_cp.svg" /></div>
+
+<!-- <p><img alt="image.png" src="/assets/images/placeholder.png" /></p> -->
 
 
 <p><strong>Expert Parallelism (EP)</strong> specifically targets the challenge of training Mixture of Experts (MoE) models by sharding specialized "experts" across GPUs and dynamically routing tokens to relevant experts during computation. The key communication pattern in EP is the all-to-all operation needed to route tokens to their assigned experts and gather the results back. While this introduces some communication overhead, it enables scaling model capacity significantly since each token only needs to compute through a fraction of the total parameters. This partitioning of experts across GPUs becomes essential when working with models that have a large number of experts, like DeepSeek which uses 256 experts.</p>
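The CP paragraph above can be made concrete with a small single-process sketch: queries, keys, and values are split along the sequence dimension, each shard's queries still attend over keys/values from every shard, and concatenating the per-shard outputs reproduces full-sequence attention. Plain PyTorch, illustrative only; the explicit concatenation of K/V stands in for the ring exchange that real implementations overlap with compute.

import torch

def attention(q, k, v):
    scores = (q @ k.T) / (q.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

torch.manual_seed(0)
seq, d, cp = 8, 16, 4                     # sequence length, head dim, CP degree

q, k, v = (torch.randn(seq, d) for _ in range(3))
reference = attention(q, k, v)            # full-sequence attention

q_shards = q.chunk(cp, dim=0)             # activations sharded along the sequence dim
k_shards = k.chunk(cp, dim=0)
v_shards = v.chunk(cp, dim=0)

outputs = []
for rank in range(cp):
    # each rank owns a slice of queries but needs K/V from every rank;
    # ring attention passes the K/V blocks around instead of gathering them at once
    k_full, v_full = torch.cat(k_shards), torch.cat(v_shards)
    outputs.append(attention(q_shards[rank], k_full, v_full))

print(torch.allclose(reference, torch.cat(outputs), atol=1e-6))  # True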
@@ -1533,14 +1536,15 @@
 <li>Expert Parallelism primarly affects the MoE layers (which replace standard MLP blocks), leaving attention and other components unchanged</li>
 </ul>
 
+<div class="l-page"><img alt="EP diagram" src="/assets/images/5d_nutshell_ep.svg" /></div>
+
 <div class="note-box">
-
-
-
-
-
+<p class="note-box-title">📝 Note</p>
+<p class="note-box-content">
+<p>This similarity between EP and DP in terms of input handling is why some implementations consider Expert Parallelism to be a subgroup of Data Parallelism, with the key difference being that EP uses specialized expert routing rather than having all GPUs process inputs through identical model copies.
+</p>
+</div>
 
-<p>TODO: the text between the table and figueres is still a bit sparse.</p>
 
 <table>
 <thead>
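As a companion to the EP paragraph and note above, here is a single-process sketch of top-1 token routing: a router picks an expert per token, each expert processes only the tokens assigned to it (the step that becomes an all-to-all dispatch when experts sit on different GPUs), and the results are scattered back into the original token order. Plain PyTorch; the module names and the top-1 routing are simplifications, not code from this commit.

import torch

torch.manual_seed(0)
tokens, d, n_experts = 10, 16, 4

x = torch.randn(tokens, d)
router = torch.nn.Linear(d, n_experts)
experts = torch.nn.ModuleList(torch.nn.Linear(d, d) for _ in range(n_experts))

with torch.no_grad():                                      # forward-only sketch
    expert_ids = router(x).argmax(dim=-1)                  # top-1 expert per token
    out = torch.empty_like(x)
    for e in range(n_experts):
        idx = (expert_ids == e).nonzero(as_tuple=True)[0]  # tokens routed to expert e
        if idx.numel():                                    # "dispatch": an all-to-all across GPUs
            out[idx] = experts[e](x[idx])                  # each expert runs only its own tokens
    # out now holds per-token results restored to the original order ("combine")

print(out.shape)                                           # torch.Size([10, 16])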
@@ -2578,13 +2582,26 @@
       year={2025},
     }</pre>
   </d-appendix>
+  <script>
+    function toggleTOC() {
+      const content = document.querySelector('.toc-content');
+      const icon = document.querySelector('.toggle-icon');
+
+      content.classList.toggle('collapsed');
+      icon.classList.toggle('collapsed');
+    }
+  </script>
 
   <script>
     const article = document.querySelector('d-article');
     const toc = document.querySelector('d-contents');
     if (toc) {
       const headings = article.querySelectorAll('h2, h3, h4');
-      let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+      // let ToC = `<nav role="navigation" class="l-text figcaption"><h3>Table of contents</h3>`;
+      let ToC = `<nav role="navigation" class="l-text figcaption"><div class="toc-header" onclick="toggleTOC()">
+          <span class="toc-title">Table of Contents</span>
+          <span class="toggle-icon">▼</span>
+      </div><div class="toc-content">`;
       let prevLevel = 0;
 
       for (const el of headings) {
@@ -2606,7 +2623,7 @@
         }
         if (level === 0)
           ToC += '<div>' + link + '</div>';
-        else
+        else if (level === 1)
           ToC += '<li>' + link + '</li>';
       }
 
@@ -2614,10 +2631,10 @@
         ToC += '</ul>'
         prevLevel--;
       }
-      ToC += '</nav>';
+      ToC += '</div></nav>';
       toc.innerHTML = ToC;
       toc.setAttribute('prerendered', 'true');
-      const toc_links = document.querySelectorAll('d-contents > nav a');
+      const toc_links = document.querySelectorAll('d-contents > nav div a');
 
       window.addEventListener('scroll', (_event) => {
         if (typeof (headings) != 'undefined' && headings != null && typeof (toc_links) != 'undefined' && toc_links != null) {
dist/style.css
CHANGED
@@ -150,6 +150,7 @@ d-contents > nav a.active {
 @media (max-width: 1199px) {
   d-contents {
     display: none;
+    background: white;
     justify-self: start;
     align-self: start;
     padding-bottom: 0.5em;
@@ -160,7 +161,7 @@ d-contents > nav a.active {
     border-bottom-style: solid;
     border-bottom-color: rgba(0, 0, 0, 0.1);
     overflow-y: scroll;
-    height: calc(100vh -
+    height: calc(100vh - 40px);
     scrollbar-width: none;
     z-index: -100;
   }
@@ -170,6 +171,31 @@ d-contents a:hover {
   border-bottom: none;
 }
 
+toc-title {
+  font-weight: bold;
+  font-size: 1.2em;
+  color: #333;
+}
+
+toggle-icon {
+  transition: transform 0.3s;
+}
+
+toggle-icon.collapsed {
+  transform: rotate(-90deg);
+}
+
+.toc-content {
+  margin-top: 15px;
+  overflow: hidden;
+  max-height: 1000px;
+  transition: max-height 0.3s ease-out;
+}
+
+.toc-content.collapsed {
+  max-height: 0;
+  margin-top: 0;
+}
 
 @media (min-width: 1200px) {
   d-article {
@@ -179,6 +205,7 @@ d-contents a:hover {
 
   d-contents {
     align-self: start;
+    background: white;
     grid-column-start: 1 !important;
     grid-column-end: 4 !important;
     grid-row: auto / span 6;
@@ -186,16 +213,17 @@ d-contents a:hover {
     margin-top: 0em;
     padding-right: 3em;
     padding-left: 2em;
-    border-right: 1px solid rgba(0, 0, 0, 0.1);
+    /* border-right: 1px solid rgba(0, 0, 0, 0.1);
     border-right-width: 1px;
     border-right-style: solid;
-    border-right-color: rgba(0, 0, 0, 0.1);
+    border-right-color: rgba(0, 0, 0, 0.1); */
     position: -webkit-sticky; /* For Safari */
     position: sticky;
     top: 10px; /* Adjust this value if needed */
-    overflow-y:
-    height: calc(100vh -
+    overflow-y: auto;
+    height: calc(100vh - 40px);
     scrollbar-width: none;
+    transition: max-height 0.3s ease-out;
     z-index: -100;
   }
 }
@@ -205,7 +233,7 @@ d-contents nav h3 {
   margin-bottom: 1em;
 }
 
-d-contents nav div {
+d-contents nav div div {
   color: rgba(0, 0, 0, 0.8);
   font-weight: bold;
 }
src/index.html
CHANGED
Same diff as dist/index.html above (the src/ and dist/ copies receive identical changes).
src/style.css
CHANGED
Same diff as dist/style.css above (the src/ and dist/ copies receive identical changes).