Commit 1d7bb53 · Parent(s): 6edf7e0
- dist/index.html +4 -2
- src/index.html +4 -2
dist/index.html CHANGED

@@ -1077,7 +1077,7 @@
 <tbody>
 <tr>
 <td>Embedding Layer (Row Linear sharded on vocab)</td>
-<td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s:
+<td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
 <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
 </tr>
 </tbody>
@@ -1436,12 +1436,14 @@
 <h2>Expert parallelism</h2>
 <p>One more <s>thing</s> parallelism.</p>

+<p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
 <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>

 <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
 <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>

-<p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert
+<p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>

 <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>

src/index.html CHANGED

@@ -1077,7 +1077,7 @@
 <tbody>
 <tr>
 <td>Embedding Layer (Row Linear sharded on vocab)</td>
-<td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s:
+<td>h: full (weight_out is full + <strong>all-reduce</strong> for correctness)<br>s: full</td>
 <td>h: full (weight_out is full + <strong>reduce-scatter</strong> for correctness)<br>s: <strong>reduce-scatter</strong> to sharded</td>
 </tr>
 </tbody>
@@ -1436,12 +1436,14 @@
 <h2>Expert parallelism</h2>
 <p>One more <s>thing</s> parallelism.</p>

+<p>Before diving into Expert Parallelism, we recommend reading about the Mixture-of-Experts (MoE) architecture in <a href="https://huggingface.co/blog/moe">this blog post</a> to better understand the concepts.</p>
+
 <p>Mixture-of-expert models have gained some traction with models such as Mixtral<d-cite bibtex-key="jiang2024mixtralexperts"></d-cite> or more recently DeepSeek-V3/R1! The basic idea is that instead of having a single feedforward module per layer we can have several and route tokens through different ones depending on their context:</p>

 <p><img alt="ep_schema.png" src="/assets/images/ep_schema.png" /></p>
 <p>Source: A Survey on Mixture of Experts<d-cite bibtex-key="cai2024surveymixtureexperts"></d-cite> </p>

-<p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert
+<p>This design makes it very easy to add a new parallelism paradigm: Expert parallelism (EP). Since the feedforward layers are fully independent we can simply put each expert's feedforward layer on a different worker. Compared to TP it's much more lightweight, since we don't need to split the matrix multiplication, we just need to route the hidden states of a token to the right expert. There are several tricks to make EP work in practice, closely tied to model design. For instance, DeepSeek-V3 enforces a constraint in the router, ensuring that each token is sent to at most M nodes (in their case, 4) to reduce communication overhead.</p>

 <p>While Expert parallelism has been around for a while<d-cite bibtex-key="lepikhin2020gshardscalinggiantmodels"></d-cite> it is just now gaining new traction with the MoE architecture gaining more traction. </p>

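The paragraph added at line 1446 of both files describes EP as placing each expert's feedforward layer on a different worker and routing each token's hidden states to the right expert. As a companion illustration only, here is a minimal, hypothetical PyTorch-style sketch of that dispatch/combine pattern using torch.distributed.all_to_all; every name (expert_parallel_ffn, router, local_expert) and the one-expert-per-rank, top-1-routing setup are assumptions made for the sketch, not code from this commit or the article.

# Hypothetical sketch of expert-parallel routing (not taken from this commit).
# Assumes torch.distributed is already initialized, one expert FFN per EP rank,
# top-1 routing, and `router` mapping [num_tokens, hidden] -> [num_tokens, ep_size] logits.
import torch
import torch.distributed as dist

def expert_parallel_ffn(hidden_states, router, local_expert, ep_group=None):
    """Dispatch each token to the rank hosting its expert, run the local expert,
    then return the outputs to the tokens' original ranks and order."""
    ep_size = dist.get_world_size(ep_group)
    device, dtype = hidden_states.device, hidden_states.dtype
    hidden = hidden_states.size(-1)

    # 1) Top-1 routing: one destination expert (= one EP rank) per token.
    expert_ids = router(hidden_states).argmax(dim=-1)            # [num_tokens]

    # 2) Sort tokens by destination so each rank receives a contiguous chunk.
    order = torch.argsort(expert_ids)
    send_counts = torch.bincount(expert_ids, minlength=ep_size).tolist()
    sorted_tokens = hidden_states[order]

    # 3) Exchange per-rank token counts, then the tokens themselves (all-to-all dispatch).
    recv_count_buf = [torch.zeros(1, dtype=torch.long, device=device) for _ in range(ep_size)]
    dist.all_to_all(recv_count_buf,
                    [torch.tensor([c], device=device) for c in send_counts], group=ep_group)
    recv_counts = [int(c) for c in recv_count_buf]
    recv_tokens = [torch.empty(c, hidden, dtype=dtype, device=device) for c in recv_counts]
    dist.all_to_all(recv_tokens, list(sorted_tokens.split(send_counts)), group=ep_group)

    # 4) Each rank runs only its own expert on the tokens it received.
    expert_out = local_expert(torch.cat(recv_tokens))

    # 5) All-to-all combine, then undo the sort to restore the original token order.
    out_chunks = [torch.empty(c, hidden, dtype=dtype, device=device) for c in send_counts]
    dist.all_to_all(out_chunks, list(expert_out.split(recv_counts)), group=ep_group)
    output = torch.empty_like(hidden_states)
    output[order] = torch.cat(out_chunks)
    return output

In practice a real MoE layer would use top-k routing, capacity limits, and router constraints like the one the added paragraph mentions for DeepSeek-V3 (each token sent to at most M nodes) to bound the cost of these all-to-all exchanges.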