---
tags:
- moe
- merge
---

![image/png](https://cdn-uploads.huggingface.co/production/uploads/645cfe4603fc86c46b3e46d1/1Uoxp_Bl9UwF-1K4KzwVN.png)

# ConvexAI/Solutus-3x7B

[Join our Discord!](https://discord.gg/CAfWPV82)

A model to test how MoE will route without square expansion.

If all our tokens are sent to just a few popular experts, training becomes inefficient. In normal MoE training, the gating network converges to mostly activate the same few experts. This self-reinforces, as favored experts are trained more quickly and hence selected more often. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. The following sections will also explore the concept of expert capacity, which introduces a threshold on how many tokens an expert can process. In transformers, the auxiliary loss is exposed via the `aux_loss` parameter.
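As a concrete illustration, the sketch below reads that load-balancing term back from a forward pass. It assumes this repo loads as a standard Mixtral-style MoE in 🤗 Transformers and that the upstream `output_router_logits` / `router_aux_loss_coef` behavior applies; treat it as a minimal sketch, not the exact setup used to build or train this model. During fine-tuning, keeping `output_router_logits=True` is what lets this balancing term be added on top of the language-modeling loss.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumption: this repo loads as a standard Mixtral-style MoE checkpoint.
model_id = "ConvexAI/Solutus-3x7B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")

inputs = tokenizer(
    "Mixture of Experts models route each token to a few experts.",
    return_tensors="pt",
)

# output_router_logits=True makes the model compute the load-balancing
# (auxiliary) loss from the per-layer router logits; with labels set, the
# returned loss is the LM loss plus router_aux_loss_coef * aux_loss.
outputs = model(**inputs, labels=inputs["input_ids"], output_router_logits=True)

print("total loss:", outputs.loss.item())
print("aux_loss  :", outputs.aux_loss.item())  # raw load-balancing term
print("coef      :", model.config.router_aux_loss_coef)
```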

![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)

## "Wait...but you called this a frankenMoE?"

The difference between a conventional MoE and a "frankenMoE" is that the router layer in a model like the one in this repo is not trained simultaneously with the experts.
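To make that distinction concrete, here is a toy, hypothetical sketch (plain PyTorch, made-up sizes, random tensors standing in for prompt hidden states; not the actual recipe behind this model) of a router whose weights are set from representative hidden states instead of being learned jointly with the experts.

```python
import torch

hidden_size, num_experts, top_k = 4096, 3, 2

# Stand-ins for "characteristic" hidden states, one per expert, e.g. taken
# from a base model on prompts describing what each expert is good at.
expert_prompt_hiddens = torch.randn(num_experts, hidden_size)

# The router is just a linear map from token hidden states to expert scores.
gate = torch.nn.Linear(hidden_size, num_experts, bias=False)
with torch.no_grad():
    # frankenMoE-style: the gate is *set*, not trained together with the
    # experts as it would be in a conventional MoE.
    gate.weight.copy_(expert_prompt_hiddens)

# Route one token: score every expert, keep the top-k.
token_hidden = torch.randn(hidden_size)
scores = torch.softmax(gate(token_hidden), dim=-1)
top = torch.topk(scores, k=top_k)
print("routing weights :", scores.tolist())
print("selected experts:", top.indices.tolist())
```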