NeuralNovel committed
Commit 7c604d7 · verified · 1 Parent(s): 7331bf3

Update README.md

  1. README.md +5 -3
README.md CHANGED
@@ -6,8 +6,10 @@ tags:
  - moe
  - merge
  ---
- ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)
- # Uninhibited.
+
+ ![image/png](https://cdn-uploads.huggingface.co/production/uploads/645cfe4603fc86c46b3e46d1/1Uoxp_Bl9UwF-1K4KzwVN.png)
+
+ # ConvexAI/Solutus-3x7B
  [Join our Discord!](https://discord.gg/CAfWPV82)
 
  A model to test how MoE will route without square expansion.
@@ -47,6 +49,6 @@ Although MoEs provide benefits like efficient pretraining and faster inference c
 
  If all our tokens are sent to just a few popular experts, that will make training inefficient. In a normal MoE training, the gating network converges to mostly activate the same few experts. This self-reinforces as favored experts are trained quicker and hence selected more. To mitigate this, an auxiliary loss is added to encourage giving all experts equal importance. This loss ensures that all experts receive a roughly equal number of training examples. The following sections will also explore the concept of expert capacity, which introduces a threshold of how many tokens can be processed by an expert. In transformers, the auxiliary loss is exposed via the aux_loss parameter.
 
-
+ ![image/gif](https://cdn-uploads.huggingface.co/production/uploads/6589d7e6586088fd2784a12c/43v7GezlOGg2BYljbU5ge.gif)
  ## "Wait...but you called this a frankenMoE?"
  The difference between MoE and "frankenMoE" lies in the fact that the router layer in a model like the one on this repo is not trained simultaneously.
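
The load-balancing paragraph in the README describes an auxiliary loss that pushes the router toward giving every expert a similar share of tokens. The README does not say which formula is meant, so the sketch below assumes the Switch-Transformer-style form for top-1 routing: the number of experts times the sum, over experts, of (fraction of tokens sent to that expert) times (mean router probability for that expert). The names `softmax` and `load_balancing_loss` are hypothetical helpers written for this illustration. This is not the code used by transformers or by this model, whose router (per the README) is not trained together with the experts.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def load_balancing_loss(router_logits, num_experts):
    """Assumed Switch-Transformer-style auxiliary loss for top-1 routing.

    router_logits: one list of per-expert logits per token.
    Returns num_experts * sum_i(f_i * P_i), where f_i is the fraction of
    tokens routed to expert i and P_i is the mean router probability
    assigned to expert i across all tokens.
    """
    num_tokens = len(router_logits)
    tokens_per_expert = [0] * num_experts
    prob_sum = [0.0] * num_experts

    for logits in router_logits:
        probs = softmax(logits)
        # Top-1 routing: the token goes to its highest-probability expert.
        chosen = max(range(num_experts), key=probs.__getitem__)
        tokens_per_expert[chosen] += 1
        for i, p in enumerate(probs):
            prob_sum[i] += p

    f = [count / num_tokens for count in tokens_per_expert]
    P = [s / num_tokens for s in prob_sum]
    return num_experts * sum(fi * pi for fi, pi in zip(f, P))


# Three experts, three tokens, each token preferring a different expert.
balanced = load_balancing_loss(
    [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]], num_experts=3
)

# Every token prefers expert 0: the "few popular experts" failure mode.
collapsed = load_balancing_loss(
    [[2.0, 0.0, 0.0], [2.0, 0.0, 0.0], [2.0, 0.0, 0.0]], num_experts=3
)

print(balanced, collapsed)
```

With evenly spread routing the loss sits at roughly 1.0, and it grows when the router collapses onto one expert, which is the imbalance the auxiliary term penalizes during training.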