Crystalcareai committed on
Commit f4651f0
1 Parent(s): bb09a79

Update README.md

Files changed (1)
  1. README.md +3 -3
README.md CHANGED
@@ -6,9 +6,9 @@
 
  # GemMoE: A New MoE Method via Branch Train Mix
 
- I would like to introduce GemMoE, a new Mixture of Experts (MoE) method that utilizes a custom implementation of Meta's Branch Train Mix MoE method. This approach, detailed in the research paper ["Branch Train Mix: A Novel Mixture-of-Experts Training Procedure"](https://arxiv.org/abs/2403.07816), has allowed me to create an efficient model that explores the potential of MoE architectures.
+ I would like to introduce GemMoE, a new Mixture of Experts (MoE) method that utilizes a custom implementation of Meta's Branch Train Mix MoE method. This approach, detailed in the research paper ["Branch Train Mix: A Novel Mixture-of-Experts Training Procedure"](https://arxiv.org/abs/2403.07816), has allowed me to create an efficient model that overcomes the limitations of previous GemMoE models.
 
- GemMoE is a 4x8.5b MoE model, consisting of 4 experts that were trained separately and then combined using a custom fork of axolotl. This fork enabled me to freeze all experts and focus on training the router mechanism. The router was trained on 4 epochs of a self-discovered dataset and 2 epochs of TruthyDPO from Jon Durbin.
+ GemMoE is a 4x8.5b MoE model, consisting of 4 experts that were trained separately and then combined using a custom fork of axolotl. This fork enabled me to freeze all experts and focus on training the router mechanism. The router was trained on 4 epochs of my Self-Discover-MM dataset and 2 epochs of TruthyDPO from Jon Durbin.
 
  One of the main differences between GemMoE and previous versions is the use of tokenization routing instead of semantic routing. This approach, similar to the one used in mixtral, results in improved VRAM usage and competitive performance for its size.
 
@@ -24,7 +24,7 @@ By utilizing this training procedure, GemMoE aims to achieve competitive results
 
  ## Collaboration and Open-Source Development
 
- GemMoE builds upon the work of researchers and developers from various organizations, including Meta, Hugging Face, and the broader open-source community. I would like to thank the researchers behind the Branch Train Mix method, as well as the developers of axolotl and the contributors to the datasets used in training GemMoE.
+ GemMoE builds upon the work of researchers and developers from various organizations, including Meta, Hugging Face, and the broader open-source community. I would like to thank the researchers behind the Branch Train Mix method, as well as the developers of axolotl and Jon Durbin for creating TruthyDPO.
 
  ## Future Development
 
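The two changed paragraphs above describe freezing the four separately trained experts and training only the router, with token-level (Mixtral-style) routing rather than semantic routing. Below is a minimal sketch of what such a layer can look like. The module name `TokenRoutedMoE`, the top-2 choice, and the GELU FFN experts are illustrative assumptions, not the actual GemMoE or axolotl implementation.

```python
# Minimal, illustrative sketch of token-level (Mixtral-style) top-k routing
# with frozen experts and a trainable router. Names and shapes are assumed;
# this is NOT the actual GemMoE / axolotl code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TokenRoutedMoE(nn.Module):
    def __init__(self, hidden_size: int, ffn_size: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router (gate): the only trainable part in this sketch.
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        # Experts: separately trained FFNs, frozen so only the router learns.
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(hidden_size, ffn_size),
                nn.GELU(),
                nn.Linear(ffn_size, hidden_size),
            )
            for _ in range(num_experts)
        ])
        for p in self.experts.parameters():
            p.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden) -> each token is routed independently.
        tokens = x.reshape(-1, x.shape[-1])
        logits = self.router(tokens)                         # (num_tokens, num_experts)
        weights, idx = torch.topk(logits, self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)                 # renormalize over the selected experts
        out = torch.zeros_like(tokens)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(tokens[mask])
        return out.reshape_as(x)
```

Because every expert parameter has `requires_grad=False`, an optimizer built from `filter(lambda p: p.requires_grad, model.parameters())` updates only the router weights, which corresponds to the router-only training step the README describes.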