Mixture of Experts
This repo contains overrides and configs for training sparse Mixture of Experts (MoE) models with T5X. The existing setups and examples all use Flaxformer.
Training standard MoE architectures
If you are looking to train a T5X variant of a popular Mesh Tensorflow MoE model (e.g. Switch Transformer or Sparsely-Gated Mixture-of-Experts), or to adapt existing MoE models, then the easiest way to get started is to plug one of the (Flaxformer) model gin configs into the T5X Quickstart guide. To customize the default MoE models, you can override aspects of the underlying (Flaxformer) architecture gin config.
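As a rough sketch of what such an override can look like, the snippet below uses gin-config's Python API to load a model config and override a couple of bindings. The config path and binding names (`moe_model.gin`, `NUM_EXPERTS`, `MoeLayer.train_capacity_factor`) are hypothetical placeholders, not references to actual files in this repo; consult the shipped gin configs for the real names, and note that in practice overrides are typically supplied through the T5X launch flags or a small custom gin file.

```python
# Minimal sketch of overriding gin bindings programmatically.
# The config path and binding names below are hypothetical placeholders.
import gin

gin.parse_config_files_and_bindings(
    config_files=["path/to/moe_model.gin"],       # placeholder path
    bindings=[
        "NUM_EXPERTS = 32",                       # assumed macro name
        "MoeLayer.train_capacity_factor = 1.25",  # assumed binding
    ],
)
```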
Using MoE in your existing model
Alternatively, if you already have your own T5X/Flaxformer model architecture
and wish to add MoE layers, you can directly use the Flaxformer MoeLayer.
Currently, the MoeLayer is constrained to use Flaxformer MlpBlock(s) as experts.
As a point of reference: MoeLayer(s) are integrated with the Flaxformer T5
architecture through the SparseEncoder and SparseDecoder. These classes allow us
to interleave sparse MoE and dense MLP blocks through the sparse_layout attribute.
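The following is a simplified, self-contained Flax sketch of that interleaving idea, not the actual Flaxformer SparseEncoder: it stacks MLP blocks and swaps in a placeholder MoE block at a fixed interval. The names (`PlaceholderMoeBlock`, `InterleavedStack`, `sparse_every`) are invented for illustration; the real classes delegate the sparse blocks to MoeLayer and control placement via `sparse_layout`.

```python
# Simplified illustration of interleaving sparse (MoE) and dense (MLP) blocks.
# This is NOT the Flaxformer SparseEncoder; it only sketches the alternation.
import flax.linen as nn


class DenseMlpBlock(nn.Module):
  """Stand-in for a dense Flaxformer MlpBlock."""
  hidden_dim: int

  @nn.compact
  def __call__(self, x):
    h = nn.relu(nn.Dense(self.hidden_dim)(x))
    return nn.Dense(x.shape[-1])(h)


class PlaceholderMoeBlock(nn.Module):
  """Stand-in for an MoE layer; a real MoeLayer routes tokens to experts."""
  hidden_dim: int
  num_experts: int = 8  # unused in this placeholder

  @nn.compact
  def __call__(self, x):
    # Placeholder: a single dense MLP where routing + experts would live.
    return DenseMlpBlock(self.hidden_dim)(x)


class InterleavedStack(nn.Module):
  """Alternates dense MLP blocks and MoE blocks (e.g. every 2nd block sparse)."""
  num_blocks: int
  hidden_dim: int
  sparse_every: int = 2  # invented knob; Flaxformer uses `sparse_layout`

  @nn.compact
  def __call__(self, x):
    for i in range(self.num_blocks):
      if (i + 1) % self.sparse_every == 0:
        x = x + PlaceholderMoeBlock(self.hidden_dim)(x)
      else:
        x = x + DenseMlpBlock(self.hidden_dim)(x)
    return x
```

In the real architecture each block also carries attention and layer norms; only the dense/sparse MLP alternation is shown here.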
Expert routing mechanisms
A number of routing mechanisms are supported:
- Switch routing (or top-1 "tokens choose" routing) based on the Switch Transformer
- General top-k "tokens choose" routing of the form used in Sparsely-Gated Mixture-of-Experts, Vision MoE, Designing Effective Sparse Expert Models, and many other MoE works
- "Experts choose" routing introduced in Mixture-of-Experts with Expert Choice Routing
See the Flaxformer router codebase for details.
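To make the distinction concrete, here is a small self-contained JAX sketch (not the Flaxformer router implementation) of the two routing directions over a matrix of router logits: in "tokens choose" routing each token picks its top-k experts, while in "experts choose" routing each expert picks its top-c tokens. The function names are invented for illustration, and real routers additionally handle expert capacity, padding masks, and load-balancing losses.

```python
# Sketch of the two routing directions over router logits of shape
# [num_tokens, num_experts]. Illustration only; not the Flaxformer routers.
import jax
import jax.numpy as jnp


def tokens_choose_topk(router_logits, k=1):
  """Each token selects its top-k experts (k=1 recovers Switch routing)."""
  probs = jax.nn.softmax(router_logits, axis=-1)
  gate_values, expert_ids = jax.lax.top_k(probs, k)  # per-token choices
  return gate_values, expert_ids  # shapes: [num_tokens, k]


def experts_choose_topc(router_logits, capacity=2):
  """Each expert selects the top-`capacity` tokens that scored it highest."""
  probs = jax.nn.softmax(router_logits, axis=-1)
  # Transpose to [num_experts, num_tokens] so top_k runs per expert.
  gate_values, token_ids = jax.lax.top_k(probs.T, capacity)
  return gate_values, token_ids  # shapes: [num_experts, capacity]


# Example: 6 tokens routed among 4 experts.
logits = jax.random.normal(jax.random.PRNGKey(0), (6, 4))
print(tokens_choose_topk(logits, k=2)[1])   # each token's 2 chosen experts
print(experts_choose_topc(logits, 2)[1])    # each expert's 2 chosen tokens
```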