Mixture of Experts

This repo contains overrides and configs for training sparse Mixture of Experts (MoE) models with T5X. The existing setups and examples all use Flaxformer.

Training standard MoE architectures

If you are looking to train a T5X variant of a popular Mesh TensorFlow MoE model (e.g. Switch Transformer or the Sparsely-Gated Mixture-of-Experts) or to adapt an existing MoE model, the easiest way to get started is to plug one of the (Flaxformer) model gin configs into the T5X Quickstart guide. To customize the default MoE models, you can override aspects of the underlying (Flaxformer) architecture gin config.
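
For concreteness, here is a minimal sketch of loading a model gin config and overriding a few of its settings programmatically with gin; the same kind of overrides can instead be passed to the T5X train script via its --gin_file and --gin. command-line flags. The config path and binding names below are placeholders, not the actual names used in the Flaxformer configs.

```python
# Illustrative only: the config path and binding names are placeholders;
# consult the Flaxformer MoE gin configs for the real ones.
import gin

gin.parse_config_files_and_bindings(
    config_files=["path/to/flaxformer/moe_architecture.gin"],  # placeholder path
    bindings=[
        "NUM_EXPERTS = 16",       # hypothetical macro name
        "NUM_SPARSE_LAYERS = 4",  # hypothetical macro name
    ],
)
```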

Using MoE in your existing model

Alternatively, if you already have your own T5X/Flaxformer model architecture and wish to add MoE layers, you can use the Flaxformer MoeLayer directly. Currently, the MoeLayer is constrained to use Flaxformer MlpBlock(s) as experts. As a point of reference, MoeLayer(s) are integrated into the Flaxformer T5 architecture through the SparseEncoder and SparseDecoder; these classes interleave sparse MoE and dense MLP blocks via the sparse_layout attribute, as sketched below.
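
The following is a rough Flax illustration of that interleaving pattern, not the Flaxformer API itself: all class names are invented, and the toy MoE block uses a dense soft mixture rather than Flaxformer's sparse token routing with capacity limits.

```python
from typing import Sequence

import flax.linen as nn
import jax
import jax.numpy as jnp


class ToyMlp(nn.Module):
  """Stand-in for a dense feed-forward block (not Flaxformer's MlpBlock)."""
  d_ff: int
  d_model: int

  @nn.compact
  def __call__(self, x):
    return nn.Dense(self.d_model)(nn.relu(nn.Dense(self.d_ff)(x)))


class ToyMoe(nn.Module):
  """Toy soft mixture over a few ToyMlp experts.

  The real MoeLayer routes tokens sparsely with per-expert capacity limits;
  this block only shows where an MoE layer slots into the stack.
  """
  num_experts: int
  d_ff: int
  d_model: int

  @nn.compact
  def __call__(self, x):
    gates = nn.softmax(nn.Dense(self.num_experts, use_bias=False)(x))
    expert_outs = jnp.stack(
        [ToyMlp(self.d_ff, self.d_model)(x) for _ in range(self.num_experts)],
        axis=-1)  # [..., d_model, num_experts]
    return jnp.einsum('...de,...e->...d', expert_outs, gates)


class ToySparseStack(nn.Module):
  """Interleaves dense and MoE feed-forward blocks according to a layout."""
  d_ff: int
  d_model: int
  num_experts: int
  # True = MoE block at this depth; loosely mirrors the sparse_layout idea.
  layout: Sequence[bool] = (False, True, False, True)

  @nn.compact
  def __call__(self, x):
    for use_moe in self.layout:
      block = (ToyMoe(self.num_experts, self.d_ff, self.d_model)
               if use_moe else ToyMlp(self.d_ff, self.d_model))
      x = x + block(x)  # residual connection around each feed-forward block
    return x


# Example: initialize variables on a [batch, seq_len, d_model] input.
variables = ToySparseStack(d_ff=64, d_model=16, num_experts=4).init(
    jax.random.PRNGKey(0), jnp.ones((2, 8, 16)))
```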

Expert routing mechanisms

A number of routing mechanisms are supported, including "tokens choose" routing (top-1, as in Switch Transformer, and general top-k) and "experts choose" routing. See the Flaxformer router codebase for details.
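
As a sketch of what one such mechanism computes, here is a standalone top-1 "tokens choose" (Switch-style) routing function in plain JAX. It is not the Flaxformer implementation; the function and argument names are made up, and details such as router jitter noise and load-balancing losses are omitted.

```python
import jax
import jax.numpy as jnp


def top1_tokens_choose_routing(router_logits, expert_capacity):
  """Illustrative Switch-style top-1 routing with expert capacity.

  Args:
    router_logits: [num_groups, tokens_per_group, num_experts] router scores.
    expert_capacity: max number of tokens each expert may process per group.

  Returns:
    dispatch_mask: bool [groups, tokens, experts, capacity]; which buffer slot
      (if any) each token occupies in its chosen expert.
    combine_array: float, same shape; scales expert outputs by router probs.
  """
  probs = jax.nn.softmax(router_logits, axis=-1)
  expert_index = jnp.argmax(probs, axis=-1)       # chosen expert per token
  expert_gate = jnp.max(probs, axis=-1)           # its routing probability

  num_experts = probs.shape[-1]
  expert_mask = jax.nn.one_hot(expert_index, num_experts)        # [G, T, E]
  # 0-based position of each token within its chosen expert's buffer.
  position_in_expert = jnp.cumsum(expert_mask, axis=1) * expert_mask - 1.0
  # Tokens beyond the expert's capacity are dropped (routed nowhere).
  kept = (position_in_expert < expert_capacity) & (expert_mask > 0)

  slot = jax.nn.one_hot(position_in_expert.astype(jnp.int32), expert_capacity)
  dispatch_mask = kept[..., None] & (slot > 0)                   # [G, T, E, C]
  combine_array = dispatch_mask * expert_gate[..., None, None]
  return dispatch_mask, combine_array
```

In this sketch, expert inputs would be gathered with the dispatch mask (e.g. jnp.einsum('gtec,gtd->gecd', dispatch_mask, tokens)) and the expert outputs recombined with the combine array. "Experts choose" routing inverts the perspective: each expert selects its top-capacity tokens, so every expert buffer is exactly filled and no capacity-overflow dropping is needed.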