# Mixture of Experts

This repo contains overrides and configs for training sparse Mixture of Experts (MoE) models with T5X. The existing setups and examples all use [Flaxformer](https://github.com/google/flaxformer).

## Training standard MoE architectures

If you are looking to train a T5X variant of a popular Mesh TensorFlow MoE model (e.g. [Switch Transformer](https://arxiv.org/abs/2101.03961) or [Sparsely-Gated Mixture-of-Experts](https://arxiv.org/abs/1701.06538)), or to adapt an existing MoE model, the easiest way to get started is to plug one of the [(Flaxformer) model gin configs](https://github.com/google/flaxformer/tree/main/flaxformer/t5x/configs/moe/models) into the [T5X Quickstart guide](https://github.com/google-research/t5x).

To customize the default MoE models, you can override aspects of the underlying [(Flaxformer) architecture gin config](https://github.com/google/flaxformer/blob/main/flaxformer/t5x/configs/moe/architectures/moe.gin).

## Using MoE in your existing model

Alternatively, if you already have your own T5X/Flaxformer model architecture and wish to add MoE layers, you can directly use the [Flaxformer MoeLayer](https://github.com/google/flaxformer/blob/b725bd2a51d70e866d819c92de166fbf24425e6a/flaxformer/architectures/moe/moe_layers.py#L67). Currently, the MoeLayer is constrained to use [Flaxformer MlpBlock(s)](https://github.com/google/flaxformer/blob/b725bd2a51d70e866d819c92de166fbf24425e6a/flaxformer/components/dense.py#L185) as experts; a rough construction sketch is included at the end of this README.

As a point of reference: MoeLayer(s) are integrated with the Flaxformer T5 architecture through the [SparseEncoder](https://github.com/google/flaxformer/blob/b725bd2a51d70e866d819c92de166fbf24425e6a/flaxformer/architectures/moe/moe_architecture.py#L36) and [SparseDecoder](https://github.com/google/flaxformer/blob/b725bd2a51d70e866d819c92de166fbf24425e6a/flaxformer/architectures/moe/moe_architecture.py#L162). These classes allow us to interleave sparse MoE and dense MLP blocks through the `sparse_layout` attribute.

## Expert routing mechanisms

A number of routing mechanisms are supported:

*   Switch routing (top-1 "tokens choose" routing), based on the [Switch Transformer](https://arxiv.org/abs/2101.03961)
*   General top-k "tokens choose" routing of the form used in [Sparsely-Gated Mixture-of-Experts](https://arxiv.org/abs/1701.06538), [Vision MoE](https://arxiv.org/abs/2106.05974), [Designing Effective Sparse Expert Models](https://arxiv.org/abs/2202.08906), and many other MoE works
*   "Experts choose" routing, introduced in [Mixture-of-Experts with Expert Choice Routing](https://arxiv.org/abs/2202.09368)

See the [Flaxformer router codebase](https://github.com/google/flaxformer/blob/main/flaxformer/architectures/moe/routing.py) for details.
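
## Example: wiring a MoeLayer by hand

For reference, the sketch below shows roughly how a `MoeLayer` might be wired together with an `MlpBlock` expert and a top-1 "tokens choose" (Switch-style) router. This is a minimal, illustrative sketch, not a verified recipe: the attribute names and values are assumptions based on the linked `moe_layers.py`, `routing.py`, and `dense.py` modules and may differ in your Flaxformer checkout, so confirm them against the source.

```python
# Minimal sketch (assumed attribute names -- check moe_layers.py, routing.py and
# dense.py in your Flaxformer checkout for the exact signatures and defaults).
import jax.numpy as jnp

from flaxformer.architectures.moe import moe_layers
from flaxformer.architectures.moe import routing
from flaxformer.components import dense

# Each expert is a standard Flaxformer MlpBlock.
expert = dense.MlpBlock(
    use_bias=False,
    intermediate_dim=2048,
    activations=('gelu', 'linear'),
    intermediate_dropout_rate=0.1,
    final_dropout_rate=0.1,
    dtype=jnp.bfloat16,
)

# Top-1 "tokens choose" routing, as in the Switch Transformer.
router = routing.TokensChooseMaskedRouter(
    router_weights=routing.RouterWeights(),
    num_selected_experts=1,
    jitter_noise=0.0,
    batch_prioritized_routing=True,
    ignore_padding_tokens=False,
    dtype=jnp.float32,
)

# The sparse MoE layer itself; it stands in wherever a dense MLP block would
# otherwise appear in your architecture.
moe_layer = moe_layers.MoeLayer(
    num_experts=8,
    router=router,
    expert=expert,
    max_group_size=4096,
    train_capacity_factor=1.25,
    eval_capacity_factor=2.0,
    min_expert_capacity=4,
    num_expert_partitions=8,
    num_model_partitions=1,
    dropout_rate=0.1,
    dtype=jnp.bfloat16,
)
```

In the standard setups above you would not construct this by hand: the SparseEncoder/SparseDecoder classes and the referenced gin configs do this wiring for you, with `sparse_layout` controlling which blocks are sparse.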