MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts
Paper or resources for more information: [Paper] [Code]
Overview
We introduce three types of zero-computation experts: the zero expert, copy expert, and constant expert, which correspond to discard, skip, and replace operations, respectively. Moreover, we leverage gating residuals, enabling each token to consider the pathway taken in the previous layer when selecting the appropriate experts.
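As a rough illustration of the three operations, the sketch below models each zero-computation expert as a plain Python function on a token vector. The function names, the `combine` helper, and the example gate weights are illustrative assumptions, not the released implementation. In particular, the constant expert here simply returns a fixed vector, which is the simplest reading of "replace" and may differ from the paper's exact formulation.

```python
from typing import Callable, Dict, List

Vector = List[float]


def zero_expert(x: Vector) -> Vector:
    """Discard: the token contributes nothing through this expert."""
    return [0.0] * len(x)


def copy_expert(x: Vector) -> Vector:
    """Skip: the token passes through unchanged."""
    return list(x)


def make_constant_expert(v: Vector) -> Callable[[Vector], Vector]:
    """Replace: the output is a fixed vector v (trainable in a real model)."""
    def constant_expert(x: Vector) -> Vector:
        if len(x) != len(v):
            raise ValueError("token and constant vector must have the same size")
        return list(v)
    return constant_expert


def combine(x: Vector,
            experts: Dict[str, Callable[[Vector], Vector]],
            gates: Dict[str, float]) -> Vector:
    """Gate-weighted sum of the selected experts' outputs (hypothetical helper)."""
    out = [0.0] * len(x)
    for name, gate in gates.items():
        y = experts[name](x)
        out = [o + gate * yi for o, yi in zip(out, y)]
    return out


experts = {
    "zero": zero_expert,
    "copy": copy_expert,
    "const": make_constant_expert([1.0, 1.0, 1.0]),  # assumed constant vector
}
token = [2.0, -4.0, 6.0]

# Route the token to the copy and constant experts with equal (assumed) gate weights.
mixed = combine(token, experts, {"copy": 0.5, "const": 0.5})
print(mixed)  # [1.5, -1.5, 3.5]
```

None of the three functions performs a matrix multiplication, which is why these experts add almost no compute or parameters compared with an FFN expert.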
Download URL
HuggingFace Model | Download |
---|---|
MoE++7B-Base | MoE++7B-Base |
MoE++7B-Chat | Coming Soon |
Highlights
Low Computing Overhead
The computational complexity of an MoE++ model is always lower than that of an MoE model with the same number of parameters.
High Performance & High Throughput
Extensive experiments show that MoE++ achieves better performance while delivering 1.1x to 2.1x the expert forward throughput of a vanilla MoE model of the same size, laying a solid foundation for developing advanced and efficient MoE-related models.
Deployment Friendly
Given that zero-computation experts have negligible parameters, we can deploy all zero-computation experts on each GPU, eliminating the significant communication overhead and expert load imbalance associated with FFN experts distributed across different GPUs.
Why is MoE++ better than MoE?
Flexible Computation Allocation
MoE++ allows simple tokens to utilize fewer FFN experts, freeing up more FFN experts to focus on challenging tokens. This results in both Reduced Computation and Enhanced Performance.
- Verbs tend to activate a large number of FFN experts. For example, the verb "touch" activates an average of 1.77 FFN experts across all layers, approaching the upper limit of 2. This likely occurs because verbs often convey rich semantic information and frequently interact with nouns to form more complex semantic structures.
- Nouns typically activate a moderate number of FFN experts, with most nouns averaging between 1.5 and 1.7 FFN expert activations.
- Simple tokens with little semantic content tend to activate a small number of FFN experts. For example, word fragments such as "pper" and "ather" usually activate fewer than 1.5 FFN experts.
These findings confirm that MoE++ allows simple tokens to utilize fewer FFN experts, freeing up more FFN experts to focus on challenging tokens.
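To make the savings concrete, the following back-of-the-envelope sketch converts the average FFN expert counts quoted above into a fraction of the FFN compute that a vanilla top-2 MoE would spend. It assumes that a vanilla MoE always runs exactly 2 FFN experts per token, that all FFN experts cost the same, and that zero-computation experts and the router cost nothing. `relative_ffn_compute` is a hypothetical helper name.

```python
TOP_K = 2  # upper limit on FFN experts per token per layer, as stated above


def relative_ffn_compute(avg_ffn_experts: float, top_k: int = TOP_K) -> float:
    """Fraction of vanilla top-k FFN compute used by a token that activates
    `avg_ffn_experts` FFN experts on average (simplified cost model)."""
    if not 0.0 <= avg_ffn_experts <= top_k:
        raise ValueError("average must lie between 0 and top_k")
    return avg_ffn_experts / top_k


# Averages taken from the observations above; the noun value is the midpoint
# of the quoted 1.5-1.7 range, and the fragment value is its quoted upper bound.
examples = {
    'verb "touch"': 1.77,
    "typical noun (midpoint)": 1.6,
    "word fragment (upper bound)": 1.5,
}

for label, avg in examples.items():
    share = relative_ffn_compute(avg)
    print(f"{label}: {share:.1%} of vanilla top-{TOP_K} FFN compute")
```

Under these assumptions, even the most demanding token type stays below the vanilla cost, and simpler tokens fall further below it.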
Stable Routing
Gating residuals effectively establish connections between different MoE++ layers and reduce the variance of routing scores. Meanwhile, the gating residuals do not change the mean and range of values of the routing scores. Consequently, gating residuals contribute to the stable routing of heterogeneous expert architectures in MoE++.
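A minimal sketch of how a gating residual could link consecutive layers is given below. It assumes an additive form in which the previous layer's routing scores pass through a matrix and are added to the current layer's scores. The names `routing_scores` and `w_residual`, the additive form itself, and all the numbers are assumptions for illustration; the released model may differ.

```python
from typing import List, Optional

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def routing_scores(x: Vector,
                   w_router: Matrix,
                   prev_scores: Optional[Vector] = None,
                   w_residual: Optional[Matrix] = None) -> Vector:
    """Router scores for one layer, optionally adding a gating residual
    computed from the previous layer's scores (assumed additive form)."""
    scores = matvec(w_router, x)
    if prev_scores is not None and w_residual is not None:
        residual = matvec(w_residual, prev_scores)
        scores = [s + r for s, r in zip(scores, residual)]
    return scores


def top_k(scores: Vector, k: int) -> List[int]:
    """Indices of the k highest-scoring experts."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]


# Toy setup: 3 experts, 2-dimensional token representations.
x1, x2 = [1.0, 2.0], [1.0, 1.0]            # token at layer 1 and layer 2
w1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # layer-1 router weights
w2 = [[3.0, 0.0], [0.0, 1.0], [0.5, 0.0]]  # layer-2 router weights
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

s1 = routing_scores(x1, w1)                   # [1.0, 2.0, 3.0]
s2_plain = routing_scores(x2, w2)             # [3.0, 1.0, 0.5]
s2_res = routing_scores(x2, w2, s1, identity) # [4.0, 3.0, 3.5]

print(top_k(s2_plain, 2))  # [0, 1]
print(top_k(s2_res, 2))    # [0, 2]
```

In this toy case the residual carries layer 1's strong preference for expert 2 into layer 2, changing the second selected expert from 1 to 2.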
API for Base Model Inference
To load the model from the Hugging Face model hub or from a local directory, you can use the following code snippets.
Base Model Inference
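from transformers import AutoModelForCausalLM, AutoTokenizer
question = "Hello!"
# To load from a local directory, replace the model ID with the local path.
# trust_remote_code=True allows loading the custom model code shipped with the checkpoint.
model = AutoModelForCausalLM.from_pretrained("Chat-UniVi/MoE-Plus-Plus-7B", trust_remote_code=True, device_map='auto')
tokenizer = AutoTokenizer.from_pretrained("Chat-UniVi/MoE-Plus-Plus-7B", trust_remote_code=True)
# Tokenize the prompt and move the tensors to the model's device.
inputs = tokenizer(question, return_tensors='pt').to(model.device)
# max_length=128 caps the total sequence length, prompt included.
response = model.generate(inputs.input_ids, max_length=128)
print(tokenizer.decode(response.cpu()[0], skip_special_tokens=True))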
Chat Model Inference
Coming soon...
Training & Validating
- The training code is built on Skywork-MoE. We cannot open-source the MoE++ training code on its own unless Skywork-MoE is open-sourced; we will release it once approval is complete.
- The evaluation is performed on multiple key benchmarks using the Eleuther AI Language Model Evaluation Harness.
# For example, test MoE++ on winogrande
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 accelerate launch \
--main_process_port 2004 -m lm_eval --model hf \
--model_args pretrained=MoE-Plus-Plus-7B \
--tasks winogrande \
--batch_size 1 \
--output_path Results/winogrande
Acknowledgement
- Skywork-MoE: the codebase we built upon, and an advanced MoE language model.
Related Projects
- MoH: a promising alternative to multi-head attention that provides a strong foundation for developing advanced and efficient attention-based models.
- Chat-UniVi (CVPR 2024 Highlight): an efficient large language and video assistant. The framework exhibits remarkable interactive capabilities across images and videos.
License
- The majority of this project is released under the Apache 2.0 license as found in the LICENSE file.
- The service is a research preview intended for non-commercial use only, subject to the model License of LLaMA, Terms of Use of the data generated by OpenAI, and Privacy Practices of ShareGPT. Please contact us if you find any potential violations.
Citation
If you find this paper useful, please consider starring this repo and citing our paper:
@article{jin2024moe,
title={MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts},
author={Jin, Peng and Zhu, Bo and Yuan, Li and Yan, Shuicheng},
journal={arXiv preprint arXiv:2410.07348},
year={2024}
}