Union of Experts: Adapting Hierarchical Routing to Equivalently Decomposed Transformer
Abstract
Mixture-of-Experts (MoE) enhances model performance while maintaining computational efficiency, making it well suited for large-scale applications. However, each expert in the existing MoE paradigm works as an individual, so high-quality expert interactions are lacking. Moreover, MoE has not been effectively extended to attention blocks, which constrains further efficiency improvements. To tackle these issues, we propose Union-of-Experts (UoE), which decomposes the transformer into an equivalent group of experts and then implements dynamic routing over input data and experts. Our approach advances MoE design with four key innovations: (1) We perform equivalent expert decomposition on both MLP blocks and attention blocks based on matrix partitioning in tensor parallelism. (2) We develop two routing paradigms, patch-wise data selection and expert selection, to apply routing across different levels. (3) We design the architecture of the UoE model, including Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). (4) We develop a parallel implementation of UoE's routing and computation operations, and optimize efficiency based on hardware processing analysis. Experiments demonstrate that the model employing UoE surpasses Full Attention, state-of-the-art MoEs, and efficient transformers on several tasks across image and natural language domains. The source code is available at https://github.com/YujiaoYang-work/UoE.
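To make the "equivalent decomposition" claim concrete, the sketch below (not the authors' implementation; the class name, GELU activation, and sizes are illustrative assumptions) splits an MLP's first weight matrix column-wise and its second row-wise, as in tensor parallelism, and checks that summing all expert outputs reproduces the undecomposed MLP.

```python
# A minimal sketch (not the authors' code) of equivalence-preserving MLP
# decomposition via tensor-parallel-style matrix partitioning: W1 is split
# column-wise and W2 row-wise, so summing all expert outputs reproduces the
# undecomposed MLP exactly.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DecomposedMLP(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int):
        super().__init__()
        assert d_hidden % n_experts == 0
        d_slice = d_hidden // n_experts
        # Each "expert" holds one column slice of W1 and one row slice of W2.
        self.up = nn.ModuleList(nn.Linear(d_model, d_slice) for _ in range(n_experts))
        self.down = nn.ModuleList(nn.Linear(d_slice, d_model, bias=False) for _ in range(n_experts))
        self.act = nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # With every expert active, the sum equals the full MLP's output.
        return sum(down(self.act(up(x))) for up, down in zip(self.up, self.down))


# Equivalence check against a monolithic MLP rebuilt from the same slices.
torch.manual_seed(0)
mlp = DecomposedMLP(d_model=8, d_hidden=32, n_experts=4)
x = torch.randn(2, 8)
w1 = torch.cat([e.weight for e in mlp.up], dim=0)    # (d_hidden, d_model)
b1 = torch.cat([e.bias for e in mlp.up], dim=0)      # (d_hidden,)
w2 = torch.cat([e.weight for e in mlp.down], dim=1)  # (d_model, d_hidden)
full = F.gelu(x @ w1.T + b1) @ w2.T
assert torch.allclose(mlp(x), full, atol=1e-5)
```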
Community
This paper proposes a new method: Union-of-Experts (UoE). Compared with existing MoE methods, it builds expert groups by equivalently decomposing a whole model rather than combining multiple individual models. This allows the experts to operate as a larger whole instead of a mixture of individuals, fully leveraging the model's scale effect.
The architecture of UoE model includes Selective Multi-Head Attention (SMHA) and Union-of-MLP-Experts (UoME). The SMHA shares some similarities with the NSA introduced by DeepSeek half a month ago and the MoBA from Moonshot.AI, even though it was independently developed by the author over the course of a year.
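As a rough illustration of the patch-wise data selection idea behind SMHA, the sketch below lets a learned router score patches and restricts attention to the top-k of them; the class name, scoring function, and selection granularity are assumptions and may differ from the paper's SMHA.

```python
# A rough, assumption-laden sketch of patch-wise key/value selection before
# multi-head attention (the paper's SMHA may differ in granularity and scoring).
import torch
import torch.nn as nn


class SelectiveMHA(nn.Module):
    def __init__(self, d_model: int, n_heads: int, top_k: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.router = nn.Linear(d_model, 1)  # scores each patch/token
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        scores = self.router(x).squeeze(-1)                  # (batch, seq_len)
        k = min(self.top_k, x.size(1))
        idx = scores.topk(k, dim=-1).indices                 # (batch, k)
        kv = torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.size(-1)))
        out, _ = self.attn(x, kv, kv)                        # queries attend only to selected patches
        return out


# Example: each query attends to 4 selected patches out of 16.
x = torch.randn(2, 16, 64)
print(SelectiveMHA(d_model=64, n_heads=4, top_k=4)(x).shape)  # torch.Size([2, 16, 64])
```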
On the other hand, UoME is a novel architecture that not only inherits the multi-expert and selective-routing paradigms of existing MoE models but also enables the activated experts to function as a cohesive whole, similar to an MLP of the same scale.
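A minimal sketch of how expert selection could sit on top of such a decomposition: a router picks the top-k weight slices per sequence and their outputs are summed, so the activated slices behave like one narrower MLP rather than a mixture of separate models. The class name, pooled routing, and gating details are illustrative assumptions, not taken from the paper.

```python
# Illustrative sketch (not the authors' code): top-k expert selection over an
# equivalently decomposed MLP. The selected slices are summed, so the active
# experts act like a single narrower MLP rather than independent sub-models.
import torch
import torch.nn as nn


class UnionOfMLPExperts(nn.Module):
    def __init__(self, d_model: int, d_hidden: int, n_experts: int, top_k: int):
        super().__init__()
        d_slice = d_hidden // n_experts
        self.up = nn.ModuleList(nn.Linear(d_model, d_slice) for _ in range(n_experts))
        self.down = nn.ModuleList(nn.Linear(d_slice, d_model, bias=False) for _ in range(n_experts))
        self.router = nn.Linear(d_model, n_experts)
        self.act = nn.GELU()
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Route per sequence (mean-pooled) for simplicity; per-token routing also works.
        gate = self.router(x.mean(dim=1)).topk(self.top_k, dim=-1).indices  # (batch, top_k)
        outs = []
        for b in range(x.size(0)):
            y = sum(self.down[e](self.act(self.up[e](x[b]))) for e in gate[b].tolist())
            outs.append(y)
        return torch.stack(outs, dim=0)


x = torch.randn(2, 16, 64)
print(UnionOfMLPExperts(d_model=64, d_hidden=256, n_experts=8, top_k=2)(x).shape)
```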
The benefits of applying equivalent decomposition and routing to a complete Transformer model are quite evident. The experiments demonstrate that the UoE model surpasses Full Attention, state-of-the-art MoEs, and efficient transformers (including the architecture of the recently proposed DeepSeek-V3) on several tasks across image and natural language domains.
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API:
- Each Rank Could be an Expert: Single-Ranked Mixture of Experts LoRA for Multi-Task Learning (2025)
- Powerful Design of Small Vision Transformer on CIFAR10 (2025)
- Shared DIFF Transformer (2025)
- DeRS: Towards Extremely Efficient Upcycled Mixture-of-Experts Models (2025)
- UniMoD: Efficient Unified Multimodal Transformers with Mixture-of-Depths (2025)
- Tensor Product Attention Is All You Need (2025)
- Transforming Vision Transformer: Towards Efficient Multi-Task Asynchronous Learning (2025)