arxiv:2405.17976

Yuan 2.0-M32: Mixture of Experts with Attention Router

Published on May 28 · Submitted by akhaliq on May 29 · #3 Paper of the day
Abstract

Yuan 2.0-M32, with a base architecture similar to that of Yuan-2.0 2B, uses a mixture-of-experts architecture with 32 experts, of which 2 are active. A new router network, the Attention Router, is proposed and adopted for more efficient expert selection, boosting accuracy by 3.8% compared to a model with a classical router network. Yuan 2.0-M32 is trained from scratch on 2000B tokens, and its training computation is only 9.25% of that of a dense model at the same parameter scale. Yuan 2.0-M32 demonstrates competitive capability on coding, math, and various domains of expertise, with only 3.7B active parameters out of 40B in total and 7.4 GFlops of forward computation per token, both of which are only 1/19 of Llama3-70B. Yuan 2.0-M32 surpasses Llama3-70B on the MATH and ARC-Challenge benchmarks, with accuracies of 55.89 and 95.8 respectively. The models and source code of Yuan 2.0-M32 are released on GitHub.
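
The abstract does not spell out the Attention Router's exact formulation, but the core idea of scoring experts with an attention step, rather than a single independent linear gate, can be sketched as below. The module name, projection shapes, and softmax top-2 gating here are illustrative assumptions, not the released implementation.

```python
# Hypothetical sketch of an attention-based top-2 router for a 32-expert MoE layer.
# All names and shapes are assumptions for illustration, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionRouter(nn.Module):
    def __init__(self, d_model: int, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.num_experts = num_experts
        self.top_k = top_k
        # Three per-expert projections of the token state, playing the roles of Q, K, V.
        self.q = nn.Linear(d_model, num_experts, bias=False)
        self.k = nn.Linear(d_model, num_experts, bias=False)
        self.v = nn.Linear(d_model, num_experts, bias=False)

    def forward(self, x: torch.Tensor):
        # x: (batch, d_model) token hidden states.
        q = self.q(x)  # (batch, num_experts)
        k = self.k(x)  # (batch, num_experts)
        v = self.v(x)  # (batch, num_experts)
        # Attention over experts lets the score for one expert depend on the others,
        # unlike a classical linear router that scores each expert independently.
        attn = torch.softmax(q.unsqueeze(-1) * k.unsqueeze(1), dim=-1)  # (batch, E, E)
        scores = torch.einsum("bij,bj->bi", attn, v)                    # (batch, E)
        top_vals, top_idx = scores.topk(self.top_k, dim=-1)
        gates = F.softmax(top_vals, dim=-1)  # mixing weights for the 2 active experts
        return gates, top_idx
```

In this sketch, `gates` would give the mixing weights for the two selected experts and `top_idx` their indices; the classical baseline the abstract compares against would instead be a single linear projection to expert logits followed by top-k selection.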

Community

There's a simple-english rewrite of the paper here - feedback from the authors is welcome! https://www.aimodels.fyi/papers/arxiv/yuan-20-m32-mixture-experts-attention-router

Models citing this paper 6

Datasets citing this paper 0

Spaces citing this paper 1

Collections including this paper 8