77.2% on MMLU with 3.7B parameters
... 3.7B active parameters, 40B total parameters
7.4 GFLOPs of forward computation per token, 1/19 that of Llama 3 70B
Exciting enough?
That's Yuan2-M32 for you, released by IEIT-Yuan.
A new 40B Mixture-of-Experts model using a novel Attention Router mechanism (a rough sketch of the idea follows below)
32 experts, with 2 active per token during generation
8,192-token context length
Trained on 2T tokens, using only 9.25% of the compute required by a dense model of equivalent scale
Yuan 2.0-M32 is fine-tuned to handle longer sequence lengths, using a modified base value in the Rotary Position Embedding to maintain performance over extended contexts (see the RoPE sketch below)
Open-source under Apache 2.0
Vocabulary size of 135,040
Outperforms Mixtral 8x7B (47B total parameters, 12.9B active parameters) on all benchmarks and almost gives Llama 3 70B a run for its money
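For intuition, here is a minimal, hypothetical sketch of an attention-style MoE router: expert scores are produced by attending over a set of learnable expert embeddings instead of using a single linear gate, and the top-2 experts are kept per token. The projection names, dimensions, and initialization below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn


class AttentionRouter(nn.Module):
    """Toy attention-style router: scores experts by attending over learnable
    expert embeddings (so correlations between experts influence the gate),
    then keeps the top-k experts per token. Illustrative only."""

    def __init__(self, d_model: int, num_experts: int = 32, top_k: int = 2):
        super().__init__()
        self.expert_emb = nn.Parameter(torch.randn(num_experts, d_model) * 0.02)
        self.q_proj = nn.Linear(d_model, d_model, bias=False)  # query from the token
        self.k_proj = nn.Linear(d_model, d_model, bias=False)  # keys from expert embeddings
        self.v_proj = nn.Linear(d_model, d_model, bias=False)  # values from expert embeddings
        self.top_k = top_k

    def forward(self, hidden: torch.Tensor):
        # hidden: (batch, seq, d_model)
        q = self.q_proj(hidden)                                    # (B, S, D)
        k = self.k_proj(self.expert_emb)                           # (N, D)
        v = self.v_proj(self.expert_emb)                           # (N, D)
        attn = torch.softmax(q @ k.t() / k.shape[-1] ** 0.5, -1)   # (B, S, N)
        logits = (attn @ v) @ self.expert_emb.t()                  # (B, S, N) expert scores
        probs = torch.softmax(logits, dim=-1)
        weights, idx = probs.topk(self.top_k, dim=-1)              # 2 of 32 experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)      # renormalize the selected pair
        return weights, idx


router = AttentionRouter(d_model=2048, num_experts=32, top_k=2)
weights, idx = router(torch.randn(1, 4, 2048))  # which 2 experts each of 4 tokens uses
```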
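And a quick sketch of the RoPE-base idea: a larger base lowers the per-channel rotation frequencies, so distant positions remain distinguishable at longer contexts. The base values below are placeholders; the actual adjusted value used by Yuan 2.0-M32 is given in the paper.

```python
import torch


def rope_angles(seq_len: int, dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Rotation angles used by Rotary Position Embedding, shape (seq_len, dim // 2).

    Raising `base` lowers the rotation frequencies, the usual knob for stretching
    a model to longer contexts."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    positions = torch.arange(seq_len).float()
    return torch.outer(positions, inv_freq)


default_base = rope_angles(seq_len=8192, dim=128, base=10_000.0)
larger_base = rope_angles(seq_len=8192, dim=128, base=1_000_000.0)  # placeholder larger base
```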
Models: https://huggingface.co/IEITYuan
Paper: Yuan 2.0-M32: Mixture of Experts with Attention Router (2405.17976)
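If you want to try it, a rough loading snippet with transformers might look like the one below. The repo id, trust_remote_code flag, and generation settings are assumptions on my part; check the model card under the link above for the exact instructions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "IEITYuan/Yuan2-M32-hf"  # assumed repo id; see https://huggingface.co/IEITYuan

tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",      # load in the checkpoint's native precision
    device_map="auto",       # spread across available GPUs (needs `accelerate`)
    trust_remote_code=True,  # custom MoE architecture ships with the repo
)

prompt = "Write a Python function that reverses a string."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```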