Question about MoA

#2
by TechxGenus - opened

Congratulations on this amazing work! I noticed that, unlike Mixtral/DeepseekMoE/QwenMoE, this model also adds multiple experts to the attention layers. How does this affect the results?
