4 experts per token?

opened by JoggyMuffin

Is there a reason for the change from 2 to 4 experts per token? I thought 2 was supposed to be the "sweet spot". Also, if all experts are being used for every token, is it different than an ordinary merge?

Steel Storage org

Setting experts to four just means you now have four "eyes" on every token. In fact, this has been shown to improve output quality on Mixtral 8x7B, although the gains drop off after a certain number of extra experts, and each additional expert increases processing time slightly.
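Concretely, "experts per token" is the top-k of the router: each token is scored against every expert, and only the k best-scoring experts actually run on it. Here is a minimal sketch of Mixtral-style top-k routing, with illustrative names and shapes rather than this model's actual code:

```python
# Minimal sketch of Mixtral-style top-k routing (illustrative, not this
# model's actual code). `top_k` is the "experts per token" knob discussed above.
import torch
import torch.nn.functional as F

def moe_forward(x, gate, experts, top_k=2):
    # x: (num_tokens, hidden); gate: Linear(hidden -> num_experts)
    logits = gate(x)                                  # score every expert per token
    weights, idx = torch.topk(logits, top_k, dim=-1)  # keep only the top_k experts
    weights = F.softmax(weights, dim=-1)              # normalize over the chosen ones
    out = torch.zeros_like(x)
    for k in range(top_k):
        for e, expert in enumerate(experts):
            mask = idx[:, k] == e                     # tokens whose k-th pick is expert e
            if mask.any():
                out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
    return out

hidden, n_experts = 16, 4
experts = torch.nn.ModuleList([torch.nn.Linear(hidden, hidden) for _ in range(n_experts)])
gate = torch.nn.Linear(hidden, n_experts)
tokens = torch.randn(8, hidden)
y2 = moe_forward(tokens, gate, experts, top_k=2)  # two experts weigh in per token
y4 = moe_forward(tokens, gate, experts, top_k=4)  # all four weigh in: more "eyes", more compute
```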

It is also still very different from a regular merge. A single merged model is a jack of all trades and master of none, since it just blends all the knowledge together, while the experts in an MoE each have their own specialties or domains they excel in, hence "Mixture of Experts".
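To illustrate the distinction with a hypothetical toy example (not how Umbra was actually built): a regular merge averages the source models' weights into one layer, while a MoE keeps each source FFN intact as its own expert behind a learned gate.

```python
# Hypothetical toy contrast between a weight merge and a MoE; not how Umbra
# was actually assembled.
import torch

ffn_a = torch.nn.Linear(16, 16)  # stand-in for model A's feed-forward layer
ffn_b = torch.nn.Linear(16, 16)  # stand-in for model B's feed-forward layer

# Regular merge: one averaged jack-of-all-trades layer; the originals are gone.
merged = torch.nn.Linear(16, 16)
with torch.no_grad():
    merged.weight.copy_((ffn_a.weight + ffn_b.weight) / 2)
    merged.bias.copy_((ffn_a.bias + ffn_b.bias) / 2)

# MoE: both layers survive as specialists; a gate routes each token between them.
experts = torch.nn.ModuleList([ffn_a, ffn_b])
gate = torch.nn.Linear(16, len(experts))
```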

Even if it's not as good, could you still make an alt version with 2 experts per token? For me, this version with 4 is just too slow :\

Steel Storage org

Sure! I'll see if I can get one posted soon

Steel Storage org

Here is the two-expert version!

Steelskull/Umbra-v3-MoE-4x11b-2ex
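As an aside, assuming the model uses a Mixtral-style config, the routing width is just the `num_experts_per_tok` config field, so in principle it can also be overridden at load time instead of re-uploading a whole model. A hedged sketch, not verified against this repo:

```python
# Hedged sketch: with a Mixtral-style architecture, "experts per token" is the
# `num_experts_per_tok` config field and can be overridden when loading.
from transformers import AutoConfig, AutoModelForCausalLM

repo = "Steelskull/Umbra-v3-MoE-4x11b-2ex"
config = AutoConfig.from_pretrained(repo)
config.num_experts_per_tok = 2  # raise to 4 to trade speed back for quality
model = AutoModelForCausalLM.from_pretrained(repo, config=config)
```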

I would be interested in seeing benchmarks between the two versions, just to verify objectively how much impact the extra experts make.

Steel Storage org

I actually thought that as well; it has been submitted to the leaderboard for a benchmark run.
