Hybrid approach?
So I saw the work momonga is doing to extract individual experts from Mixtral 8x7B so you end up with fewer experts: https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1
I wondered if it might be possible to take a hybrid approach that combines his method and yours?
For example, extract two of the experts from the original model, keeping the original gating they have, and mix those with one new expert that uses synthetic gating, giving a 3x7B with pretty decent gating right off the bat?
Obviously all this is experimental and crazy, and maybe it wouldn't work at all. But if you could somehow incorporate some of the original experts and gating, the resulting MoE might be more effective (and possibly a better starting point to fine-tune from)?
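Something like this is what I have in mind for one MoE layer's router; just a rough sketch assuming the usual Mixtral layout where the gate weight is `[num_experts, hidden_size]` and each row scores one expert (random tensors here as stand-ins, not real checkpoint weights):

```python
import torch

hidden_size = 4096
# Stand-in for the trained 8x7B router weight of one layer: [8, hidden_size]
original_gate = torch.randn(8, hidden_size)

# Suppose the extraction keeps experts 0 and 3 -- reuse their trained gate rows as-is
keep = [0, 3]
kept_rows = original_gate[keep]                # [2, hidden_size]

# One synthetic row for the new expert, e.g. built mergekit-style from the mean
# hidden state of a few positive prompts (placeholder random vector here)
synthetic_row = torch.randn(1, hidden_size)

# Hybrid 3-expert router: trained rows for the originals, synthetic row for the newcomer
hybrid_gate = torch.cat([kept_rows, synthetic_row], dim=0)
assert hybrid_gate.shape == (3, hidden_size)
```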
Oh, and perhaps you could merge models with the original 8 experts too (merge four Mistral fine-tunes, one into each original expert, and merge the gating between the original rows and the new synthetic-based ones), roughly as sketched below.
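The merging part could be as simple as a per-tensor linear blend; this is only an illustration with placeholder shapes and random tensors (real Mixtral expert projections are more like `[14336, 4096]`, and a real merge might use SLERP/TIES instead of a plain average):

```python
import torch

def linear_merge(a: torch.Tensor, b: torch.Tensor, t: float = 0.5) -> torch.Tensor:
    """Plain linear interpolation between two matching weight tensors."""
    return (1.0 - t) * a + t * b

# Toy shapes standing in for one expert's up/gate projection and the matching
# MLP weight from a Mistral fine-tune
expert_proj = torch.randn(128, 64)       # stand-in for a Mixtral expert weight
finetune_proj = torch.randn(128, 64)     # stand-in for the fine-tune's counterpart
merged_proj = linear_merge(expert_proj, finetune_proj, t=0.5)

# Same idea for the router: blend the expert's original gate row with a synthetic
# one so the merged expert keeps some of the trained routing signal
original_gate_row = torch.randn(64)
synthetic_gate_row = torch.randn(64)
merged_gate_row = linear_merge(original_gate_row, synthetic_gate_row, t=0.5)
```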
Just rambling 🤷‍♂️🤣