@alielfilali01 on Hugging Face: "Honestly i don't understand how come we as the open source community haven't…"

alielfilali01

posted an update Apr 7, 2024

Post

2184

Honestly i don't understand how come we as the open source community haven't surpassed GPT-4 yet ? Like for me it looks like everything is out there just need to be exploited! Clearly specialized small models outperforms gpt4 on downstream tasks ! So why haven't we just trained a 1B-2B really strong general model and then continue pertained and/or finetuned it on datasets for downstream tasks like math, code...well structured as Textbooks format or other datasets formats that have been proven to be really efficient and good! Ounce you have 100 finetuned model, just wrap them all into a FrankenMoE and Voila ✨
And that's just what a NOOB like myself had in mind, I'm sure there is better, more efficient ways to do it ! So the question again, why we haven't yet ? I feel I'm missing something... Right?

samusenps

Apr 7, 2024

🤔

AIIAR

Apr 7, 2024

•

edited Apr 7, 2024

very true all has not yet been done there are still innovations ahead

Shaleen123

Apr 8, 2024

You are free to try 😉

alielfilali01

Apr 8, 2024

Not really 😕

Erilaz

Apr 10, 2024

•

edited Apr 10, 2024

The issue is all those experts have to be very diverse and trained more or less simultaneously.
Because if you are going to use sparse MoE, your router model has to be able to predict the fittest expert for the upcoming token. Which means router has to be trained with the experts. That wouldn't be an issue for classic MoE, but both kinds of models also rely on the experts' uniform "understanding" of the cached context. I don't think a 100x2B model would work without that well enough. That's the reason why Mixtral fine-tuning is such a complicated task.
Not only that, we don't really have a good base 2B model. Sure, Phi exists... With 2K ctx length, no GQA, coherency issues and very limited knowledge. I don't think the point of "expert" is providing domain-specific capabilities into the composite model, I think the trick is overcoming the diminishing returns in training, as well as some bandwidth optimizations for inference. So among your 100 experts, one might have both an analog for Grandma cell and some weights associated with division. Another expert could be good at both kinds of ERP - being Enterprise Recourse Planning and the main excuse for Frankenmerges creation, lol. The model distillation becomes better over time, but I don't think any modern 2B can help to compete with GPT-4. Perhaps 16x34B could, but good luck training that from scratch as a relatively small business, let alone nonprofit or private individual.

Join the conversation