Discuss benefits of this work
I am not part of the Mistral community, so apologies if these questions are silly.
According to the blog, the value of this work is that you train a 47B model but it only costs about 13B worth of compute during inference. In other words, "using a 47B model at 13B inference speed (but with the VRAM of a 47B model?)". Is there anything else I am missing?
Another question is about performance. I take this model to be somewhere between a 47B and a 13B model, and I just checked the leaderboard: some 34B models have a higher average score. I know real-world usefulness is not the same as benchmark scores, but I would like to hear from people with more experience using these models. Could you kindly provide some insights?
@Starlento
That's because this is a MoE (Mixture of Experts) model.
It is made up of eight 7B-parameter experts trained on different data. An easy way to think about it: one is trained on science, another on math, and another on roleplay. They are probably not actually trained like this, but it's somewhat similar.
You might say 8 x 7 does not equal 47. The reason Mixtral is 47B parameters rather than 56B is that some of the parameters (the attention layers and embeddings) are shared between the experts, so the total works out to about 47B.
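To make the shared-parameter point concrete, here is a rough back-of-the-envelope count. The split between shared and per-expert parameters below is an approximation for illustration, not the exact Mixtral config:

```python
# Rough parameter count for a Mixtral-style MoE (approximate split, for illustration).
shared = 1.6e9          # attention layers + embeddings, counted once for all experts
per_expert_ffn = 5.6e9  # feed-forward ("expert") parameters per expert
num_experts = 8

naive_total = num_experts * 7e9                       # the "8 x 7B = 56B" intuition
actual_total = shared + num_experts * per_expert_ffn  # shared parts counted only once

print(f"naive:  {naive_total / 1e9:.0f}B")   # 56B
print(f"actual: {actual_total / 1e9:.1f}B")  # ~46.4B, i.e. "about 47B"
```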
The reason it uses so much VRAM while running at roughly 13B inference speed is its architecture.
You MUST load all of the experts, so it takes a very large amount of VRAM (about the same as a normal 47B model).
However, during actual inference each token only goes through the 2 experts best suited to it, so about 2 x 7 = 14B parameters are active, which gives you roughly 13B-class speed.
Which 2 experts get used can change depending on the instruction you input, so all 8 experts have to be loaded in advance.
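Here is a minimal sketch of what that top-2 routing looks like in code. It is a simplified illustration with assumed module names and layouts, not the actual Mixtral implementation (the dim/hidden/expert-count defaults mirror Mixtral's published config, everything else is schematic):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Top2MoELayer(nn.Module):
    """Simplified sparse MoE feed-forward layer: every expert lives in memory,
    but each token is only computed by the 2 experts the router picks."""

    def __init__(self, dim=4096, hidden=14336, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(dim, num_experts, bias=False)  # gating network
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, hidden), nn.SiLU(), nn.Linear(hidden, dim))
             for _ in range(num_experts)]
        )

    def forward(self, x):  # x: (num_tokens, dim)
        scores = self.router(x)                            # (num_tokens, num_experts)
        weights, picked = scores.topk(self.top_k, dim=-1)  # best 2 experts per token
        weights = F.softmax(weights, dim=-1)               # mixing weights for those 2

        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = picked[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

# Tiny sizes just to run the sketch: all experts are allocated (VRAM cost),
# but only 2 run per token (compute cost).
layer = Top2MoELayer(dim=64, hidden=256)
tokens = torch.randn(4, 64)
print(layer(tokens).shape)  # torch.Size([4, 64])
```

All eight experts sit in memory the whole time, but any given token only runs through two of them, which is exactly the 47B-VRAM / ~13B-compute trade-off described above.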
Mixtral is excellent for its size and performs really well on instruction tasks. Its benchmark scores are already decent, and they are likely to improve further as the community fine-tunes it.