How did you train the gating?
How was the gating trained here? I don't see positive_prompts
in the model card metadata
It hasn't been trained (yet), unfortunately. It looks like the first two experts are always selected because of that.
I'm working a little bit more on the 2x version before moving to fine-tuning.
You don't need to fine-tune at all.
Just change the value of num_experts_per_tok to 4.
Here's a quick test I did: https://www.reddit.com/r/LocalLLaMA/comments/193zr7l/the_secret_to_improve_the_performance_of_mixtral/
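Something like this, assuming a Mixtral-style MoE config in transformers (the model id is a placeholder):

```python
from transformers import AutoConfig, AutoModelForCausalLM

# Load the config and raise the number of experts activated per token.
config = AutoConfig.from_pretrained("your-moe-model")  # placeholder model id
config.num_experts_per_tok = 4  # Mixtral-style MoEs default to 2

model = AutoModelForCausalLM.from_pretrained("your-moe-model", config=config)
```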
That would activate more experts, which would increase the number of activated parameters. Ideally, we should be able to route tokens to the experts that will handle them best.
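To illustrate the trade-off, here's a toy sketch of top-k routing (the sizes are illustrative, not this model's actual dimensions):

```python
import torch

# Toy top-k routing: the gate scores every expert for a token, and only the
# top-k experts run, so activated parameters grow roughly linearly with k.
hidden_size, num_experts = 4096, 8               # illustrative sizes
gate = torch.nn.Linear(hidden_size, num_experts, bias=False)

token = torch.randn(1, hidden_size)              # one token's hidden state
probs = gate(token).softmax(dim=-1)
weights, selected = torch.topk(probs, k=2)       # k=2 vs k=4: double the active experts
```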
Maybe when training, all experts should be active.
This is a very small model, it doesn't matter.
@cloudyu For sure, but this kind of defeats the purpose of the architecture, imo. A cheap way of fixing it is just adding noise to the gating weights (@mrclbschff on Twitter did it). A better way is fine-tuning them, especially with a positive prompt initialization à la mergekit.
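A rough sketch of the noise trick, assuming a Mixtral-style model in transformers where each decoder layer exposes its router as block_sparse_moe.gate (the noise scale is arbitrary):

```python
import torch

def add_noise_to_gates(model, std=0.01):
    """Perturb router weights so top-k selection stops collapsing onto the same experts."""
    with torch.no_grad():
        for layer in model.model.layers:
            gate = layer.block_sparse_moe.gate   # nn.Linear(hidden_size, num_experts)
            gate.weight.add_(torch.randn_like(gate.weight) * std)
```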
Maybe when fine-tuning, all experts should be active, which means setting num_experts_per_tok to the maximum. Once training is finished, set num_experts_per_tok = 2 again.
This is my guess.
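Something like this, assuming a Mixtral-style config where num_local_experts is the total number of experts (model id is a placeholder):

```python
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("your-moe-model")        # placeholder model id
config.num_experts_per_tok = config.num_local_experts        # dense routing while training
model = AutoModelForCausalLM.from_pretrained("your-moe-model", config=config)

# ... fine-tune here ...

model.config.num_experts_per_tok = 2                         # back to sparse top-2 for inference
```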
In this case, you won't train your gating weights to select the best experts (since you're using all of them anyway). I think it's important to use the same value of num_experts_per_tok for training and inference.
Great work. I have a feeling the inference script may be doing something funny with the gate weights. Will report back when I've done some testing.
Update!
I have now trained the gating weights with DPO.
Here is an example: https://huggingface.co/cloudyu/Mixtral-8x7B-Instruct-v0.1-DPO
Another example: https://huggingface.co/cloudyu/Pluto_24B_DPO_200
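For anyone wanting to try the general idea, here is a minimal sketch of DPO training with trl; the model id, the dataset, and the choice to freeze everything except the router weights are assumptions for illustration, not necessarily what was done for the models above.

```python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import DPOConfig, DPOTrainer

model_id = "your-moe-model"  # placeholder
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# One way to focus training on the routing: freeze everything except the gates
# (for Mixtral, the router weights are named "...block_sparse_moe.gate...").
for name, param in model.named_parameters():
    param.requires_grad = "gate" in name

# Any preference dataset with prompt/chosen/rejected columns works here.
dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")

trainer = DPOTrainer(
    model=model,
    args=DPOConfig(output_dir="dpo-gating", beta=0.1),
    train_dataset=dataset,
    processing_class=tokenizer,  # recent trl versions; older ones use tokenizer=
)
trainer.train()
```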