How did you train the gating?

#6 opened by osanseviero

How was the gating trained here? I don't see positive_prompts in the model card metadata.

It hasn't been trained (yet), unfortunately. It looks like the first two experts are always selected because of that.

I'm working a little bit more on the 2x version before moving to fine-tuning.
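
A minimal sketch for checking which experts actually get picked on a merge like this (the model ID below is a placeholder, and output_router_logits assumes a Mixtral-style architecture in a recent transformers version):

```python
# Rough sketch: count which experts the router picks on a sample prompt.
# "your-org/your-moe-model" is a placeholder, and output_router_logits
# assumes a Mixtral-style model class.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-moe-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# out.router_logits holds one (num_tokens, num_experts) tensor per MoE layer.
k = model.config.num_experts_per_tok
for layer_idx, logits in enumerate(out.router_logits):
    picked = logits.topk(k, dim=-1).indices.flatten()
    counts = torch.bincount(picked, minlength=model.config.num_local_experts)
    print(f"layer {layer_idx}: {counts.tolist()}")
```

If the same one or two experts dominate the counts in every layer, the gating is effectively untrained.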

You don't need to fine-tune at all.
Just change the value of num_experts_per_tok to 4.
Here is a quick test I ran: https://www.reddit.com/r/LocalLLaMA/comments/193zr7l/the_secret_to_improve_the_performance_of_mixtral/
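
For reference, a hedged sketch of what that change looks like when loading with transformers (the model ID is a placeholder):

```python
# Sketch: raise num_experts_per_tok at load time instead of fine-tuning.
# "your-org/your-moe-model" is a placeholder for the actual repo.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-org/your-moe-model"  # placeholder
config = AutoConfig.from_pretrained(model_id)
config.num_experts_per_tok = 4  # route each token to 4 experts instead of 2
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```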

That would activate more experts, which would increase the number of activated parameters. Ideally, we should be able to route tokens to the expert that will handle them best.

Maybe when training, all experts should be active.

This is a very small model, so it doesn't matter.

@cloudyu For sure but this kind of defeats the purpose of the architecture imo. A cheap way of fixing it is just adding noise to the gating weights (@mrclbschff on Twitter did it). A better way is fine-tuning them, especially with a positive prompt initialization à la mergekit.

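A rough sketch of that noise trick (the attribute path assumes a Mixtral-style layout in transformers, and the model ID, output path, and noise scale are assumptions, not a tested recipe):

```python
# Sketch: perturb the router weights so the top-k choice is no longer
# dominated by the first two experts. Model ID, output path, and the
# 1e-2 noise scale are all placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-moe-model")

with torch.no_grad():
    for layer in model.model.layers:
        gate = layer.block_sparse_moe.gate  # nn.Linear(hidden_size, num_experts)
        gate.weight.add_(torch.randn_like(gate.weight) * 1e-2)

model.save_pretrained("your-moe-model-noisy-gates")
```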

Maybe when fine-tuning, all experts should be active, which means setting num_experts_per_tok to the max. When training is finished, set num_experts_per_tok = 2.
This is my guess.

In this case, you won't train your gating weights to select the best experts (since you're using all of them anyway). I think it's important to use the same num_experts_per_tok value for training and inference.

Great work. I have a feeling the inference script may be doing something funny with the gate weights. I will report back when I've done some testing.

Update!
I have now trained the gating weights with DPO.
Here is an example: https://huggingface.co/cloudyu/Mixtral-8x7B-Instruct-v0.1-DPO
Another example: https://huggingface.co/cloudyu/Pluto_24B_DPO_200
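
For anyone curious how a gate-only setup could look, here is a minimal sketch (an assumed setup, not necessarily the recipe behind those repos) that freezes everything except the router weights before handing the model to a preference trainer such as trl's DPOTrainer; the attribute names assume a Mixtral-style model:

```python
# Sketch: make only the per-layer router (gate) weights trainable.
# The model ID is a placeholder and the attribute names assume a
# Mixtral-style model; this is not the exact recipe behind the linked repos.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-moe-model")  # placeholder

# Freeze every parameter except the per-layer router projections.
for name, param in model.named_parameters():
    param.requires_grad = ".block_sparse_moe.gate." in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[0]}")

# The model can then be passed to trl's DPOTrainer with a preference
# dataset; exact trainer arguments depend on the trl version.
```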
