How did you train the gating?

#6 opened by osanseviero

How was the gating trained here? I don't see positive_prompts in the model card metadata.

It hasn't been trained (yet), unfortunately. It looks like the first two experts are always selected because of that.

I'm working a little bit more on the 2x version before moving to fine-tuning.
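
A minimal sketch for checking which experts actually get picked on a merge like this (the model ID below is a placeholder, and output_router_logits assumes a Mixtral-style architecture in a recent transformers version):

```python
# Rough sketch: count which experts the router picks on a sample prompt.
# "your-org/your-moe-model" is a placeholder, and output_router_logits
# assumes a Mixtral-style model class.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "your-org/your-moe-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16)

inputs = tokenizer("The quick brown fox jumps over the lazy dog.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_router_logits=True)

# out.router_logits holds one (num_tokens, num_experts) tensor per MoE layer.
k = model.config.num_experts_per_tok
for layer_idx, logits in enumerate(out.router_logits):
    picked = logits.topk(k, dim=-1).indices.flatten()
    counts = torch.bincount(picked, minlength=model.config.num_local_experts)
    print(f"layer {layer_idx}: {counts.tolist()}")
```

If the same one or two experts dominate the counts in every layer, the gating is effectively untrained.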

You don't need to fine-tune at all.
Just change the value of num_experts_per_tok to 4.
Here is a quick test I ran: https://www.reddit.com/r/LocalLLaMA/comments/193zr7l/the_secret_to_improve_the_performance_of_mixtral/
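
For reference, a hedged sketch of what that change looks like when loading with transformers (the model ID is a placeholder):

```python
# Sketch: raise num_experts_per_tok at load time instead of fine-tuning.
# "your-org/your-moe-model" is a placeholder for the actual repo.
from transformers import AutoConfig, AutoModelForCausalLM

model_id = "your-org/your-moe-model"  # placeholder
config = AutoConfig.from_pretrained(model_id)
config.num_experts_per_tok = 4  # route each token to 4 experts instead of 2
model = AutoModelForCausalLM.from_pretrained(model_id, config=config)
```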

That would activate more experts, which would increase the number of activated parameters. Ideally, we should be able to route tokens to the expert that will handle them best.

Maybe when training, all experts should be active.

This is a very small model, so it doesn't matter.

@cloudyu For sure but this kind of defeats the purpose of the architecture imo. A cheap way of fixing it is just adding noise to the gating weights (@mrclbschff on Twitter did it). A better way is fine-tuning them, especially with a positive prompt initialization à la mergekit.

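A rough sketch of that noise trick (the attribute path assumes a Mixtral-style layout in transformers, and the model ID, output path, and noise scale are assumptions, not a tested recipe):

```python
# Sketch: perturb the router weights so the top-k choice is no longer
# dominated by the first two experts. Model ID, output path, and the
# 1e-2 noise scale are all placeholders/assumptions.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-moe-model")

with torch.no_grad():
    for layer in model.model.layers:
        gate = layer.block_sparse_moe.gate  # nn.Linear(hidden_size, num_experts)
        gate.weight.add_(torch.randn_like(gate.weight) * 1e-2)

model.save_pretrained("your-moe-model-noisy-gates")
```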

Maybe when fine-tuning, all experts should be active, which means setting num_experts_per_tok to the max. When training is finished, set num_experts_per_tok = 2.
This is my guess.

In this case, you won't train your gating weights to select the best experts (since you're using all of them anyway). I think it's important to use the same num_experts_per_tok value for training and inference.

Great work. I have a feeling the inference script may be doing something funny with the gate weights. I will report back when I've done some testing.

Update!
I have now trained the gating weights with DPO.
Here is an example: https://huggingface.co/cloudyu/Mixtral-8x7B-Instruct-v0.1-DPO
Another example: https://huggingface.co/cloudyu/Pluto_24B_DPO_200
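
For anyone curious how a gate-only setup could look, here is a minimal sketch (an assumed setup, not necessarily the recipe behind those repos) that freezes everything except the router weights before handing the model to a preference trainer such as trl's DPOTrainer; the attribute names assume a Mixtral-style model:

```python
# Sketch: make only the per-layer router (gate) weights trainable.
# The model ID is a placeholder and the attribute names assume a
# Mixtral-style model; this is not the exact recipe behind the linked repos.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("your-org/your-moe-model")  # placeholder

# Freeze every parameter except the per-layer router projections.
for name, param in model.named_parameters():
    param.requires_grad = ".block_sparse_moe.gate." in name

trainable = [n for n, p in model.named_parameters() if p.requires_grad]
print(f"{len(trainable)} trainable tensors, e.g. {trainable[0]}")

# The model can then be passed to trl's DPOTrainer with a preference
# dataset; exact trainer arguments depend on the trl version.
```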
