Maybe a slerp or some other merge method will preserve the component experts better?

#2
by BlueNipples - opened

Just a thought. Would be great if we could get Mixtral down to 3-4 experts for lower-end hardware. Given it only needs two at a time, there's probably a lot of redundancy.

Here we are prototyping a Mixtral model with extracted experts :)
https://huggingface.co/mmnga/Mixtral-Extraction-4x7B-Instruct-v0.1

In the conversion notebook, you can choose which experts to target:
convert_mixtral_8x7b_to_4x7b_extract.ipynb
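
(Not the notebook itself, just a rough sketch of the idea: with the transformers Mixtral implementation, extracting a subset of experts mostly amounts to dropping entries from each layer's expert `ModuleList` and slicing the router accordingly. The `keep_experts` list, output path, and exact module attributes below are assumptions and may differ between transformers versions.)

```python
# Minimal sketch: keep only a chosen subset of experts in each Mixtral MoE layer.
# Assumes the transformers Mixtral implementation, where each decoder layer has
# `block_sparse_moe.experts` (a ModuleList) and a `gate` Linear over all experts.
import torch
from torch import nn
from transformers import MixtralForCausalLM

keep_experts = [0, 2, 5, 7]  # hypothetical choice of experts to retain

model = MixtralForCausalLM.from_pretrained(
    "mistralai/Mixtral-8x7B-Instruct-v0.1", torch_dtype=torch.bfloat16
)

for layer in model.model.layers:
    moe = layer.block_sparse_moe
    # Drop the unwanted experts.
    moe.experts = nn.ModuleList(moe.experts[i] for i in keep_experts)
    # Shrink the router so it only scores the remaining experts.
    old_gate = moe.gate
    new_gate = nn.Linear(old_gate.in_features, len(keep_experts), bias=False,
                         dtype=old_gate.weight.dtype)
    new_gate.weight.data = old_gate.weight.data[keep_experts].clone()
    moe.gate = new_gate
    moe.num_experts = len(keep_experts)

model.config.num_local_experts = len(keep_experts)
model.save_pretrained("mixtral-4x7b-extracted")  # placeholder output path
```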

Thank you!
Changed the merge method to slerp.
We were able to improve the quality of the output :)

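For anyone following along, slerp between two weight tensors is only a few lines. A minimal sketch (flattening to a vector, interpolating, then reshaping is just one common way to do it, not necessarily what the notebook uses):

```python
import torch

def slerp(w0: torch.Tensor, w1: torch.Tensor, t: float, eps: float = 1e-7) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors of the same shape."""
    v0 = w0.flatten().float()
    v1 = w1.flatten().float()
    # Angle between the two weight vectors.
    cos_theta = torch.dot(v0, v1) / (v0.norm() * v1.norm() + eps)
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    sin_theta = torch.sin(theta)
    if sin_theta.abs() < eps:
        # Nearly parallel vectors: fall back to plain linear interpolation.
        merged = (1.0 - t) * v0 + t * v1
    else:
        merged = (torch.sin((1.0 - t) * theta) / sin_theta) * v0 \
               + (torch.sin(t * theta) / sin_theta) * v1
    return merged.reshape(w0.shape).to(w0.dtype)

# e.g. merged = slerp(expert_a.w1.weight.data, expert_b.w1.weight.data, t=0.5)
```
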
Tremendous!

Yes, I've had to use slerp and other more involved merge methods (gradient merge, TIES merge) to preserve as much of the two models' nuances as possible. It's likely that with an MoE model you need to preserve as much of those differences as possible.

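For reference, a rough TIES-style merge on a single weight tensor looks something like the sketch below: trim small deltas from a shared base, elect a sign per parameter, then average only the deltas that agree. The `density` value and exact details are assumptions; the full method is described in the TIES-Merging paper.

```python
import torch

def ties_merge(base: torch.Tensor, finetuned: list[torch.Tensor],
               density: float = 0.2) -> torch.Tensor:
    """Rough TIES-style merge of several fine-tuned tensors onto a shared base."""
    deltas = [ft.float() - base.float() for ft in finetuned]

    # Trim: keep only the top-`density` fraction of each delta by magnitude.
    trimmed = []
    for d in deltas:
        k = max(1, int(density * d.numel()))
        threshold = d.abs().flatten().kthvalue(d.numel() - k + 1).values
        trimmed.append(torch.where(d.abs() >= threshold, d, torch.zeros_like(d)))

    # Elect a sign per parameter from the summed trimmed deltas.
    elected_sign = torch.sign(sum(trimmed))

    # Disjoint merge: average only the deltas whose sign matches the elected one.
    stacked = torch.stack(trimmed)
    mask = (torch.sign(stacked) == elected_sign) & (stacked != 0)
    count = mask.sum(dim=0).clamp(min=1)
    merged_delta = (stacked * mask).sum(dim=0) / count

    return (base.float() + merged_delta).to(base.dtype)
```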