osanseviero posted an update Jan 10
I finished my model merging experiment day. 🤗 I would love your thoughts on this.

What did I do? I merged Mistral Instruct 0.1 and 0.2 models using different merging techniques:
- SLERP: spherical linear interpolation (the most popular method; see the quick sketch after this list)
- MoE: replace some feed-forward (FFN) layers with MoE layers; using a random gate for now
- Frankenmerge: also known as passthrough, but that name isn't as cool. It concatenates specified layer ranges from the models, ending up with a different number of params. In my case, I went from 7B to 9B.
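
To make the SLERP idea concrete, here is a minimal sketch of what spherical interpolation does to a single pair of weight tensors, applied tensor-by-tensor across the two checkpoints. This is my own illustration, not mergekit's implementation; the function name and the linear-interpolation fallback are assumptions.

```python
import torch

def slerp(t: float, w1: torch.Tensor, w2: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """Spherical linear interpolation between two weight tensors at fraction t."""
    # Angle between the two tensors, measured after normalizing them
    v1 = w1 / (w1.norm() + eps)
    v2 = w2 / (w2.norm() + eps)
    dot = torch.clamp((v1 * v2).sum(), -1.0, 1.0)
    omega = torch.arccos(dot)
    # Nearly colinear tensors: plain linear interpolation is numerically safer
    if omega < eps:
        return (1 - t) * w1 + t * w2
    sin_omega = torch.sin(omega)
    return (torch.sin((1 - t) * omega) / sin_omega) * w1 + (torch.sin(t * omega) / sin_omega) * w2

# e.g. merged = slerp(0.5, weights_from_v0_1, weights_from_v0_2)
```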

Note: merging is not building an ensemble of models. You can read more about merging techniques at https://huggingface.co/blog/mlabonne/merge-models

Results
I built the 3 models using mergekit (running in an HF Space; it took less than an hour to do all three): osanseviero/mistral-instruct-merges-659ebf35ca0781acdb86bb0a
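
For reference, here is a rough sketch of what kicking off one of these merges looks like. The SLERP config schema and the `mergekit-yaml` entry point follow the mergekit README and the blog post linked above, but treat the exact keys, values, and model IDs as illustrative assumptions rather than the configs I actually used.

```python
# Illustrative only: a SLERP merge of the two Mistral Instruct checkpoints.
import subprocess
import textwrap

config = textwrap.dedent("""\
    slices:
      - sources:
          - model: mistralai/Mistral-7B-Instruct-v0.1
            layer_range: [0, 32]
          - model: mistralai/Mistral-7B-Instruct-v0.2
            layer_range: [0, 32]
    merge_method: slerp
    base_model: mistralai/Mistral-7B-Instruct-v0.2
    parameters:
      t: 0.5  # 0.5 = an even blend of the two checkpoints
    dtype: bfloat16
""")

with open("slerp-config.yml", "w") as f:
    f.write(config)

# mergekit-yaml <config> <output dir> writes the merged model to disk
subprocess.run(["mergekit-yaml", "slerp-config.yml", "./mistral-instruct-slerp"], check=True)
```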

I'm doing a quick check with the OpenLLM Leaderboard.
🚨The OpenLLM Leaderboard is more suitable for pre-trained models than instruct models, but I still thought it would be interesting to look at the insights🚨

You can look at the attached image. Some interesting things:
- All three models performed somewhere between 0.1 and 0.2 - congrats to the 140 people who got it right in https://twitter.com/osanseviero/status/1745071548866736171
- Frankenmerge did terribly on GSM8K. It seems that adding some Mistral 0.1 layers actually degraded the performance a lot - this is even worse than 0.1!
- Otherwise, frankenmerge was decent across HellaSwag, MMLU, and especially TruthfulQA
- MoE is using random gating, so I expected something right in between 0.1 and 0.2, which was the case

What do I do with this?
Not sure tbh! I think doing proper MT-Bench evals would be nice. I also think all of us should give a nice GH star to mergekit because it's awesome. I would love to have the time to do end-to-end ablation studies, but cool new things are coming up. Let me know if you have any thoughts on the results.

And yes...it seems spreadsheets are still the most used tool out there


I was expecting dataset + dataset-viewer, but this also works 😁

Really cool to see these results, thank you!

How are the params of the MoE layers populated, though? Doesn't it impact the performance? What's the intuition? 😟


IIUC, this is a very naive way of merging the models. We replace some feed-forward blocks with MoE blocks (a bit like the image in https://huggingface.co/blog/moe#what-is-a-mixture-of-experts-moe). That is, we replace each FFN layer with the FFN layers from the different models (which hence requires the models to be the same size).

The only missing part is the router. mergekit allows you to do a small fine-tuning of it by providing positive/negative examples of what you want each expert to handle, which is great when you want each expert to specialize in a certain type of task (e.g. math, programming, etc.). In this case, since we don't have task-specific experts, we just use a random gate (which is an option in mergekit).
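
To illustrate what a random gate means here, a toy sketch (my own, not mergekit's code): each MoE block keeps the FFNs from the two source models as experts and mixes them with a randomly initialized, untrained linear gate. A production MoE layer like Mixtral's routes each token to only its top-k experts; this dense version just shows the idea.

```python
import torch
import torch.nn as nn

class RandomGateMoE(nn.Module):
    """Toy MoE block: the FFN sub-layers of the two source models become the
    experts, and a randomly initialized (untrained) linear gate mixes them."""

    def __init__(self, ffn_from_v0_1: nn.Module, ffn_from_v0_2: nn.Module, hidden_size: int):
        super().__init__()
        self.experts = nn.ModuleList([ffn_from_v0_1, ffn_from_v0_2])
        # Random init and never trained -> effectively an arbitrary soft mixture
        self.gate = nn.Linear(hidden_size, len(self.experts), bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_size)
        weights = torch.softmax(self.gate(hidden_states), dim=-1)  # (batch, seq_len, 2)
        out = torch.zeros_like(hidden_states)
        for i, expert in enumerate(self.experts):
            out = out + weights[..., i:i + 1] * expert(hidden_states)
        return out
```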

This MoE merging technique is quite popular nowadays and is being used by most of the top models on the LLM leaderboard, such as https://huggingface.co/cloudyu/Mixtral_34Bx2_MoE_60B. And yes, it impacts performance a bit since we can now have more active params + we need a larger GPU to hold the extra params.

Cool work!

Has anyone tried the codellama models with mergekit so far?