Very interesting.

opened by Haiart

Interesting idea. Have you considered other models like Starling-LM-7B-alpha, Chupacabra-v3, Zephyr-7b-beta, and OpenChat-3.5?
Maybe a Frankenmerge between these six models (including the two used here, OpenHermes-2.5 and Neural-Chat-3.1), resulting in three different 11B models, then merging all three with Dare_Ties.
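For anyone who wants to try one of the pairwise stacks, here is a minimal sketch using mergekit's passthrough method. The layer ranges [0, 24] / [8, 32] (48 layers total, roughly 11B parameters from two 32-layer Mistral 7Bs) and the output path are assumptions, not a recipe from this thread:

```python
# Sketch: one of the three pairwise Frankenmerges via mergekit "passthrough".
# Layer ranges and paths are illustrative assumptions.
import subprocess

config = """\
slices:
  - sources:
      - model: teknium/OpenHermes-2.5-Mistral-7B
        layer_range: [0, 24]
  - sources:
      - model: Intel/neural-chat-7b-v3-1
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
"""

with open("frankenmerge-11b.yml", "w") as f:
    f.write(config)

# mergekit-yaml <config> <output-dir> is mergekit's standard CLI entry point.
subprocess.run(["mergekit-yaml", "frankenmerge-11b.yml", "./hermes-neural-11b"],
               check=True)
```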

Also, could you upload the Q8_0 GGUF? Thank you in advance.

Yes, I uploaded a Q8 and a Q5 here: https://huggingface.co/S4sch/Open-Hermes-2.5-neural-chat-3.1-frankenmerge-11b-gguf-q8. However, while talking to the model, it sometimes has weird repetition issues; I think it might have to do with the EOS tokens. I will try another merge where I take the union of the special tokens of both models and see whether that is better.
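As a rough illustration of what taking that union could look like with transformers (the merged-model path is hypothetical; I believe mergekit also exposes a tokenizer_source option for this in its merge methods):

```python
# Sketch: union both tokenizers' vocabularies into the merged model's
# tokenizer and resize the embeddings to match. Paths are placeholders.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok_a = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
tok_b = AutoTokenizer.from_pretrained("Intel/neural-chat-7b-v3-1")

# Start from one tokenizer, then add whatever the other defines on top.
merged_tok = AutoTokenizer.from_pretrained("teknium/OpenHermes-2.5-Mistral-7B")
missing = sorted((set(tok_a.get_vocab()) | set(tok_b.get_vocab()))
                 - set(merged_tok.get_vocab()))
merged_tok.add_tokens(missing)

# Note: rows added by resize_token_embeddings are randomly initialised;
# ideally they would be copied from the model that originally defined them.
model = AutoModelForCausalLM.from_pretrained("./hermes-neural-11b")
model.resize_token_embeddings(len(merged_tok))

merged_tok.save_pretrained("./hermes-neural-11b")
model.save_pretrained("./hermes-neural-11b")
```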

And yes, I can definitely try that! I'm currently figuring out which models are good to merge, as I realized that if they differ in architecture, special tokens, or how they were trained, the result can be quite weird. In the next few days I will try out some different merges and see what works and what doesn't.
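A quick way to sanity-check two candidates before merging is to diff their configs and special-token maps; a small sketch (the two model IDs are just the pair merged here):

```python
# Sketch: compare architecture and special tokens of two merge candidates.
# Mismatched vocab sizes or EOS tokens are a red flag for weird merges.
from transformers import AutoConfig, AutoTokenizer

a = "teknium/OpenHermes-2.5-Mistral-7B"
b = "Intel/neural-chat-7b-v3-1"
cfg_a, cfg_b = AutoConfig.from_pretrained(a), AutoConfig.from_pretrained(b)

for key in ("model_type", "hidden_size", "num_hidden_layers",
            "num_attention_heads", "vocab_size"):
    va, vb = getattr(cfg_a, key, None), getattr(cfg_b, key, None)
    flag = "  <-- mismatch" if va != vb else ""
    print(f"{key}: {va} vs {vb}{flag}")

tok_a, tok_b = AutoTokenizer.from_pretrained(a), AutoTokenizer.from_pretrained(b)
print("special tokens:", tok_a.special_tokens_map, "vs", tok_b.special_tokens_map)
```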

Interestingly enough, all the models I recommended above are based on Mistral-v0.1, though of course they employ different training methods and so on.
Taking the union of the special tokens is sensible and welcome. Somehow I didn't face any repetition issues myself, but my settings are probably different from yours.
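For context, repetition loops of this kind are quite sensitive to sampler settings, so two setups can easily disagree; a sketch of the relevant knobs in transformers (the penalty value and model path are illustrative assumptions, not anyone's actual settings):

```python
# Sketch: generation settings that commonly hide or expose repetition loops.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

path = "./hermes-neural-11b"  # placeholder path
tok = AutoTokenizer.from_pretrained(path)
model = AutoModelForCausalLM.from_pretrained(
    path, torch_dtype=torch.float16, device_map="auto")

inputs = tok("Explain model merging in one paragraph.",
             return_tensors="pt").to(model.device)
out = model.generate(
    **inputs,
    max_new_tokens=200,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
    repetition_penalty=1.1,         # often enough to suppress loops
    eos_token_id=tok.eos_token_id,  # make the stop token explicit
)
print(tok.decode(out[0], skip_special_tokens=True))
```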

The idea above is simple on paper: pick two models from the six available, Frankenmerge them, and repeat until you have three different 11B models (without using any single model twice in a Frankenmerge), then merge all three together (preferably with Dare_Ties) for a definitive 11B Mistral-based model. But as you mentioned, it has issues; if you're going to try anyway, I'll be looking forward to it. I do think they have good compatibility, since they're all based on a single base model after all.
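If it helps, the final combination step could look something like this with mergekit. Every path below is a placeholder for the three hypothetical 11B Frankenmerges, and since dare_ties works on deltas from a common base, the base here is assumed to be Mistral-7B-v0.1 stacked into the same 11B layer layout:

```python
# Sketch: merge three hypothetical 11B Frankenmerges with dare_ties.
# All model paths, densities, and weights are illustrative assumptions.
import subprocess

config = """\
models:
  - model: ./hermes-neural-11b
    parameters: {density: 0.5, weight: 0.33}
  - model: ./starling-openchat-11b
    parameters: {density: 0.5, weight: 0.33}
  - model: ./zephyr-chupacabra-11b
    parameters: {density: 0.5, weight: 0.34}
merge_method: dare_ties
base_model: ./mistral-base-11b
dtype: float16
"""

with open("final-dare-ties.yml", "w") as f:
    f.write(config)

subprocess.run(["mergekit-yaml", "final-dare-ties.yml", "./definitive-11b"],
               check=True)
```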

Hello S4sch, this is very interesting. I'm testing DARE and DPO right now, but my 20B models are prohibitively large for DPO training. I'm going to do some iterations on your strategy.
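For a sense of scale, here is a back-of-the-envelope estimate of why full-parameter DPO on a 20B model is prohibitive: DPO trains a policy model while keeping a frozen reference model in memory, and a mixed-precision Adam setup carries roughly 14 bytes per trainable parameter.

```python
# Rough memory estimate for full-parameter DPO on a 20B model, ignoring
# activations and gradient buffers. Assumes fp16 weights + fp32 master
# weights + two fp32 Adam moments, plus a frozen fp16 reference model.
params = 20e9

policy_fp16  = params * 2  # trainable weights, fp16
master_fp32  = params * 4  # fp32 copy kept by the optimizer
adam_moments = params * 8  # first and second moments, fp32 each
reference    = params * 2  # frozen reference model, fp16

total_gb = (policy_fp16 + master_fp32 + adam_moments + reference) / 1024**3
print(f"~{total_gb:.0f} GB before activations")  # ~298 GB
```

Which is why parameter-efficient approaches like LoRA-based DPO are the usual workaround at this size.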

A model following the same layout but built entirely from the base model exhibits the same EOS issues, unfortunately. It seems to be an effect of the layer layout.
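For reference, the control experiment described here would look something like this: the same passthrough layout, but with both slices taken from the base model (layer ranges assumed as before):

```python
# Sketch of the control: identical layer layout, built entirely from the
# base model. If this also shows EOS weirdness, the layout itself (not the
# fine-tunes' differing prompt formats) is the culprit.
import subprocess

config = """\
slices:
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [0, 24]
  - sources:
      - model: mistralai/Mistral-7B-v0.1
        layer_range: [8, 32]
merge_method: passthrough
dtype: float16
"""

with open("base-control-11b.yml", "w") as f:
    f.write(config)

subprocess.run(["mergekit-yaml", "base-control-11b.yml", "./mistral-base-11b"],
               check=True)
```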

Hi athirdpath! Yes, that really seems to be the case. It makes sense that if each model was trained with its own specific prompt format, the merge can easily get confused. It is interesting, though, that it's not always the case; in some tests it can still do quite well, it's just not very reliable.
