Attention, stupid question

#4
by Debich - opened

Does combining models into one actually make sense? Are we really getting +50B parameters that meaningfully affect the output, or is it not as smooth as it might seem? My intuition is that merging isn't very effective, since the models have similar neurons. How is that redundancy handled during model merging? Correct me if I'm wrong anywhere; I'm mostly guessing and don't really know much about this.

I wouldn't be surprised if I get a response along the lines of "I don't know, it just works the way I think it does."

In general, an effective merge compresses the information contained in the models' weights into a single set, which explains the performance boost we observe.
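
To make the "compression" point concrete, here's a minimal sketch of the simplest merge method, a plain weight average of two fine-tunes of the same base. This is just an illustration, not the recipe used for this model, and the toy tensors stand in for real state dicts:

```python
import numpy as np

# Hypothetical toy "state dicts": two fine-tunes of the same base architecture.
# In a real merge these would come from model.state_dict(); here they're random.
rng = np.random.default_rng(0)
model_a = {"layer.0.weight": rng.normal(size=(4, 4)), "layer.0.bias": rng.normal(size=4)}
model_b = {"layer.0.weight": rng.normal(size=(4, 4)), "layer.0.bias": rng.normal(size=4)}

# Linear merge: average each tensor. Information from both models is "compressed"
# into a single set of weights -- the parameter count does not grow at all.
merged = {name: 0.5 * model_a[name] + 0.5 * model_b[name] for name in model_a}

print({name: tensor.shape for name, tensor in merged.items()})
```

Note that in this kind of merge the parameter count stays the same; the "+50B parameters" case comes from a different kind of merge, described below.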

In this case, the self-merge provides extra layers for additional processing, letting the model refine its intermediate representations further. Because the merged model hasn't been fine-tuned afterwards, results can be quite chaotic, and this typically doesn't work well with small models. I can't give a good explanation for why it works for this specific merge, but it looks like the extra processing helps with creative tasks while degrading quality on other prompts (like reasoning).
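
For intuition, here's a minimal sketch of how a passthrough-style self-merge stacks layers. The layer ranges are made up for illustration and are not the actual recipe for this model:

```python
# A "passthrough" style self-merge: the new model is built by stacking slices of the
# same model's decoder layers, so some layers appear twice.
base_layers = list(range(80))           # e.g. an 80-layer model (assumed depth)

slices = [(0, 40), (20, 60), (40, 80)]  # overlapping ranges -> duplicated layers
merged_layers = [i for start, end in slices for i in base_layers[start:end]]

print(len(merged_layers))               # 120 layers: 1.5x the original depth
print(merged_layers[35:45])             # duplicated indices around a slice seam

# The extra parameters are copies of existing weights, so the merged model is
# larger but carries redundant information until/unless it is fine-tuned.
```

This is also why the extra parameters don't behave like genuinely new capacity: the duplicated layers start out identical, and only further training would differentiate them.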

I just wanted to say, this model is amazing! I've been building an assistant with it and comparing it along the way to 70B, and there is something about this one that is magical. The outputs are different: the style, the personality. Even when it comes to reasoning, I've noticed that its apparent drive to solve the problem often more than makes up for whatever quirks it has.
