Details of all the merge attempts leading up to this one

#1
by froggeric - opened

I theorised that the best self-merges should use adjacent layers between slices. To put this to the test, I started with miqu-1, but as it was too slow to iterate on, I switched to what I consider the current best 7B model, WestLake-7B-v2. This quickly showed my theory was wrong; in fact, it showed that finding a good formula for a self-merge is difficult. Most merges exhibited significant errors, and the remaining ones had less obvious errors while offering no significant improvement over the original.

Until I came up with this one. Throughout all my testing it has not exhibited a single error, and it shows a clear improvement over the original. I do not know yet whether these merge settings can be applied to any Mistral-based 7B model, or whether they are specific to this one.

Here are all the settings I have tried:

[image: table of all the merge settings tested]

Loving this data. I'd like to try exploring this parameter space by automating merges & running EQ-Bench on each permutation.

Are there other parameters besides these that would be worth exploring?

I would say the 3 main parameters are:

  • the size of the slices (in layers)
  • the backward jump (in number of layers)
  • the size of the last slice

I have tested more than what is shown here, and there are 2 things I noticed (a sketch for automating the sweep follows this list):

  • the size of the first slice does not matter.
  • the size of the last slice needs to be a minimum of 6 layers; anything less always produces complete gibberish as output.
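
To automate the sweep, those three parameters can be turned into mergekit passthrough configs programmatically, and each resulting merge scored with EQ-Bench. Below is a minimal sketch; the repo id, the 32-layer assumption, and the exact slicing convention are assumptions rather than the recipes shown in the image above, so adjust them to match.

```python
# Minimal sketch (assumptions: 32-layer Mistral-style base, exclusive layer_range
# ends as used by mergekit, and one particular slicing convention): walk forward
# in slices of `slice_size`, stepping back `backward_jump` layers between slices,
# and finish with a slice of `last_slice_size` ending at the final layer.
import yaml  # pip install pyyaml

MODEL = "senseable/WestLake-7B-v2"   # assumed repo id
N_LAYERS = 32                        # Mistral 7B


def build_slices(slice_size, backward_jump, last_slice_size):
    last_start = N_LAYERS - last_slice_size
    slices, start = [], 0
    while start + slice_size <= last_start:
        slices.append((start, start + slice_size))
        start += slice_size - backward_jump   # next slice overlaps by `backward_jump`
    slices.append((last_start, N_LAYERS))
    # sanity check: no gap allowed between consecutive slices
    for (_, b1), (a2, _) in zip(slices, slices[1:]):
        assert a2 <= b1, f"gap between layers {b1} and {a2}"
    return slices


def to_mergekit_config(slices):
    return {
        "slices": [{"sources": [{"model": MODEL, "layer_range": list(r)}]} for r in slices],
        "merge_method": "passthrough",
        "dtype": "float16",
    }


layout = build_slices(slice_size=8, backward_jump=4, last_slice_size=8)
print(layout)   # [(0, 8), (4, 12), (8, 16), (12, 20), (16, 24), (24, 32)]
print(yaml.safe_dump(to_mergekit_config(layout), sort_keys=False))
```

Each generated config can then be fed to mergekit and the output benchmarked, which makes the permutation sweep mentioned above mostly mechanical.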

A few other things I have started to notice/suspect:

  • the backward jump needs to be greater than 2, otherwise information stored across layers is irretrievably lost
  • the backward jump should be at most half the size of the slice, otherwise some layers end up repeated more than 2 times
  • when layers are repeated more than 2 times, quality quickly degrades
  • the last slice should be the same size as, or bigger than, any other slice
  • layer importance seems to start low and steadily increase up to the final layer
  • taking the previous point into account, identically sized slices might not be optimal; increasing slice sizes could be better (e.g. 4, 5, 6, 7, 8, 9, 10; checked in the sketch below)
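
As a quick way to check those constraints (no layer repeated more than twice, no gaps), here is a small sketch that counts how often each original layer appears in a candidate layout. The "increasing" layout below is only an illustration of the 4,5,6,7,8,9,10 idea, not a tested recipe.

```python
# Sanity check for a candidate slice layout: count how many times each original
# layer appears, flag anything repeated more than twice, and flag gaps.
from collections import Counter

N_LAYERS = 32  # Mistral 7B (assumption)


def layer_counts(slices):
    """slices: list of (start, end) layer ranges, end exclusive (mergekit style)."""
    counts = Counter()
    for a, b in slices:
        counts.update(range(a, b))
    return counts


# Uniform slices (size 8, backward jump 4, last slice 8)...
uniform = [(0, 8), (4, 12), (8, 16), (12, 20), (16, 24), (24, 32)]
# ...and increasing slice sizes 4,5,6,7,8,9,10 (illustrative layer ranges only).
increasing = [(0, 4), (2, 7), (5, 11), (8, 15), (12, 20), (17, 26), (22, 32)]

for name, layout in [("uniform", uniform), ("increasing", increasing)]:
    counts = layer_counts(layout)
    missing = [l for l in range(N_LAYERS) if counts[l] == 0]
    overused = [l for l, c in counts.items() if c > 2]
    print(f"{name}: total layers = {sum(counts.values())}, "
          f"max repeats = {max(counts.values())}, missing = {missing}, >2x = {overused}")
```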

When I have time, I need to put all these observations and results together and post them on reddit /r/LocalLLaMA

froggeric changed discussion status to closed

Thanks, that is super informative. You and @mlabonne should compare notes.

froggeric changed discussion status to open

If you come on over to the KoboldAI discord ( https://discord.com/channels/849937185893384223/851923311172517908 ) we've actually got a few people working through some similar experiments like this, myself included.

  • The idea of layer importance changing over the course of the model absolutely meshes with what we've been theorising and seeing from analysis of the tensors
  • We've actually been running with the idea that the first and last slices need to be >=8 layers (for anything built from a 7B, >=10 for an 11B). We've done some digging into the tensor values, and found that there's a noticeable shift in how they "behave" at those points across most models; and that cutting either of them down by much more degrades quality pretty rapidly.
  • Like you, we did also find that the last slice can go down to ~6 before it completely degrades, but that 7 and 8 both improve it slightly, and 8 especially makes the slicing math easier. (8 start, 8 end leaves 16 in the middle, or 32 when building a 'standard' 11B)
  • The "more than 2 duplicates" point also matches what we're finding; 2 copies seems to be a noticeable improvement, but 3 is an instant degradation
  • With self-stacks, we've found that duplicating every single layer between 8 and (n-8) actually produces an improved and coherent output (one way to express that layout is sketched after this list). I don't think we've seen any real evidence, in either our testing or the analysis of the tensor data, of information "between" layers.
  • Duplicating in 4s, 8s and 16s is noticeably worse, as were 16-back-8 and 8-back-4, but I don't think we've actually tested 4-back-2 yet: going to have to try that out.
  • SOLAR is a weird beast that behaves differently in a lot of ways, because it's an inherently "broken" stack that's been beaten into effectiveness by extensive training. This has made it exceedingly difficult to stack, and complicated merging with non-SOLAR based models.
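
For concreteness, one way to express the "keep the first 8 and last 8 layers once, duplicate everything in between" layout as a mergekit passthrough config is sketched below. The repo id and the use of repeated single-layer slices are assumptions, not necessarily the exact recipe used in those experiments.

```python
# One possible expression (an assumption, not the verified recipe) of the
# "duplicate every layer between 8 and n-8" self-stack for a 32-layer 7B.
import yaml  # pip install pyyaml

MODEL = "senseable/WestLake-7B-v2"   # assumed repo id; any Mistral 7B fine-tune
N_LAYERS = 32
KEEP = 8                              # untouched head and tail, per the >=8 rule above

ranges = [[0, KEEP]]                                  # first 8 layers, once
for i in range(KEEP, N_LAYERS - KEEP):                # layers 8..23
    ranges += [[i, i + 1], [i, i + 1]]                # each middle layer, twice
ranges.append([N_LAYERS - KEEP, N_LAYERS])            # last 8 layers, once

config = {
    "slices": [{"sources": [{"model": MODEL, "layer_range": r}]} for r in ranges],
    "merge_method": "passthrough",
    "dtype": "float16",
}
print(yaml.safe_dump(config, sort_keys=False))
# Resulting depth: 8 + 2*16 + 8 = 48 layers, i.e. the usual ~11B frankenmerge size.
```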

@akaistormherald Thank you for the detailed information. I will try to head over to the Discord server.

Many of the general concepts in the way LLMs work derive from "Recursive distributed representations", J. B. Pollack, Artificial Intelligence 46(1-2):77-105 (1990). Here is what we can infer from it:

Input Layers: This is where the raw information comes in, and their role is to capture the basic features or building blocks of the input data.

Hidden Layers: These are the heart of the network. Each layer takes the output from the previous layer and does some processing on it. We can say all the hidden layers are important, but their importance might differ. As information progresses through the network, higher layers learn increasingly complex and abstract representations by combining information from lower layers.

Early Hidden Layers (closer to input): These layers might focus on identifying basic building blocks or features from the data. Their work is crucial, but it might not be as complex as later stages.

Later Hidden Layers (closer to output): These layers take the processed information from earlier layers and combine it to form a more complete picture. Their work builds on the foundation laid earlier and plays a bigger role in the final outcome.

Final Layers: They take the refined and combined information from the hidden layers and transform it into the network's answer, prediction, or classification. Since the final layer directly determines the network's output, it has a significant impact on the overall performance. Errors that accumulate through the hidden layers can be amplified in the final layer, leading to inaccurate results.

This may be of interest:
https://github.com/arcee-ai/PruneMe

It maps the differences (and importance) of the layers in a model.
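
The underlying idea can be sketched in a few lines (this is not the PruneMe code itself, just a rough illustration, and the model id is an assumption): measure how much each transformer block changes the hidden state on some sample text. Blocks whose input and output are nearly identical contribute the least and are the most forgiving to prune or duplicate.

```python
# Rough sketch of layer-difference mapping: compare each block's input and
# output hidden states on sample text and report their average cosine similarity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "senseable/WestLake-7B-v2"   # assumed repo id; any causal LM works here
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(
    MODEL, torch_dtype=torch.float16, device_map="auto"
)

text = "Tell me a story about a ship, a storm, and a very stubborn lighthouse. " * 8
inputs = tok(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

hidden = out.hidden_states            # embeddings + one tensor per transformer block
for i in range(1, len(hidden)):
    h_in, h_out = hidden[i - 1][0].float(), hidden[i][0].float()   # (seq_len, dim)
    sim = torch.nn.functional.cosine_similarity(h_in, h_out, dim=-1).mean()
    print(f"block {i:2d}: mean cosine(input, output) = {sim.item():.4f}")
```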

Another option is using "AutoGPTQ" -> GPTQ-quantize a model -> watch the output -> it shows very detailed per-layer errors.

I have been merging models using the pass-through method, 4 models at once
-> FOUR 13B models into a 20B super-model.

It was a chore to get them to "behave" (based on reading a lot of "papers" and trial and error) but it can be done.
A lot of the facts noted in this discussion apply...

Run the merge -> create an FP16 GGML -> quantize to Q4_K_M -> measure perplexity (wiki.train) to test stability.
For the merges I have working (that is, they don't explode or go crazy), a perplexity of 15 down to 9 (at Q4_K_M) is usable.
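
As a rough sketch of that loop in script form (every path, tool name and flag below is an assumption; llama.cpp's convert/quantize/perplexity tools get renamed between versions, so adjust to your local checkout):

```python
# Sketch of the merge -> FP16 -> Q4_K_M -> perplexity loop, shelling out to
# mergekit and llama.cpp. All paths, tool names and flags are assumptions.
import subprocess


def run(cmd):
    print(">>", " ".join(cmd))
    subprocess.run(cmd, check=True)


CONFIG = "merge-config.yml"      # mergekit passthrough config (see earlier sketches)
MERGED = "merged-model"          # output directory for the merged HF model
LLAMA = "llama.cpp"              # local llama.cpp checkout (assumed path)

# 1. run the merge
run(["mergekit-yaml", CONFIG, MERGED, "--cuda"])

# 2. convert the merged model to an FP16 GGUF file
run(["python", f"{LLAMA}/convert.py", MERGED, "--outtype", "f16",
     "--outfile", "merged-f16.gguf"])

# 3. quantize to Q4_K_M
run([f"{LLAMA}/quantize", "merged-f16.gguf", "merged-q4km.gguf", "Q4_K_M"])

# 4. measure perplexity on wiki.train to check the merge hasn't fallen apart
run([f"{LLAMA}/perplexity", "-m", "merged-q4km.gguf", "-f", "wiki.train.raw"])
```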

Gonna need a bigger boat -> next is a 30B model (from these four 13B models).

Going to upload these (20B) to my repo shortly... been testing/creating locally because Colab explodes at this size of merge.
