Downscaling the `W_q` and `W_k` matrices for repeated layers in franken-merges

#4 · opened by jukofyork

You might be interested in this:

https://github.com/arcee-ai/mergekit/issues/198

In that thread I explain three different views on why downscaling the W_q and W_k matrices (and thus, in turn, scaling the score matrix) might make sense.

The odd-looking 0.7071067812 constant is just 1/sqrt(2), so after the query-key dot-product the net effect is to halve the magnitude of all the score matrix values.
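To spell out the arithmetic behind that (my own summary, not from the linked thread): if W_q and W_k are each scaled by the same factor a, then every query-key dot-product, and hence every entry of the score matrix Q·K^T, gets scaled by a^2. With a = 1/sqrt(2) ≈ 0.7071067812 that gives a^2 = 1/2, i.e. the attention scores are exactly halved before the softmax.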

```yaml
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [0, 10]
    model: miqu-1-70b-sf
- sources:
  - layer_range: [10, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [10, 30]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [20, 40]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [30, 50]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [40, 60]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [50, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [60, 70]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812
        - filter: k_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [70, 80]
    model: miqu-1-70b-sf
```

Copied from the mergekit thread:

Using temperature = 0, repetition penalty = 1 and no system prompt:

### USER
Two people are in a room alone with a machine that requires two people to operate. One person is operating the machine. The machine is functioning properly. What is the other person doing?

### ASSISTANT
The other person is also operating the machine. The statement "requires two people to operate" implies that both individuals are involved in its operation.

https://old.reddit.com/r/LocalLLaMA/comments/1aqxqfr/interesting_answers_i_got_from_miquliz120b_in_a/

### USER
Sally has 3 sisters. Each of her sisters has one brother. How many sisters does Sally have?

### ASSISTANT
Sally has 3 sisters. The information about their brothers is not relevant to the question. So, Sally still has 3 sister.

(this was a copy and paste fail from me, but he didn't fall for it! :grin:)

### USER
Sally (a girl) has 3 brothers. Each brother has 2 sisters. How many sisters does Sally have?

### ASSISTANT
Sally has 2 sisters.

Here's the reasoning:


1. Sally is a girl and has 3 brothers.
2. Each of her brothers has 2 sisters, which means each brother has one sister (Sally) in common with each other, and one more sister in addition to Sally.
3. Since there are 3 brothers, this accounts for 2 additional sisters (1 from each brother's perspective) + Sally herself, making it a total of 3 girls including Sally herself.

So, Sally has 2 sisters in total (excluding herself).

https://old.reddit.com/r/LocalLLaMA/comments/15kntrq/i_asked_a_simple_riddle_to_30_models/

### USER
Solve this family relationship quiz: Marie is Emma's parent. Emma is Carl's parent. Carl is Timothy's parent. Timothy is male. Who's Timothy to Marie?

### ASSISTANT
 Based on the information provided:

1. Marie is Emma's parent.
2. Emma is Carl's parent.
3. Carl is Timothy's parent.
4. Timothy is male.

From point 1, we know that Marie is either the mother or father of Emma (gender not specified).

From point 2, we learn that Emma is the mother or father of Carl (again, gender not specified).

From point 3, we find out that Carl is Timothy's parent. Since Timothy is male, this means that either Carl is his father or his mother. However, since Timothy has already been established as male in the problem statement, it follows that Carl must be Timothy's father.

Now let's put all these relationships together:

1. Marie -> Emma (parent-child)
2. Emma -> Carl (parent-child)
3. Carl -> Timothy (father-son)

Since we know that Marie is Emma's parent, and Emma is the parent of Carl, this makes Marie either the grandmother or grandfather of Timothy. Therefore, Timothy is Marie's great-grandchild or great-grandson/daughter depending on their gender. However, since Timothy has been established as male in the problem statement, it follows that he is Marie's great-grandson.

In conclusion, Timothy is Marie's great-grandson.

https://old.reddit.com/r/LocalLLaMA/comments/1atr1df/using_family_relationship_quizzes_to_measure_the/

https://pastebin.com/MZ7Aj7ZV

Now for the slightly more problematic puzzles that this one is getting wrong. The old miqu-1 and miqu-120b could sometimes solve puzzles like this if you added "Think step-by-step as this is a tricky problem:" onto the end:

### USER
Solve this family relationship quiz: Betty is Brenda's parent. Betty is Joe's parent. Joe is Charles' parent. Brenda is female. Who's Brenda to Charles? Think step-by-step as this is a tricky problem:

### ASSISTANT
 Let's break this down step by step:

1. Betty is Brenda's parent. This means either Betty is Brenda's mother or father.
2. Betty is Joe's parent as well, so either Betty is Joe's mother or father too.
3. Joe is Charles' parent, which means Joe is either Charles' mother or father.
4. Brenda and Joe are siblings since they share a common parent (Betty).
5. Since Brenda is female, neither she nor her sibling (Joe) can be Charles' father, so Joe must be Charles' mother.
6. Therefore, Brenda is Charles' aunt (either his aunt by blood or his half-aunt if Joe is his/her step-parent).

So Brenda is Charles' aunt (either biologically or through marriage).

### USER
Solve this family relationship quiz: Willie is Sara's parent. Willie is Zachary's parent. Zachary is Kyle's parent. Sara is female. Who's Sara to Kyle? Think step-by-step as this is a tricky problem:

### ASSISTANT
 Let's break this down step by step:

1. Willie is Sara's parent, so Willie is Sara's father or mother.
2. Willie is also Zachary's parent, so Willie is Zachary's father or mother as well.
3. Since Willie cannot be both the father and mother of Sara and Zachary (as they are not twins), we can conclude that either Sara and Zachary are siblings or cousins.
4. However, since there is no mention of any other siblings or spouses in the question, it's safe to assume that Sara and Zachary are siblings.
5. Now, we know that Zachary is Kyle's parent, so either Kyle is Zachary's son or daughter.
6. Since Sara and Zachary are siblings, this makes Sara either Kyle's aunt or aunt-in-law (if she married into the family).
7. However, since there is no mention of any marriages in the question, we can rule out the possibility of Sara being Kyle's mother-in-law.
8. Therefore, Sara must be Kyle's aunt.

So, Sara is Kyle's aunt.

### USER
Solve this family relationship quiz: Nathan is Timothy's parent. Nathan is Aaron's parent. Aaron is Edward's parent. Timothy is male. Who's Timothy to Edward? Think step-by-step as this is a tricky problem:

### ASSISTANT
Let's break this down step by step:

1. Nathan is Timothy's parent. This means that either Nathan is Timothy's father or mother.
2. Nathan is also Aaron's parent, which means that either Nathan is Aaron's father or mother as well.
3. Aaron is Edward's parent, so Aaron could be Edward's father or mother.
4. We know that Timothy is male, so Nathan cannot be Timothy's mother because same-sex marriage isn't mentioned in the question and we are assuming heterosexual relationships for simplicity here.
5. Since Timothy is male, Nathan must be Timothy's father (not mother).
6. Now, if Aaron is Edward's parent, then Aaron cannot be a woman because Timothy has already been established as male. So Aaron must be Edward's father.
7. Therefore, since Nathan is both Timothy and Aaron's parent, this means that Timothy and Aaron are siblings (either twins or close in age).
8. Since Aaron is Edward's father, then Timothy must be Edward's uncle (either his maternal or paternal uncle).

So to answer the question "Who's Timothy to Edward?" The answer is that Timothy is either Edward's maternal or paternal uncle.

This also doesn't seem to "lose" the information like the original miqu-120b did:

### USER
Can you summarise the ideas behind Pollack's 1990 "Recursive distributed representations" paper?

### ASSISTANT
 Sure, I'd be happy to summarize the main ideas in Pollack's 1990 paper "Recursive Distributed Representations."

In this influential paper, Jordan Pollack introduced the idea of using connectionist networks with recurrent connections to model hierarchical representations of natural language. The key insight was that by recursively composing simpler representations into more complex ones, a network could learn to represent complex structures such as sentences or even entire stories.

The main contributions of this paper include:

1. **Recursive compositionality**: Pollack proposed that language understanding and generation can be modeled using a connectionist architecture that recursively composes simpler representations into more complex ones. This allows the network to learn hierarchical representations of natural language, such as sentences or even entire stories.
2. **Connectionist networks with recurrent connections**: The model uses recurrent connections to allow information from previous time steps to influence current processing, allowing the network to maintain context over time and build up complex representations incrementally.
3. **Distributed representations**: Pollack proposed that distributed representations can capture the meaning of words, phrases, and even entire stories by distributing their meanings across many hidden units in a connectionist network.
4. **Recursive compositionality**: The model recursively composes simpler representations into more complex ones, allowing the network to learn hierarchical structures such as sentences or even entire stories.

Overall, Pollack's paper laid the groundwork for modern deep learning approaches to natural language processing and has had a lasting impact on the field of artificial intelligence research.

This is a question I have definitely asked the miqu-120b merge before, and he didn't know that Pollack's first name was Jordan and hallucinated something else (even though the original miqu-1 model did know his name).

He doesn't seem to have mentioned the key insight from the paper - compressing sequences into a fixed-length vector (but I don't know if the original miqu-1 knew this either).

@alpindale @ChuckMcSneed @llmixer @divinetaco

Can I get you guys interested in this too?

I'm currently downloading the necessary HF repos to remake goliath-120b as he and miqu-120b are the only franken-merges I have much experience of using, but you guys have a lot more than me!

@froggeric might be interested in this too!

I'm just retrying Wolfram's miqu-103b config now.

Also @froggeric this isn't even using any of the better miqu-1 prompt templates either!!! It's just using the stock llama2 prompt template in llama.cpp server!

Sounds quite complicated. Provide me a mergekit config and I'll test it.

@ChuckMcSneed I've tried my best to explain what this does and the reasoning behind it here:

https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2063716974

but I'm pretty clueless when it comes to the merging 'magic' and don't really understand the reasoning from @abacusai and @wolfram for this:

https://huggingface.co/abacusai/TheProfessor-155b/blob/main/mergekit_config.yml
https://huggingface.co/wolfram/miquliz-120b-v2.0/blob/main/mergekit_config.yml

Maybe this `merge_method: linear` secret sauce is what I was missing?

I'm trying again with your equal 16-block layout from WinterGoliath-123b and `tokenizer_source: model:Xwin-LM-70B-V0.1` added to the bottom to see what happens.

If this doesn't work then it's most likely because of my cluelessness in making the yaml files for multiple models (it definitely worked and appeared to make an even smarter miqu-1-120b yesterday!).

Hopefully the example yaml files I have tried to make are enough for somebody to work from and explain the reasoning for the different blocks, etc.

@alpindale @ChuckMcSneed @llmixer @divinetaco @froggeric

I've done my absolute best to make the idea as clear as possible here now:

https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2063716974

The actual value to scale the score matrix by definitely needs to be experimented with... Using 0.5 worked fine for miqu-1-120b but doesn't seem to work well for the non-self-merges (at least assuming the horribly complex yaml file needed to match the goliath-120b model is correct). I've added a WolframAlpha link at the bottom that should let you calculate the required value of scale_factor to scale the score matrix by whatever you want.
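For reference, because the per-matrix scale gets applied to both W_q and W_k, the value to put in the yaml is the square root of the score-matrix scaling you want (presumably the same calculation the WolframAlpha link does): to scale the scores by 0.5 use sqrt(0.5) ≈ 0.7071, by 0.6 use sqrt(0.6) ≈ 0.7746, by 0.8 use sqrt(0.8) ≈ 0.8944, and by 0.9 use sqrt(0.9) ≈ 0.9487.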

I quickly experimented with downscaling on https://huggingface.co/divinetaco/aranea-tenebris-120b-v1.0
values tried: 0.5, 0.6, 0.8, 0.9

Perplexity scores in my benchmarks didn't shift, but it did modify the responses on manual inspection, which is interesting.
Using the lower values (0.5, 0.6) degraded response quality. It was harder to judge with (0.8, 0.9); it may have improved. Will do further testing.

> doesn't seem to work well for the non-self-merges

The lower values might be especially bad in tenebris due to the irregular merge pattern.

Seeing Miquliz's merge config again did jog my memory regarding @wolfram's trick with the 0 weight merge on first / final layer - got some more nice improvements applying it to tenebris. I might have to upload a v1.1 once I am done toying with projection scale values.

> doesn't seem to work well for the non-self-merges
>
> The lower values might be especially bad in tenebris due to the irregular merge pattern.

I also think one of the other projection values might need to be scaled too: by scaling the score matrix like this we are really just changing the "sharpness" of the softmax output that gets multiplied by the V matrix (in a way this makes the model "less confident"). This can only indirectly deal with the "overshoot" problem - by hoping that the more values from the V matrix you blend together, the more they cancel each other out. The composition of two transformer blocks from different fine-tunes may already be cancelling a lot more on its own, which would explain why scaling the score matrix by 0.5 (ie: using 1/sqrt(2)) isn't working very well.
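A tiny worked example of the "sharpness" effect (my own illustration): attenuating the scores is equivalent to raising the softmax temperature. Take three attention logits [2, 1, 0]: softmax gives roughly [0.67, 0.24, 0.09]. Halve them to [1, 0.5, 0] and softmax gives roughly [0.51, 0.31, 0.19] - a flatter distribution, so more rows of the V matrix get blended together with more even weights.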

I'm going on holiday for a few days tomorrow, so it will be late next week before I can do any more on this. I need to have a really close look at the exact architecture to see if something could be scaled after all the non-linearities so as to avoid the "overshoot" problem... Messing with the input_norm like I thought might work is obviously a bad idea, as this will end up double-scaling q_proj and k_proj, and also altering v_proj (which we definitely don't want to do, as the following MLP activations can't accept something with a lower magnitude as input and be expected to work properly).

> Seeing Miquliz's merge config again did jog my memory regarding @wolfram's trick with the 0 weight merge on first / final layer - got some more nice improvements applying it to tenebris. I might have to upload a v1.1 once I am done toying with projection scale values.

Please share if you figure out a way of incorporating this! When I get time I'm going to try this on the (stock) goliath-120b to see if it makes any difference.

Got it finally tested. The results are mixed.

For Miqu:

  • +Styles became more explicit
  • +Used emojis in one of the styles unprompted (first time a local model did it unprompted on my tests!)
  • +Less verbose than the normal 120b in a good sense
  • +Larger vocabulary than the normal 120b
  • +A bit less censored
  • -Failed one style almost completely (got 0.25 for it, while the others were perfect. The normal 120b got all of them.)
  • -Repetition problem in my poem tests
  • -A bit dumber?

Verdict: Needs a bit of adjustment, but it's mostly good!

For Goliath:

  • +Stronger expressed styles
  • +Larger vocabulary
  • -Made-up words became significantly more prevalent
  • -Severe brain damage
  • -Huge repetition problem in my poem tests

Verdict: This method didn't work here. The created model is too degraded in performance. Not worth it.

Yeah, my attempt with goliath-120b was terrible too.

It could just be that the scaling doesn't need to be so low for multi-model merges, as using multiple models that are less correlated with each other might already be attenuating the score matrix by pulling and pushing in opposite directions.

The value of 0.5 (ie: scale-factor = 1/sqrt(2)) doesn't really have any good theoretical reason for being optimal (for self-merges or multi-model merges alike) either, so it may well be worth experimenting with it to see what effects it has: slightly higher values might keep the apparent improvements but get rid of some of the problems.

As I said in the Mergekit thread: it may be worth doing this on a straight copy of a model (ie: pass through but use all of one model) just to see what effect it has and what sort of failure modes happen.

It might also be interesting to set the value higher than 1 as (I think) this will have the effect of making the model extra confident and could possibly have interesting properties.
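The same toy example as above run in the other direction (again just my illustration): multiplying the logits [2, 1, 0] by 1.5 gives [3, 1.5, 0], and the softmax of that is roughly [0.79, 0.18, 0.04] - a sharper, more "confident" attention distribution.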

@ChuckMcSneed @divinetaco

I've made a pretty big post outlining what I've found in the Mergekit thread:

https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2070900670

Hopefully this will inspire some new merges soon!

@jukofyork

https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2070900670

Very interesting. I did a few knock-out experiments myself when creating this model, with the aim of getting an absurdly high score on my own benchmark.

I replaced parts (noted as xx-yy KO in the image below) of the finetuned models with base llama2 to see if it would make a difference, and it indeed did:
*(image: benchmark scores for the knock-out experiments)*
Benchmark scores decreased, but contrary to my expectations, they did not all behave in the same way. Some were "front-heavy" (xwin, spicyboros, dolphin), others "end-heavy" (euryale, wintergoddess, wizard). All of them showed degraded performance when the middle was knocked out. Alpindale made a lucky guess when putting xwin in the front; if euryale had been first, goliath probably wouldn't have been very good.

Yeah, this looks like it matches quite closely the area before the yellow / "less important" blocks from the paper:

*(image: figure from the paper highlighting the "less important" layer blocks)*

@froggeric @ChuckMcSneed @divinetaco

https://github.com/arcee-ai/mergekit/issues/198#issuecomment-2079950009

These seem to be working really well:

  • No apparent brain damage, no repeating (with --repeat-penalty 1.1) and no informational loss as far as I can tell.
  • For both self-merges and multi-model-merges we set the RESIDUAL_SCALE_FACTOR = sqrt(1/2) for each duplicated layer (see reasoning in previous post + empirical evidence that this seems to work in the last post).
  • For self-merges we set the QK_ATTENUATION_FACTOR = sqrt(1/2) for the first copy of a layer only.

It seems setting QK_ATTENUATION_FACTOR < 1 for multi-model-merges just causes them to get dumber - if the repeated layers for different models are already doing something different then perturbing it further makes less sense, whereas for self-merges it does make sense to attenuate the score matrix to give the model a "bird's eye view" of the context before homing in for the second copy.
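To make the recipe concrete, here's a minimal sketch of one duplicated slice of a self-merge. Note this is my reading of the linked post, not a tested config: in particular I'm assuming RESIDUAL_SCALE_FACTOR is applied via the o_proj and down_proj filters (the two projections that write back into the residual stream) and QK_ATTENUATION_FACTOR via the q_proj and k_proj filters on the first copy only:

```yaml
# Sketch only: one pair of duplicated slices from a self-merge.
# Assumes RESIDUAL_SCALE_FACTOR -> o_proj / down_proj and
# QK_ATTENUATION_FACTOR -> q_proj / k_proj (first copy only).
dtype: float16
merge_method: passthrough
slices:
- sources:
  - layer_range: [10, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: q_proj
          value: 0.7071067812   # QK_ATTENUATION_FACTOR = sqrt(1/2), first copy only
        - filter: k_proj
          value: 0.7071067812
        - filter: o_proj
          value: 0.7071067812   # RESIDUAL_SCALE_FACTOR = sqrt(1/2), every duplicated copy
        - filter: down_proj
          value: 0.7071067812
        - value: 1
- sources:
  - layer_range: [10, 20]
    model: miqu-1-70b-sf
    parameters:
      scale:
        - filter: o_proj
          value: 0.7071067812   # residual scaling only; q_proj/k_proj left at 1
        - filter: down_proj
          value: 0.7071067812
        - value: 1
```

For a multi-model merge, per the bullets above, you would keep the o_proj/down_proj scaling on every duplicated slice but leave q_proj and k_proj at 1 for all copies.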

Interested to see what you guys make of this:

  • The miqu-1 self-merge seems just as smart as miqu-120b from the puzzle tests and actually seems really good at writing too (better than goliath-120b from my bank of 10 tests!).
  • The goliath-esque multi-model-merge seems to have comparable writing ability to goliath-120b from my bank of 10 tests.

The same templates should work for other self-merges and multi-model-merges too (gonna try a wintergoliath-esque version now).

An interesting experiment is to reduce the RESIDUAL_SCALE_FACTOR further (towards 1/2): it can break the models and cause repeating, but it can also add interesting (slightly unhinged) "novelties" to the stories. It also seems to make the output slightly longer too.

If you want to merge 3 copies/models then RESIDUAL_SCALE_FACTOR = sqrt(1/3) for all copies should work in theory, and:

  • QK_ATTENUATION_FACTOR = {sqrt(1/3), sqrt(1/2), 1}
  • QK_ATTENUATION_FACTOR = {sqrt(1/4), sqrt(1/2), 1}
  • QK_ATTENUATION_FACTOR = {sqrt(1/2), (1+sqrt(1/2))/2, 1}

All seem like sensible things to try for multi-model merges.
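For reference, the numeric values those factors work out to (plain arithmetic): sqrt(1/3) ≈ 0.5774, sqrt(1/2) ≈ 0.7071, sqrt(1/4) = 0.5, and (1 + sqrt(1/2))/2 ≈ 0.8536 - these are the numbers that would actually go into the scale filters.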
