Goliath-123b using same layer ranges as WinterGoliath-123b?

#2
by jukofyork - opened

Do you know if anybody has tried recreating Goliath-123b using the same layer ranges that WinterGoliath-123b uses:

slices:
  - sources:
    - model: xwin
      layer_range: [0, 16]
  - sources:
    - model: euryale
      layer_range: [8, 24]
  - sources:
    - model: xwin
      layer_range: [16, 32]
  - sources:
    - model: euryale
      layer_range: [24, 40]
  - sources:
    - model: xwin
      layer_range: [32, 48]
  - sources:
    - model: euryale
      layer_range: [40, 56]
  - sources:
    - model: xwin
      layer_range: [48, 64]
  - sources:
    - model: euryale
      layer_range: [56, 72]
  - sources:
    - model: xwin
      layer_range: [64, 80]
merge_method: passthrough
dtype: float16

instead of:

slices:
  - sources:
    - model: xwin
      layer_range: [0, 16]
  - sources:
    - model: euryale
      layer_range: [8, 24]
  - sources:
    - model: xwin
      layer_range: [17, 32]
  - sources:
    - model: euryale
      layer_range: [25, 40]
  - sources:
    - model: xwin
      layer_range: [33, 48]
  - sources:
    - model: euryale
      layer_range: [41, 56]
  - sources:
    - model: xwin
      layer_range: [49, 64]
  - sources:
    - model: euryale
      layer_range: [57, 72]
  - sources:
    - model: xwin
      layer_range: [65, 80]
merge_method: passthrough
dtype: float16

It seems strange that Goliath-120b uses these odd layer ranges, and it makes me wonder whether that was deliberate or whether the creator thought the upper index was inclusive?
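
For reference, here's a quick sketch (mine, not from this thread) that expands both configs and shows exactly which source layers the Goliath-120b ranges never copy. It assumes mergekit's layer_range is half-open, i.e. [0, 16] means layers 0-15, which is what the off-by-one gaps suggest:

def used_layers(slices):
    """Map model name -> set of layer indices copied by the passthrough merge."""
    used = {}
    for model, start, end in slices:
        used.setdefault(model, set()).update(range(start, end))
    return used

wintergoliath_style = [
    ("xwin", 0, 16), ("euryale", 8, 24), ("xwin", 16, 32),
    ("euryale", 24, 40), ("xwin", 32, 48), ("euryale", 40, 56),
    ("xwin", 48, 64), ("euryale", 56, 72), ("xwin", 64, 80),
]
goliath_style = [
    ("xwin", 0, 16), ("euryale", 8, 24), ("xwin", 17, 32),
    ("euryale", 25, 40), ("xwin", 33, 48), ("euryale", 41, 56),
    ("xwin", 49, 64), ("euryale", 57, 72), ("xwin", 65, 80),
]

a, b = used_layers(wintergoliath_style), used_layers(goliath_style)
for model in a:
    print(model, "layers dropped by the Goliath-120b ranges:", sorted(a[model] - b[model]))
print("total stacked layers:",
      sum(end - start for _, start, end in wintergoliath_style), "vs",
      sum(end - start for _, start, end in goliath_style))

Under that assumption, the Goliath-120b ranges drop xwin layers 16, 32, 48, 64 and euryale layers 24, 40, 56, giving a 137-layer stack instead of 144, which would line up with the ~120b vs ~123/124b sizes.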

I'd be interested to see if this does anything to improve the "spelling mistake problem" that Goliath-120b has but WinterGoliath-123b doesn't.

I'll recreate it when I get the time. I'm currently making meme models again (12h making + 2h testing per model + 100% concentrated power of PAIN).

I tried that with BigWeave v7, but with slightly different ranges:

slices:
  - sources:
    - model: Xwin-70b
      layer_range: [0,17]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [8,25]
  - sources:
    - model: Xwin-70b
      layer_range: [17,33]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [25,41]
  - sources:
    - model: Xwin-70b
      layer_range: [33,49]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [41,57]
  - sources:
    - model: Xwin-70b
      layer_range: [49,65]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [57,73]
  - sources:
    - model: Xwin-70b
      layer_range: [65,80]
merge_method: passthrough
dtype: float16

I didn't upload it because it seemed worse than Goliath-120b. I can try a v7.1 with your suggestion unless @ChuckMcSneed beats me to it :)

Go ahead, @llmixer! You have my blessing! I won't have time to do it in the next ~72h.

@jukofyork The model is up: llmixer/BigWeave-v7.1-124b
Some exl2 quants as well (3, 4, 5, 6bpw): https://huggingface.co/llmixer

Did you give it a try?

Not thoroughly, I just checked that it's not braindead. PPL seems to be higher than Goliath's; maybe skipping these layers (which would basically amount to replacing the individual layers with a bunch of layers) was intentional and part of the secret sauce.

Yeah, it could be. I looked through some of your other merges and saw that the best-performing ones were even more irregular.

Have you seen any cases yet where the merged model's PPL was actually lower than the parent models'?

I was only comparing against Goliath; I didn't quant the base models. Most of the latest experiments had lower PPL than Goliath.
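
If anyone wants to run the comparison against the unquantized parents directly, here's a rough sliding-window perplexity sketch with transformers (my own illustration, not how the numbers above were measured; the eval file and repo IDs are just examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id, text, window=2048, stride=512):
    # Sliding-window PPL, following the standard transformers recipe.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    ids = tok(text, return_tensors="pt").input_ids
    nlls, prev_end = [], 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + window, ids.size(1))
        trg_len = end - prev_end                 # tokens scored in this window
        input_ids = ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100          # don't re-score the overlapping prefix
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss.float() * trg_len)
        prev_end = end
        if end == ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / prev_end).item()

text = open("wiki.test.raw").read()              # placeholder eval text
for repo in ["llmixer/BigWeave-v7.1-124b", "alpindale/goliath-120b"]:  # example repo IDs
    print(repo, perplexity(repo, text))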

I've been running some experiments on alternative merging patterns and techniques, and the results seem very promising. Models still have weird spelling mistakes, but perform much better on my tests. The EX-EX type merge (more details when I upload) is the first model that scores more than 13 on my SP test (creative writing); the next model below it is Gembo (my hyperoptimized model, merged specifically to perform well on my tests), which has 12.75. More details + model upload soon (~28h), after I've done all the tests (retesting Goliath too, just to confirm that it's not a false positive).
