Goliath-123b using same layer ranges as WinterGoliath-123b?

#2
by jukofyork - opened

Do you know if anybody has tried recreating Goliath-123b using the same layer ranges that WinterGoliath-123b uses:

slices:
  - sources:
    - model: xwin
      layer_range: [0, 16]
  - sources:
    - model: euryale
      layer_range: [8, 24]
  - sources:
    - model: xwin
      layer_range: [16, 32]
  - sources:
    - model: euryale
      layer_range: [24, 40]
  - sources:
    - model: xwin
      layer_range: [32, 48]
  - sources:
    - model: euryale
      layer_range: [40, 56]
  - sources:
    - model: xwin
      layer_range: [48, 64]
  - sources:
    - model: euryale
      layer_range: [56, 72]
  - sources:
    - model: xwin
      layer_range: [64, 80]
merge_method: passthrough
dtype: float16

instead of:

slices:
  - sources:
    - model: xwin
      layer_range: [0, 16]
  - sources:
    - model: euryale
      layer_range: [8, 24]
  - sources:
    - model: xwin
      layer_range: [17, 32]
  - sources:
    - model: euryale
      layer_range: [25, 40]
  - sources:
    - model: xwin
      layer_range: [33, 48]
  - sources:
    - model: euryale
      layer_range: [41, 56]
  - sources:
    - model: xwin
      layer_range: [49, 64]
  - sources:
    - model: euryale
      layer_range: [57, 72]
  - sources:
    - model: xwin
      layer_range: [65, 80]
merge_method: passthrough
dtype: float16

It seems strange that Goliath-120b uses these odd layer ranges, and it makes me wonder whether that was deliberate or whether the creator thought the upper index was inclusive?
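
For reference, here's a quick sketch (mine, not from this thread) that expands both configs and shows exactly which source layers the Goliath-120b ranges never copy. It assumes mergekit's layer_range is half-open, i.e. [0, 16] means layers 0-15, which is what the off-by-one gaps suggest:

def used_layers(slices):
    """Map model name -> set of layer indices copied by the passthrough merge."""
    used = {}
    for model, start, end in slices:
        used.setdefault(model, set()).update(range(start, end))
    return used

wintergoliath_style = [
    ("xwin", 0, 16), ("euryale", 8, 24), ("xwin", 16, 32),
    ("euryale", 24, 40), ("xwin", 32, 48), ("euryale", 40, 56),
    ("xwin", 48, 64), ("euryale", 56, 72), ("xwin", 64, 80),
]
goliath_style = [
    ("xwin", 0, 16), ("euryale", 8, 24), ("xwin", 17, 32),
    ("euryale", 25, 40), ("xwin", 33, 48), ("euryale", 41, 56),
    ("xwin", 49, 64), ("euryale", 57, 72), ("xwin", 65, 80),
]

a, b = used_layers(wintergoliath_style), used_layers(goliath_style)
for model in a:
    print(model, "layers dropped by the Goliath-120b ranges:", sorted(a[model] - b[model]))
print("total stacked layers:",
      sum(end - start for _, start, end in wintergoliath_style), "vs",
      sum(end - start for _, start, end in goliath_style))

Under that assumption, the Goliath-120b ranges drop xwin layers 16, 32, 48, 64 and euryale layers 24, 40, 56, giving a 137-layer stack instead of 144, which would line up with the ~120b vs ~123/124b sizes.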

I'd be interested to see if this does anything to improve the "spelling mistake problem" that Goliath-120b has but WinterGoliath-123b doesn't.

I'll recreate it when I get the time. I'm currently making meme models again (12h making + 2h testing per model + 100% concentrated power of PAIN).

I tried that with BigWeave v7, but with slightly different ranges:

slices:
  - sources:
    - model: Xwin-70b
      layer_range: [0,17]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [8,25]
  - sources:
    - model: Xwin-70b
      layer_range: [17,33]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [25,41]
  - sources:
    - model: Xwin-70b
      layer_range: [33,49]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [41,57]
  - sources:
    - model: Xwin-70b
      layer_range: [49,65]
  - sources:
    - model: Euryale-1.3-70b
      layer_range: [57,73]
  - sources:
    - model: Xwin-70b
      layer_range: [65,80]
merge_method: passthrough
dtype: float16

I didn't upload it because it seemed worse than Goliath-120b. I can try a v7.1 with your suggestion unless @ChuckMcSneed beats me to it :)

Go ahead, @llmixer! You have my blessing! I won't have time to do it in the next ~72h.

@jukofyork The model is up: llmixer/BigWeave-v7.1-124b
Some exl2 quants as well (3, 4, 5, 6bpw): https://huggingface.co/llmixer

Did you give it a try?

Not thoroughly, I just checked that it's not braindead. PPL seems to be higher than Goliath's; maybe skipping these layers (which would basically amount to replacing the individual layers with a bunch of layers) was intentional and part of the secret sauce.

Yeah, it could be. I looked through some of your other merges and saw that the best-performing ones were even more irregular.

Have you seen any cases yet where the merged model's PPL was actually lower than the parent models'?

I was only comparing against Goliath; I didn't quant the base models. Most of the latest experiments had lower PPL than Goliath.
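
If anyone wants to run the comparison against the unquantized parents directly, here's a rough sliding-window perplexity sketch with transformers (my own illustration, not how the numbers above were measured; the eval file and repo IDs are just examples):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def perplexity(model_id, text, window=2048, stride=512):
    # Sliding-window PPL, following the standard transformers recipe.
    tok = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.float16, device_map="auto"
    )
    ids = tok(text, return_tensors="pt").input_ids
    nlls, prev_end = [], 0
    for begin in range(0, ids.size(1), stride):
        end = min(begin + window, ids.size(1))
        trg_len = end - prev_end                 # tokens scored in this window
        input_ids = ids[:, begin:end].to(model.device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100          # don't re-score the overlapping prefix
        with torch.no_grad():
            loss = model(input_ids, labels=target_ids).loss
        nlls.append(loss.float() * trg_len)
        prev_end = end
        if end == ids.size(1):
            break
    return torch.exp(torch.stack(nlls).sum() / prev_end).item()

text = open("wiki.test.raw").read()              # placeholder eval text
for repo in ["llmixer/BigWeave-v7.1-124b", "alpindale/goliath-120b"]:  # example repo IDs
    print(repo, perplexity(repo, text))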

I've been running some experiments on alternative merging patterns and techniques, and the results seem very promising. Models still have weird spelling mistakes, but perform much better on my tests. The EX-EX type merge (more details when I upload) is the first model that scores more than 13 on my SP test (creative writing); the next model below it is Gembo (my hyperoptimized model, merged specifically to perform well on my tests), which has 12.75. More details + model upload soon (~28h), after I've done all the tests (retesting Goliath too, just to confirm that it's not a false positive).
