### What the hell is going on here?
I have a theory! But, I have to go to bed, so I'm setting this to upload while I sleep.
The 13Bs struggled because they were inherently lopsided. So, with this layout, I not only free up more parameters for further finetuning but also address the imbalance. Crazy? Maybe.
### Results
Unsurprisingly, it is totally demented. It was worth a shot for science's sake, but after watching the per-token perplexity and seeing WHERE it fails, I've come to the conclusion that this line of experimentation is indeed a dead end.
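For anyone curious what "watching the per-token perplexity" looks like in practice, here is a minimal sketch using Hugging Face transformers. The model path and sample text are placeholders I made up, not anything from this card:

```python
# Minimal per-token perplexity probe (sketch). Model path and sample text
# below are placeholders, not part of the original recipe.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./frankenmerge-output"  # hypothetical local path to the merge
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.bfloat16)
model.eval()

text = "Sample passage to probe for perplexity spikes."
input_ids = tokenizer(text, return_tensors="pt").input_ids

with torch.no_grad():
    logits = model(input_ids).logits

# NLL of each token given the preceding context (shift logits vs. targets by one).
log_probs = torch.log_softmax(logits[:, :-1, :].float(), dim=-1)
targets = input_ids[:, 1:]
token_nll = -log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

# exp(NLL) per token makes it obvious WHERE the merge falls apart.
for tok, nll in zip(tokenizer.convert_ids_to_tokens(targets[0].tolist()), token_nll[0]):
    print(f"{tok:>16s}  {torch.exp(nll).item():10.2f}")
```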
7B models are just too small per layer to have the kind of redundancy needed for multiple slices like this, leaving 11B merges as the only really viable enlarged Mistral. Even then, the problems seen here are scaled down but still apparent at 11B, right down to the pattern of which sequences cause massive perplexity spikes.
Perhaps, if one toyed with the layer placement just right, you could get a “solid” >7B Mistral merge. Even then, it would be smaller than I really want to work with. 70Bs and merges like Venus and Goliath prove what seems intuitive: higher-parameter-count models (when executed sanely) will outperform smaller models at certain tasks.
My last foray into this will be a single-join merge that eats a little more into the layers at the beginning and end; hopefully my hypothesis that you can bleed further into the last few layers with Mistral is correct. But multiple joins are a dead end.
### Recipe
```yaml
slices:
  - sources:
      - model: chargoddard/loyal-piano-m7
        layer_range: [0, 25]
  - sources:
      - model: NeverSleep/Noromaid-7b-v0.1.1
        layer_range: [7, 25]
  - sources:
      - model: chargoddard/loyal-piano-m7
        layer_range: [7, 25]
  - sources:
      - model: NeverSleep/Noromaid-7b-v0.1.1
        layer_range: [7, 32]
merge_method: passthrough
dtype: bfloat16
```
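For a sense of scale, here is a small sketch (plain Python, not part of the original card) that adds up what the passthrough stack above produces, assuming mergekit's layer_range entries are end-exclusive:

```python
# Layer budget implied by the recipe above, assuming end-exclusive ranges.
slices = [(0, 25), (7, 25), (7, 25), (7, 32)]  # copied from the layer_range entries

total = sum(end - start for start, end in slices)
print(f"{len(slices)} slices -> {total} layers (a stock Mistral 7B has 32)")
```

If that range semantics holds, the result is 86 decoder layers stacked from two 32-layer donors, which goes some way toward explaining the behavior described above. Assuming a standard mergekit install, a config like this would normally be run with something like `mergekit-yaml recipe.yml ./output-directory`.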