I've seen you doing this already with Bigstral 12B...

#1
by Kalemnor - opened

What is the point of self merging the same base model with overlapping layers? Can you explain the benefits, if any?

It's smarter. It hallucinates less and retains more of its knowledge. It scores higher. Venus and MegaDolphin were made this way.
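For anyone curious what "made this way" looks like mechanically, here is a minimal sketch of a passthrough-style self-merge done by hand with transformers. The model name, layer count, and layer split are illustrative assumptions, not the exact recipe used for this repo.

```python
# Minimal sketch of a passthrough-style self-merge, done by hand instead of with mergekit.
# Model name and layer ranges are illustrative; adjust for the real base and layer count.
import copy
import torch
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("01-ai/Yi-9B", torch_dtype=torch.bfloat16)

layers = base.model.layers                           # original decoder blocks (assumes 48)
order = list(range(0, 32)) + list(range(16, 48))     # replay layers 16-47 on top of 0-31

# Each repeated block becomes an independent copy, so later fine-tuning can let them diverge.
# (Real merge tools also fix per-layer bookkeeping like attention layer_idx; glossed over here.)
base.model.layers = torch.nn.ModuleList(copy.deepcopy(layers[i]) for i in order)
base.config.num_hidden_layers = len(base.model.layers)

base.save_pretrained("BigYi-self-merge")             # hypothetical output name
```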

Kalemnor changed discussion status to closed
Kalemnor changed discussion status to open

Wouldn't it be a good idea to find a sweet spot for the overlapping regions, so quality goes up while the parameter count stays in check and the size doesn't grow too much?
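To put rough numbers on that size-versus-overlap tradeoff, here is some back-of-the-envelope arithmetic. The per-layer and embedding figures are guesses for a ~9B model, not measurements.

```python
# Back-of-the-envelope: merged parameter count as a function of how many layers overlap.
# The per-layer and embedding figures are rough assumptions for a ~9B base model.
def merged_params(total_layers=48, overlap_layers=16,
                  params_per_layer=170e6, embed_params=0.5e9):
    # In a passthrough self-merge every overlapped layer is duplicated once.
    merged_layers = total_layers + overlap_layers
    return merged_layers * params_per_layer + embed_params

for overlap in (8, 16, 24, 32):
    print(f"overlap {overlap:2d} layers -> ~{merged_params(overlap_layers=overlap) / 1e9:.1f}B params")
```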

@Kalemnor I wrote a big reply for both of you and then lost it clicking on your name by accident. :(

First of all, in principle you could just do it in software: cp --reflink=always last.op next.op is actually Turing complete! Yet things like this never seem to go anywhere.
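(The reason a filesystem trick like that could even work: a passthrough merge just copies tensors, so the overlapped layers should be bit-identical until someone trains on top. A quick sanity check, with hypothetical paths and layer indices:)

```python
# Check that an overlapped layer in the merged checkpoint is a bit-for-bit copy of the
# original, i.e. the extra disk space is pure duplication. Paths and indices are hypothetical.
import torch
from safetensors.torch import load_file

orig = load_file("Yi-9B/model.safetensors")        # assumes single-shard checkpoints
merged = load_file("BigYi/model.safetensors")

# Suppose merged layer 32 is a replay of original layer 16:
a = orig["model.layers.16.self_attn.q_proj.weight"]
b = merged["model.layers.32.self_attn.q_proj.weight"]
print(torch.equal(a, b))                           # True if the merge only copied tensors
```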

Once you take a model like this and continue training, that's... not exactly what happens, but it's convenient to think of it that way!
It IS what you're left with once you quantize again. Train according to scaling laws (more parameters > training length), then infer according to quantization "laws" (more parameters > precision).

I am suspicious about the raw output of a merge like this. It suggests that transformers don't understand repetition and that information theory is just a theory.
That doesn't really agree with my understanding of the world, so rather than get better theory, I'm going to throw software at my doubts.

@ehartford you test these unquantized, right?
I am going to make myself more suspicious by putting this and regular Yi through the same compression algorithms until they come out the same size, or better: an archive of both of them smaller than BigYi decompressed.

In even less plain English, I'm feeling cocky about showing file sizes where:
15 / 9 >> BigYi.zpaq / Yi.zpaq >= BigYi.tar / BigYi.zpaq

(the term on the RHS tends to be about 1.2 in my experience)
and my hunch:

(BigYi + Yi).zpaq < Yi.tar ... WAIT
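If anyone wants to reproduce the comparison, a sketch of the arithmetic is below. It assumes you have already built the archives; the filenames are placeholders for whatever you actually run through tar and zpaq.

```python
# Compute the ratios from the hunch above. Filenames are placeholders; build the
# archives yourself first (e.g. tar the checkpoints, then compress with zpaq).
import os

def gib(path):
    return os.path.getsize(path) / 2**30

big_tar, big_zpaq = gib("BigYi.tar"), gib("BigYi.zpaq")
yi_tar, yi_zpaq = gib("Yi.tar"), gib("Yi.zpaq")
both_zpaq = gib("BigYi_plus_Yi.zpaq")                  # both models packed into one archive

print("parameter ratio        ", 15 / 9)
print("BigYi.zpaq / Yi.zpaq   ", big_zpaq / yi_zpaq)
print("BigYi.tar / BigYi.zpaq ", big_tar / big_zpaq)   # tends to be ~1.2
print("joint archive < Yi.tar?", both_zpaq < yi_tar)
```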

@ehartford Can I ask if you quantized this to fp16 before or after you merged it? * The original was published in BF16, so the ~256 possible exponents get crushed down to ~32 (implicit sign hurts my brain). That's lossy AFAIK, especially if you're going to be doing huge multiplications on FP32-sized values (which is what BF16 is for). Is that how this kind of merge works?
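To make the exponent-range point concrete, here's a tiny check of what a BF16 to FP16 cast does to extreme magnitudes (nothing specific to this model):

```python
# BF16 keeps FP32's 8 exponent bits (256 values) with a short mantissa; FP16 keeps only
# 5 exponent bits (32 values) with a longer mantissa. Casting BF16 -> FP16 therefore
# clips large magnitudes to inf and flushes tiny ones toward zero.
import torch

x = torch.tensor([3.0e38, 1.0e-30], dtype=torch.bfloat16)
print(x.to(torch.float16))   # expect: inf and 0.0
```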

Bonus question: no chance mergekit uses TF32 or even just FP32 to do what it does? (speculation removed; I'm just gonna download it)

*can I also ask... why?

The knowledge is embedded in the layers

As the layers are repeated it gets a 2nd chance to remember things

Therefore it remembers better and hallucinates less
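As a toy picture of what "repeated layers" means at inference time (the real merge makes independent copies that start out identical; this just shows the hidden state flowing through the overlapped depth twice):

```python
# Toy forward pass over a duplicated layer order; nn.Linear blocks stand in for decoder layers.
import torch
import torch.nn as nn

blocks = nn.ModuleList(nn.Linear(8, 8) for _ in range(6))
order = [0, 1, 2, 3, 2, 3, 4, 5]          # layers 2-3 replayed, like an overlapping self-merge

def forward(x):
    for i in order:
        x = blocks[i](x)                  # the overlapped blocks process the stream twice
    return x

print(forward(torch.randn(1, 8)).shape)   # torch.Size([1, 8])
```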

ehartford changed discussion status to closed
