Have you considered "grafting" the initial layers only of `Miqu-1` or the `codellama-70b` base model?
Have you tried keeping the original Goliath and just merging the first few layers of `Miqu-1` into it?
I did a lot of experimentation with trying to merge coding models and found that the last 2 layers seem to be key to how a model transforms its internal latent representation into the required categorical distribution over tokens. I wouldn't be surprised if the first few layers are similarly responsible for transforming the rotary positional embeddings into the model's latent space.
It also seems that the last layers are responsible for interpreting the prompt format:
- The `deepseek-coder` model will let you stack the full 62 layers of the base model on top of the 62 layers of the instruct model and it functions almost perfectly (it will follow instructions, etc.) apart from being unable to stop! (See the config sketch after this list.)
- The `codellama` model (and fine-tunes of it) will let you stack the first 46 layers of the base model (i.e. all but the last 2 layers) on top of the instruct model, and again it works 99% fine apart from being unable to stop and the occasional weird word when writing out large sections of code.
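For reference, a layer-stack like the deepseek-coder one above can be expressed as a mergekit "passthrough" merge. The sketch below just writes such a config out from Python; the 33B repo names and the schema details are from memory, so treat it as an illustration rather than a tested recipe.

```python
# Illustrative only: writes a mergekit "passthrough" config that stacks all 62
# layers of the deepseek-coder base model on top of the 62 instruct layers.
# The HF repo names (33B variants) are assumptions, not tested values.
import yaml

config = {
    "merge_method": "passthrough",
    "dtype": "float16",
    "slices": [
        # Instruct model first (bottom of the stack)...
        {"sources": [{"model": "deepseek-ai/deepseek-coder-33b-instruct",
                      "layer_range": [0, 62]}]},
        # ...then the full base model stacked on top.
        {"sources": [{"model": "deepseek-ai/deepseek-coder-33b-base",
                      "layer_range": [0, 62]}]},
    ],
}

with open("deepseek-stack.yml", "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)

# Then something like: mergekit-yaml deepseek-stack.yml ./deepseek-coder-stacked
```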
If you stack them in the opposite order you get gibberish, and it seems the last layers are the ones most important for keeping the "character" of the model and the way it reads in the prompt format (contrary to what I would have assumed!).
If the early transformer blocks are just encoding high-level features in a similar way to the early layers of a convnet, then there is a good chance that a "grafted" version of Goliath and Miqu-1 will actually behave almost the same (but hopefully pick up the long-context ability).
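If you wanted to try that directly, a crude way to do the graft (without mergekit) is to copy the first few transformer blocks' weights across, since Goliath and Miqu-1 are both Llama-family models with the same hidden size. A minimal sketch, assuming the `alpindale/goliath-120b` and `152334H/miqu-1-70b-sf` repos and an arbitrary choice of 4 layers:

```python
# Crude "graft" sketch: overwrite Goliath's first few transformer blocks with
# the corresponding Miqu-1 blocks. The repo names and N_GRAFT are placeholders,
# and loading both models in fp16 needs a lot of RAM.
import torch
from transformers import AutoModelForCausalLM

N_GRAFT = 4  # how many early layers to take from the donor

goliath = AutoModelForCausalLM.from_pretrained(
    "alpindale/goliath-120b", torch_dtype=torch.float16
)
miqu = AutoModelForCausalLM.from_pretrained(
    "152334H/miqu-1-70b-sf", torch_dtype=torch.float16
)

# Both are Llama-architecture models with hidden size 8192, so the per-layer
# weight shapes match and the blocks can be copied over one for one.
for i in range(N_GRAFT):
    goliath.model.layers[i].load_state_dict(miqu.model.layers[i].state_dict())

goliath.save_pretrained("goliath-120b-miqu-grafted")
```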
Another possible donor to look at is the recently released `codellama-70b` base model, as it too has a longer context of 16k (sadly the Python and instruct models are only 4k for some reason).