
This is a dumb experiment - don't expect it to be good!

I merged a few Mixtral models together, then tuned only the routing parameters. There was a pretty steep drop in loss with only a bit of training - it went from ~0.99 to ~0.7 over about ten million tokens.
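For the curious, router-only tuning looks roughly like this with transformers - a minimal sketch, not the exact training script, and it assumes the stock Mixtral module naming (`block_sparse_moe.gate`):

```python
import torch
from transformers import AutoModelForCausalLM

# Load the merged model (repo name matches this card).
model = AutoModelForCausalLM.from_pretrained(
    "chargoddard/mixtralmerge-8x7B-rebalanced-test",
    torch_dtype=torch.bfloat16,
)

# Freeze everything except the MoE router gates. In the standard
# transformers Mixtral implementation these live at
# model.layers.<i>.block_sparse_moe.gate.
for name, param in model.named_parameters():
    param.requires_grad = "block_sparse_moe.gate" in name

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable params: {trainable:,} / {total:,}")
```

The router gates are a tiny fraction of the total parameter count, which is why so little training data moved the loss that much.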

I'm hoping this after-the-fact balancing will have reduced some of the nasty behavior typical of current tunes. But maybe it just made it even dumber! We'll see.

Uses ChatML format.
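For reference, a ChatML prompt is structured like this (the message contents are just placeholders):

```
<|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
Hello!<|im_end|>
<|im_start|>assistant
```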

Will update with more details if it turns out to be promising.
