Any chance of a 16K model?

#2 opened by MB7977

Thank you for your wonderful work. Unfortunately, on Linux, even with FA2, the 128g GPTQ version of this model cannot be loaded at 32K context on 2x3090s. Are there any plans to train a 16K version that would be usable by a broader audience? Truncating max_seq_length to 16K on load seems to degrade performance. I'm going to quantize to EXL2 format so that I can load at 32K, but that will mean a very low bitrate.

If you use rope scaling = 8 and max_seq_len = 16K, it should perform like a 16K model (just make sure you don't switch to rope scaling = 4). At 16K it flat out beats any 16K fine-tune I've made on raw perplexity. Maybe with a lot more training at rope scaling = 4 a dedicated 16K model could do better, but I don't think that would be worth much: with rope scaling = 8 the PPL drops monotonically all the way out to 32K.
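
For reference, here's a minimal sketch of what that could look like if you're loading the GPTQ weights through transformers with FA2. The repo id is a placeholder, and the exact kwargs are an assumption about your load path; the point is just to keep the linear RoPE factor at 8 while capping the context you allocate at 16K:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder -- substitute the actual GPTQ checkpoint you are loading.
model_id = "your-org/your-32k-gptq-model"

tokenizer = AutoTokenizer.from_pretrained(model_id)

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",                       # spread layers across the 2x3090s
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2", # FA2, as in your current setup
    # Keep the scaling factor the model was trained with (8),
    # don't halve it to 4 just because you only run 16K context.
    rope_scaling={"type": "linear", "factor": 8.0},
    # Cap the usable context (and KV cache) at 16K to fit in VRAM.
    max_position_embeddings=16384,
)
```

The same idea applies to other loaders (exllama, text-generation-webui, etc.): set the compression/scaling to 8 and the max sequence length to 16384.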

Interesting, thank you. I'll give that a go. I was trying 16K at a scaling factor of 4.
