Is there documentation for quantization alignment in long text?

#4 opened by Trangle

Quantized on ~300K tokens of two Vicuna-format chats, a sci-fi story, and a fiction story at long context. This should yield better storywriting performance than the default exl2 quantization.
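For context, a custom calibration set like that is typically packed into a single file and handed to the quantizer. Below is a minimal sketch of assembling such a dataset with pandas; the file names, the model id, and the ~300K-token budget are placeholders, not the exact data used here.

```python
# Sketch: pack long-form chat/fiction text into a parquet calibration file
# for an exl2 quantization run. File names and token budget are illustrative.
import pandas as pd
from transformers import AutoTokenizer

SOURCES = ["vicuna_chat_1.txt", "vicuna_chat_2.txt",
           "scifi_story.txt", "fiction_story.txt"]   # hypothetical inputs
TOKEN_BUDGET = 300_000                               # ~300K tokens total

tok = AutoTokenizer.from_pretrained("model-to-quantize")  # placeholder model id

rows, used = [], 0
for path in SOURCES:
    with open(path, encoding="utf-8") as f:
        text = f.read()
    n_tokens = len(tok(text).input_ids)
    if used + n_tokens > TOKEN_BUDGET:
        break
    rows.append({"text": text})
    used += n_tokens

pd.DataFrame(rows).to_parquet("calibration.parquet")
print(f"Wrote {len(rows)} documents, ~{used} tokens")
```

The resulting parquet would then be passed as the calibration dataset to the quantizer (something like `-c calibration.parquet` for exllamav2's convert.py, if the version you are using exposes that option; check its current documentation).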

You posted this at an interesting time. There is a discussion of just what makes useful calibration data: https://github.com/ggerganov/llama.cpp/discussions/5006

As well as parallel discussions on Reddit and Discord.

In a nutshell, it appears that my strategy of "quantize on a lot of fiction" may be useless. It's not really worth documenting what I did because, as it turns out, it's particularly bad for exllama below 4bpw. I would not recommend using this quantization; use LoneStriker's generic quantizations instead for now.

Just a random update to this: I found my exl2 quants had very high perplexity at short context, but relatively low perplexity at long context.

Perhaps they were "overtuned" to long context.
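One way to check that kind of asymmetry is to measure perplexity on the same held-out text at two window sizes. A minimal sketch using transformers follows; the model id, evaluation text, and window sizes are placeholders, and a real exl2 model would normally be loaded through exllamav2 rather than plain transformers.

```python
# Sketch: compare perplexity of a causal LM at a short vs. a long context window.
# Model id, eval text, and window sizes are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "some/quantized-model"        # placeholder
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="auto")
model.eval()

text = open("eval_fiction.txt", encoding="utf-8").read()   # hypothetical eval set
ids = tok(text, return_tensors="pt").input_ids[0]

def perplexity(window: int) -> float:
    # Non-overlapping windows; each forward pass scores window-1 tokens.
    nlls, n_tokens = [], 0
    for start in range(0, ids.size(0) - 1, window):
        chunk = ids[start:start + window].unsqueeze(0).to(model.device)
        if chunk.size(1) < 2:
            break
        with torch.no_grad():
            out = model(chunk, labels=chunk)
        nlls.append(out.loss * (chunk.size(1) - 1))   # un-average the loss
        n_tokens += chunk.size(1) - 1
    return torch.exp(torch.stack(nlls).sum() / n_tokens).item()

print("ppl @ 2048 :", perplexity(2048))
print("ppl @ 8192 :", perplexity(8192))
```

If the short-window number comes out much higher than the long-window one, that would match the "overtuned to long context" suspicion above.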
