Were all the quantizations produced w/ importance matrices?

#19
by venketh - opened

Were all the quantizations produced w/ importance matrices?

Do you mean the new imatrix? If yes, then the answer is no. Actually, I didn't even know about this; I always thought only the IQ_1 models need that data, since quantizing to IQ_1 fails without it.

I am making one now for the IQ_1 quants:

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 110.547 ms
compute_imatrix: computing over 105 chunks with batch_size 512
compute_imatrix: 62.96 seconds per pass - ETA 1 hours 50.17 minutes
[1]2.9595,[2]2.4039,[3]2.4600,[4]2.4891,
save_imatrix: stored collected data after 10 chunks in imatrix.dat
[5]2.7522,[6]2.7359,[7]2.5348,[8]2.8533,[9]2.7622,
save_imatrix: stored collected data after 20 chunks in imatrix.dat
[10]3.0378,[11]3.1901,[12]3.1396,[13]3.3683,[14]3.6637,
save_imatrix: stored collected data after 30 chunks in imatrix.dat
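For anyone following along, the command is along these lines (model and calibration file names are just placeholders):

./imatrix -m model-f16.gguf -f calibration-data.txt -o imatrix.dat -t 64

It streams the calibration text through the model in chunks and periodically saves the collected activation statistics to imatrix.dat, which is what the log above shows.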

@MaziyarPanahi

I tried the IQ4_XS and it had issues with severely degraded quality.

Tests have shown that loading an 8-bit model might be sufficient to create the imatrix too, if you don't have enough memory for F16.
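In practice that should just mean pointing the imatrix tool at the Q8_0 file instead of the F16 one, something like (file names are placeholders):

./imatrix -m model-Q8_0.gguf -f calibration-data.txt -o imatrix.dat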

Here's the link for more insight:
https://github.com/ggerganov/llama.cpp/discussions/5263

Thanks for your work once more!

Hi,

I do have enough memory, but I thought the smaller quants would benefit most from this, and that Q4 should be fine. For instance, you cannot quantize to IQ_1 without imatrix data; it's mandatory.

Could you please test Q4_K_M or Q4_K_S instead, just to verify?
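For reference, the imatrix is consumed at the quantize step; roughly like this (file names are placeholders):

./quantize --imatrix imatrix.dat model-f16.gguf model-IQ1_S.gguf IQ1_S

For the IQ_1 types, quantize refuses to run without --imatrix; for Q4_K_M/Q4_K_S the flag is optional but should improve quality.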

@MaziyarPanahi I'm sorry, what I meant was Q4_K_S (I had downloaded both).

It had repetition issues in its responses (using LM Studio): not actual direct repeats, but incorrect answers when analyzing the code test prompt I gave it. It repeated the same "issue" for every variable with the exact same response, only swapping in the variable names, mixed with hallucinations.

IQ4_XS worked fine for the same prompt, with great responses; sometimes a very small hallucination, nothing big.

I did not test it thoroughly, since IQ4_XS already did a good job, and I deleted the download before I came to this thread. I'm limited by hardware (only 1x RTX 4090), so it's very slow, around 3 tokens/s, and I would have to redownload it to test more.

I recommend using the imatrix regardless; it gives you better quality during quantization. Just make sure you have a good dataset for it, as discussed in the thread I linked.

EDIT:
WizardLM Q4_K_S worked fine too.

That seems to be good advice. I am going to make new quants based on the imatrix this weekend and run some perplexity tests. If it holds for everything, then I'll make it the default process. (It just takes forever to make. :D)
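The perplexity runs will look something like this (model and test file are just examples):

./perplexity -m model-Q4_K_S.gguf -f wiki.test.raw

comparing the final PPL between the imatrix and non-imatrix quants of the same type.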

Great! =) If you don't use too big a dataset, it shouldn't take as long, provided you have the memory for it; if you are limited by memory, you can use 8-bit to produce the imatrix, and it doesn't make much difference. Cheers mate, looking forward to your continued work!
