IQ1_S or IQ1_M for low RAM/VRAM computers

#20
by teneriffa - opened

Or if you could upload the imatrix.dat, that would be very welcome for low-spec computers.

So I tried to do the 1-bit quants, but it asked for imatrix data! I have never done that before - could you tell me how? I can generate the imatrix and share all the IQ1 models quickly.

  1. Grab a copy of group_10_merged.txt from https://github.com/ggerganov/llama.cpp/discussions/5263
  2. W/ the f16 gguf file, run: ~/llama.cpp/imatrix -m ggml-model.f16.gguf -f group_10_merged.txt
  3. Wait a while;
  4. When running quantize, add this arg: --imatrix imatrix.dat (see the full command sketch after this list)
    \o/
    (The quality of all your other low-bit-rate quantizations will improve as well!)

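For reference, here is a sketch of the full sequence end to end (the -o flag and the imatrix.dat output name are llama.cpp defaults as far as I know, and all file names are placeholders):

    # compute the importance matrix from the f16 model and the calibration text
    ~/llama.cpp/imatrix -m ggml-model.f16.gguf -f group_10_merged.txt -o imatrix.dat
    # quantize with that imatrix (IQ1_S shown; the same pattern works for IQ1_M etc.)
    ~/llama.cpp/quantize --imatrix imatrix.dat ggml-model.f16.gguf ggml-model.IQ1_S.gguf IQ1_S
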
Will do this in an hour! Thanks a lot! So I do this for IQ1_S and IQ1_M?

At minimum, yes. The same imatrix.dat file can be used for all quantization levels, though - it would be good to remake all of the IQ* quants at minimum, plus any of the others you can!

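Since one imatrix.dat covers every quantization type, re-making the whole set can be scripted. A rough sketch (the list of quant types and the file names are just examples):

    for t in IQ1_S IQ1_M IQ2_XXS IQ2_XS IQ3_XXS Q3_K_S Q4_K_M; do
        ~/llama.cpp/quantize --imatrix imatrix.dat ggml-model.f16.gguf ggml-model.${t}.gguf ${t}
    done
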
You seem to know more about this imatrix - does it apply to all the quantized models starting with IQ, regardless of their size? If so, why doesn't it happen automatically inside the quantize script? (just asking out of curiosity)

I still have the 16-bit model, which takes forever to make. I will do the imatrix, start with the 1-bit quants, and then see what other IQ quants I have.

imatrix.dat is effective for quants Q5_K_M and smaller. Even the perplexity of Q4_0 or Q3_K_S will improve with imatrix.dat.

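One way to see the difference is to measure perplexity with llama.cpp's perplexity tool on a quant made with and without the imatrix. A sketch (the test file is an assumption - wikitext-2's wiki.test.raw is a common choice):

    ~/llama.cpp/perplexity -m ggml-model.Q4_0.gguf -f wiki.test.raw -c 512
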
So not the groups_merged.txt, but the group_10_merged.txt?

I prefer groups_merged.txt, but it’s up to you. Some people use wiki.train.raw from wikitext, but it is very large.

OK, I'll go with groups_merged.txt, which seems more diverse.

system_info: n_threads = 64 / 128 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 |
compute_imatrix: tokenizing the input ..
compute_imatrix: tokenization took 110.547 ms
compute_imatrix: computing over 105 chunks with batch_size 512
compute_imatrix: 62.96 seconds per pass - ETA 1 hours 50.17 minutes
[1]2.9595,[2]2.4039,

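For a run like the one logged above, the thread count and chunk size can be set with llama.cpp's usual options. A sketch (the values are only read off the log above):

    ~/llama.cpp/imatrix -m ggml-model.f16.gguf -f groups_merged.txt -o imatrix.dat -t 64 -c 512
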
I have uploaded both IQ1_S and IQ1_M; the IQ1_M took a long time! I think the imatrix made that one much slower. I'll see if I can evaluate the other quants and check how much difference the imatrix makes.

I really appreciate it! Thank you very much!!!

teneriffa changed discussion status to closed

Thank you for sharing how to do imatrix, appreciate it! :)
