Help choosing the best quantization for the PC specification below.
Since I don't want to spend hours downloading something that doesn't run, I am wondering which quantization would fit nicely on my system.
CPU: Intel Core i7-13700K
RAM: 64 GB
GPU: RTX 3090, 24 GB dedicated memory
Also, is there any general rule of thumb that we should be following?
Hi,
I am about to upload the IQ1 models, which are the smallest models. Those should fit without any issue.
Thanks for all the good work!
Hello,
I have roughly the same setup as you (RTX 3090, 64 GB RAM, Intel Core i9). For my testing, I used text-generation-webui and the IQ1_M model with 36 layers offloaded to the GPU.
Also, I had to limit the context size to 4096, and the max_new_tokens value seems to have an impact on the quality of the results: I get better results with a max_new_tokens of 512 than with 1024.
However, it is very slow. I only get 1.6 tokens/s, so it is not really usable due to the delays.
I did a quick test last night with an RTX 6000, which allows this model to be loaded fully in VRAM, and I was getting around 25 tokens/s. In my opinion, this model requires more power than a standard gaming PC has.
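
As for a rule of thumb: a quick back-of-the-envelope check I use (my own heuristic, not something official) is that a quantized model's file size is roughly parameter count × bits-per-weight ÷ 8, plus some headroom for the KV cache and activations. The bits-per-weight figures and the 20% overhead below are approximations, and the 70B parameter count is just an illustrative assumption:

```python
# Rough estimate of whether a quantized model fits in VRAM.
# Bits-per-weight values are approximate for common llama.cpp quant types;
# the 20% overhead for KV cache/activations is a guess, not a measurement.
APPROX_BPW = {"IQ1_M": 1.75, "Q4_K_M": 4.85, "Q8_0": 8.5}

def fits_in_vram(params_b: float, quant: str, vram_gb: float,
                 overhead: float = 1.2) -> bool:
    """params_b: parameter count in billions; vram_gb: GPU memory in GB."""
    size_gb = params_b * APPROX_BPW[quant] / 8  # billions of params * bpw bits -> GB
    return size_gb * overhead <= vram_gb

# Hypothetical 70B-parameter model on a 24 GB RTX 3090:
print(fits_in_vram(70, "IQ1_M", 24))   # ~18 GB with overhead -> fits
print(fits_in_vram(70, "Q8_0", 24))    # ~74 GB -> does not fit
```

If the estimate doesn't fit, you can still run with partial GPU offload (as I did with 36 layers), but expect a large speed penalty once layers spill into system RAM.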