3-bit quantization?

by nicostouch

Since llama.cpp with the new GGML format can split work between CPU and GPU, the q4_0 model, offloading 40 layers to an RTX 4090 and running the rest on a 7900X3D, gets just under 2 tokens/s, which is just shy of feeling usable. I was wondering whether 3-bit quantization is possible, and whether the trade-off in perplexity/speed might give better output than the 30B models while still running at a decent speed?
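For reference, a minimal sketch of the offload setup described above, using the llama-cpp-python bindings rather than the llama.cpp CLI (the bindings, model filename, and context size here are assumptions for illustration, not from this thread):

```python
# Sketch only: assumes llama-cpp-python was installed with CUDA support,
# and that a GGML q4_0 model file exists locally (filename is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="wizardlm-30b.ggmlv3.q4_0.bin",  # hypothetical local path
    n_gpu_layers=40,  # offload 40 transformer layers to the GPU
    n_ctx=2048,       # context window
)

# Generate a short completion; the remaining layers run on the CPU.
out = llm("Explain GPTQ quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```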

Yeah, maybe. GPTQ does support 3-bit quantisation (for CUDA only, not Triton). I haven't tested it, and I have a feeling that very few people have, so there are quite possibly going to be issues and bugs.

But I will make a note to try it out with AutoGPTQ sometime soon and see how it goes.
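For anyone who wants to try it themselves, here's a minimal sketch of 3-bit quantisation with AutoGPTQ (the base model ID, calibration text, and output directory are placeholders, and per the above this path is largely untested):

```python
# Sketch only: model ID and calibration example are illustrative placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "huggyllama/llama-7b"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=3,          # 3-bit instead of the usual 4-bit
    group_size=128,  # quantization group size
    desc_act=False,  # act-order off, trading some accuracy for speed
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# AutoGPTQ expects a small set of tokenized calibration examples;
# a real run would use a few hundred samples from a proper dataset.
examples = [tokenizer(
    "GPTQ is a post-training quantization method for LLMs.",
    return_tensors="pt",
)]
model.quantize(examples)
model.save_quantized("llama-7b-3bit-gptq")  # hypothetical output dir
```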
