3-bit quantization?

by nicostouch

Since llama.cpp with the new GGML format can split work between CPU and GPU, the q4_0 model, offloading 40 layers to an RTX 4090 and running the rest on a 7900X3D, gets just under 2 tokens/s, which is just shy of feeling usable. I was wondering whether 3-bit quantization is possible, and whether the trade-off in perplexity/speed might give better output than the 30B models while still running at a decent speed?
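For reference, a minimal sketch of the offload setup described above, using the llama-cpp-python bindings rather than the llama.cpp CLI (the bindings, model filename, and context size here are assumptions for illustration, not from this thread):

```python
# Sketch only: assumes llama-cpp-python was installed with CUDA support,
# and that a GGML q4_0 model file exists locally (filename is hypothetical).
from llama_cpp import Llama

llm = Llama(
    model_path="wizardlm-30b.ggmlv3.q4_0.bin",  # hypothetical local path
    n_gpu_layers=40,  # offload 40 transformer layers to the GPU
    n_ctx=2048,       # context window
)

# Generate a short completion; the remaining layers run on the CPU.
out = llm("Explain GPTQ quantization in one sentence.", max_tokens=64)
print(out["choices"][0]["text"])
```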

Yeah, maybe. GPTQ does support 3-bit quantisation (for CUDA only, not Triton). I haven't tested it, and I have a feeling that very few people have, so there are quite possibly going to be issues and bugs.

But I will make a note to try it out with AutoGPTQ sometime soon and see how it goes.
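For anyone who wants to try it themselves, here's a minimal sketch of 3-bit quantisation with AutoGPTQ (the base model ID, calibration text, and output directory are placeholders, and per the above this path is largely untested):

```python
# Sketch only: model ID and calibration example are illustrative placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "huggyllama/llama-7b"  # hypothetical base model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

quantize_config = BaseQuantizeConfig(
    bits=3,          # 3-bit instead of the usual 4-bit
    group_size=128,  # quantization group size
    desc_act=False,  # act-order off, trading some accuracy for speed
)

model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

# AutoGPTQ expects a small set of tokenized calibration examples;
# a real run would use a few hundred samples from a proper dataset.
examples = [tokenizer(
    "GPTQ is a post-training quantization method for LLMs.",
    return_tensors="pt",
)]
model.quantize(examples)
model.save_quantized("llama-7b-3bit-gptq")  # hypothetical output dir
```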
