Could you quantize this at Q4 please?

#1
by Surprisekitty - opened

Quantizing this at Q5 puts it just out of range for standard 24GB cards like the 3090. A Q4 quantization would let a single 3090 run the model, rather than forcing people to splurge on two of them — at which point they could run a 70B model instead. Anyone who wants to run a 34B usually runs a 4.65bpw quant for Exllama or a Q4_K_M quant. If you could please release a Q4 quantization, that would be much appreciated.
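For context, a rough back-of-the-envelope estimate of weight memory at different quantization levels supports this. The bits-per-weight averages below are approximate figures for each format, and the fixed overhead allowance for KV cache and buffers is an assumption (it varies with context length):

```python
# Rough VRAM estimate for a 34B-parameter model at several quant levels.
PARAMS = 34e9  # parameter count

def est_gib(bits_per_weight: float, overhead_gib: float = 1.5) -> float:
    """Approximate weight memory in GiB, plus an assumed fixed allowance
    for KV cache and runtime buffers."""
    return PARAMS * bits_per_weight / 8 / 2**30 + overhead_gib

# Approximate average bits-per-weight for each format (assumed figures):
for name, bpw in [("Q4_K_M", 4.85), ("4.65bpw exl2", 4.65), ("Q5_K_M", 5.69)]:
    print(f"{name:>14}: ~{est_gib(bpw):.1f} GiB")
```

Under these assumptions the Q4_K_M estimate lands comfortably under 24 GiB while the Q5 estimate lands right at or above it, which matches the experience described above.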

Surprisekitty changed discussion title from Why Q5 quant and not Q4? to Could you quantize this at Q4 please?

Sorry for the late response. Will upload a q4_k_m quant in a bit.

Uploaded.
