int4 quantization

#1
by KnutJaegersberg - opened

@TheBloke Can you quantize this LoRA model, too?

KnutJaegersberg changed discussion title from 4int quantization to int4 quantization

This might be the best general model on the hub right now that people can use with consumer hardware.

Nice! I'm on it.

Took me a while and @MetaIX beat me to it! But we've used different parameters so you may want to try both.

MetaIX chose no groupsize, ensuring it won't exceed 24GB VRAM. Mine does use groupsize for (hopefully) maximum quality; however, it will exceed 24GB VRAM at somewhere around 1000-1200 tokens returned. A full 2048-token response may use as much as 35GB VRAM, although the number fluctuates (I saw a brief peak at 35GB before it dropped to 28GB).
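For anyone wondering what the groupsize choice looks like in practice, here's a rough sketch using the GPTQConfig route in recent versions of transformers (with optimum and auto-gptq installed). This isn't necessarily the exact tooling used for these quants, and the model path is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path/to/merged-fp16-model"  # placeholder, not the actual repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)

# groupsize 128: generally a bit better quality, at the cost of some extra memory
cfg_groups = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# no groupsize (group_size=-1): slightly lower quality, but keeps memory use down
cfg_no_groups = GPTQConfig(bits=4, group_size=-1, dataset="c4", tokenizer=tokenizer)

# Quantise while loading, then save the 4-bit weights
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=cfg_groups,  # or cfg_no_groups
    device_map="auto",
)
model.save_pretrained("model-gptq-4bit-128g")
```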

Thanks @TheBloke , that's amazing !

Can we finetune a quantized model ?
Or should we finetune the model and then quantize it ?

The normal route is to fine-tune first, then quantise afterwards. This is what most people do, and it ensures the highest quality. But of course it also requires a lot of resources and VRAM to do the fine-tuning.
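To make the order concrete, the usual workflow looks roughly like the sketch below: LoRA fine-tune in fp16 with PEFT, merge the adapter, then quantise the merged model. The model/adapter paths are placeholders and the training loop is omitted:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base_id = "path/to/base-llama-model"  # placeholder

# 1) Fine-tune in fp16 with a LoRA adapter
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
# ... run your Trainer / training loop here ...
model.save_pretrained("my-lora-adapter")

# 2) Merge the adapter back into the base weights
merged = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16),
    "my-lora-adapter",
).merge_and_unload()
merged.save_pretrained("my-merged-fp16-model")

# 3) Only now quantise the merged fp16 model to 4-bit (e.g. with GPTQ),
#    as in the quantisation sketch earlier in the thread.
```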

I know of at least one project that aims to enable fine-tuning on 4-bit quantised models. I've not tried this myself yet, but check out this repo: https://github.com/johnsmith0031/alpaca_lora_4bit

Great, I'll check it out :)

https://github.com/stochasticai/xturing/blob/main/examples/int4_finetuning/README.md
This claims you can train LoRAs on 4-bit quantized models.
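For reference, the int4 fine-tuning example in that repo boils down to something like the following. The API names here are recalled from the linked README and may have changed, so treat this as a sketch and defer to the example itself:

```python
from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

# Load an Alpaca-style instruction dataset (path is a placeholder)
dataset = InstructionDataset("./alpaca_data")

# Create a LLaMA model with a LoRA adapter on top of int4 weights
model = BaseModel.create("llama_lora_int4")

# Fine-tune the LoRA weights while the base model stays 4-bit
model.finetune(dataset=dataset)

# Save the resulting weights
model.save("./llama_lora_int4_finetuned")
```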

Nice, thanks for the link!

KnutJaegersberg changed discussion status to closed

FWIW (I hope this won't re-open the discussion), I've been able to use https://github.com/johnsmith0031/alpaca_lora_4bit to fine-tune 4-bit quantized vanilla LLaMA with a high degree of success. It was fiddly getting it set up, and the docs aren't great, but the LoRAs it creates are usable with text-generation-webui (which also supports training, but not at 4-bit without a "monkey patch").
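As an aside, if you want to sanity-check a trained LoRA outside text-generation-webui, you can apply it to the (unquantised) base model with plain PEFT. A minimal sketch with placeholder paths:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "path/to/base-llama-model"   # placeholder
adapter_path = "path/to/trained-lora"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(model, adapter_path)

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```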

Great, thanks for the info! I've been meaning to test out that repo - and also https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning - but haven't had a chance yet.

Can I ask how long it took, what HW you used etc?
