int4 quantization

#1
by KnutJaegersberg - opened

@TheBloke Can you quantize this LoRA model, too?

KnutJaegersberg changed discussion title from 4int quantization to int4 quantization

This might be the best general model on the hub right now that people can use with consumer hardware.

Nice! I'm on it.

Took me a while and @MetaIX beat me to it! But we've used different parameters so you may want to try both.

MetaIX chose no groupsize, ensuring it won't exceed 24GB VRAM. Mine does use groupsize for (hopefully) maximum quality; however, it will exceed 24GB VRAM at somewhere around 1000-1200 tokens returned. A full 2048-token response may use as much as 35GB VRAM, although the number fluctuates (I saw a brief peak at 35GB before it dropped to 28GB).
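For anyone wondering what the groupsize choice looks like in practice, here's a rough sketch using the GPTQConfig route in recent versions of transformers (with optimum and auto-gptq installed). This isn't necessarily the exact tooling used for these quants, and the model path is a placeholder:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "path/to/merged-fp16-model"  # placeholder, not the actual repo name
tokenizer = AutoTokenizer.from_pretrained(model_id)

# groupsize 128: generally a bit better quality, at the cost of some extra memory
cfg_groups = GPTQConfig(bits=4, group_size=128, dataset="c4", tokenizer=tokenizer)

# no groupsize (group_size=-1): slightly lower quality, but keeps memory use down
cfg_no_groups = GPTQConfig(bits=4, group_size=-1, dataset="c4", tokenizer=tokenizer)

# Quantise while loading, then save the 4-bit weights
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=cfg_groups,  # or cfg_no_groups
    device_map="auto",
)
model.save_pretrained("model-gptq-4bit-128g")
```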

Thanks @TheBloke , that's amazing !

Can we finetune a quantized model ?
Or should we finetune the model and then quantize it ?

The normal route is to fine-tune first, then quantise afterwards. This is what most people do, and it ensures the highest quality. But of course it also requires a lot of resources and VRAM to do the fine-tuning.
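To make the order concrete, the usual workflow looks roughly like the sketch below: LoRA fine-tune in fp16 with PEFT, merge the adapter, then quantise the merged model. The model/adapter paths are placeholders and the training loop is omitted:

```python
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, PeftModel, get_peft_model

base_id = "path/to/base-llama-model"  # placeholder

# 1) Fine-tune in fp16 with a LoRA adapter
model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16)
lora_cfg = LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_cfg)
# ... run your Trainer / training loop here ...
model.save_pretrained("my-lora-adapter")

# 2) Merge the adapter back into the base weights
merged = PeftModel.from_pretrained(
    AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.float16),
    "my-lora-adapter",
).merge_and_unload()
merged.save_pretrained("my-merged-fp16-model")

# 3) Only now quantise the merged fp16 model to 4-bit (e.g. with GPTQ),
#    as in the quantisation sketch earlier in the thread.
```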

I know of at least one project that aims to enable fine-tuning on 4-bit quantised models. I've not tried this myself yet, but check out this repo: https://github.com/johnsmith0031/alpaca_lora_4bit

Great, I'll check it out :)

https://github.com/stochasticai/xturing/blob/main/examples/int4_finetuning/README.md
This claims you can train LoRAs on 4-bit quantized models.
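For reference, the int4 fine-tuning example in that repo boils down to something like the following. The API names here are recalled from the linked README and may have changed, so treat this as a sketch and defer to the example itself:

```python
from xturing.datasets.instruction_dataset import InstructionDataset
from xturing.models import BaseModel

# Load an Alpaca-style instruction dataset (path is a placeholder)
dataset = InstructionDataset("./alpaca_data")

# Create a LLaMA model with a LoRA adapter on top of int4 weights
model = BaseModel.create("llama_lora_int4")

# Fine-tune the LoRA weights while the base model stays 4-bit
model.finetune(dataset=dataset)

# Save the resulting weights
model.save("./llama_lora_int4_finetuned")
```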

Nice, thanks for the link!

KnutJaegersberg changed discussion status to closed

FWIW (I hope this won't re-open the discussion), I've been able to use https://github.com/johnsmith0031/alpaca_lora_4bit to fine-tune 4-bit quantized vanilla LLaMA with a high degree of success. It was fiddly getting it set up, and the docs aren't great, but the LoRAs it creates are usable with text-generation-webui (which also supports training, but not at 4-bit without a "monkey patch").
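As an aside, if you want to sanity-check a trained LoRA outside text-generation-webui, you can apply it to the (unquantised) base model with plain PEFT. A minimal sketch with placeholder paths:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "path/to/base-llama-model"   # placeholder
adapter_path = "path/to/trained-lora"  # placeholder

tokenizer = AutoTokenizer.from_pretrained(base_id)
model = AutoModelForCausalLM.from_pretrained(
    base_id, torch_dtype=torch.float16, device_map="auto"
)
# Attach the LoRA adapter on top of the base weights
model = PeftModel.from_pretrained(model, adapter_path)

prompt = "### Instruction:\nSay hello.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```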

Great, thanks for the info! I've been meaning to test out that repo - and also https://github.com/stochasticai/xturing/tree/main/examples/int4_finetuning - but haven't had a chance yet.

Can I ask how long it took, what HW you used etc?
