Not able to run the model with 4-bit quantization on a local RTX 4090 GPU (24GB VRAM)

#6
by miteshgarg - opened

Has anyone tried to run the model on a local RTX 4090 GPU (24GB VRAM) with 4-bit quantization (load_in_4bit=True)?
I am getting an error when running it this way.
Without quantization, the model takes more than 10 minutes to return a query result.
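Roughly, the loading code looks like this (a minimal sketch; the exact repo id `defog/sqlcoder-70b-alpha`, prompt, and generation settings here are placeholders, not the exact script from the post):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization via bitsandbytes
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
)

model_id = "defog/sqlcoder-70b-alpha"  # placeholder repo id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # lets accelerate place/offload layers
)

prompt = "-- Return the 10 most recent orders\nSELECT"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```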

Defog.ai org

Hi there, you will unfortunately not be able to run this on an RTX 4090. The model has 70B parameters, so even at 4-bit precision (0.5 bytes per parameter) you'll need a minimum of about 35GB of VRAM just to load the weights, and around 40GB for reasonable inference.
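A quick back-of-the-envelope calculation (a rough sketch of the arithmetic above, ignoring KV cache and overhead) shows why 24GB is not enough even at 4 bits:

```python
# Rough weight-memory estimate for a 70B-parameter model at 4-bit precision.
params = 70e9
bytes_per_param = 0.5          # 4 bits per weight
weights_gb = params * bytes_per_param / 1e9
print(f"~{weights_gb:.0f} GB just for the weights")  # ~35 GB > 24 GB on an RTX 4090
```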

We'd recommend using the quantized version of our 34B model instead: https://huggingface.co/defog/sqlcoder-34b-alpha. We'd also recommend running an AWQ version rather than loading it natively in 4 bits, as that tends to be a lot faster!
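Loading an AWQ checkpoint with transformers looks roughly like this (a minimal sketch; the repo id below is an assumed/illustrative AWQ conversion, and the `autoawq` package needs to be installed):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Hypothetical AWQ conversion of sqlcoder-34b-alpha; substitute a real AWQ repo id.
model_id = "TheBloke/sqlcoder-34b-alpha-AWQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",  # AWQ weights for a 34B model should fit in 24GB of VRAM
)

prompt = "-- List customers with more than 5 orders\nSELECT"  # placeholder prompt
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```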
