What is the difference between GPTQ and QLoRA?

#12
by Ichsan2895 - opened

The repo description says: "This repo contains an experimental GPTQ 4bit model for Falcon-40B-Instruct."
But I recently found a tutorial on a website that claimed this is the QLoRA version.

AFAIK, the QLoRA version uses the bitsandbytes library, activating bnb_4bit_quant_type='nf4' and load_in_4bit=True. It feels faster to me than the GPTQ version. I tested both on RunPod with an A6000 48GB VRAM.
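For reference, this is roughly the loading path described above, as a minimal sketch: it assumes the unquantised tiiuae/falcon-40b-instruct repo and the transformers BitsAndBytesConfig API.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Assumption: the unquantised base repo; swap in whichever model you are testing
model_id = "tiiuae/falcon-40b-instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,                      # quantise the fp16 weights to 4-bit at load time
    bnb_4bit_quant_type="nf4",              # the NF4 data type used by QLoRA
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
    trust_remote_code=True,                 # Falcon shipped custom modelling code at the time
)
```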

So, to wrap up my question: are they the same thing, and if not, what is the difference?


@Ichsan2895 sorry, I missed this question when it was first posted. Assuming you've not found the answer yet:

  1. No, this repo is not related to QLoRA in any way. Falcon 40B Instruct was not trained as a QLoRA.
  2. QLoRA does use bitsandbytes, yes. It uses it to load an unquantised model in 4-bit, and then it does training on that model.
  3. You can also use bitsandbytes to do 4-bit and 8-bit inference, as you mention. It is an automatic quantisation library that applies the quantisation 'live', rather than requiring the user to save a quantised version in advance.
  4. GPTQ is also a quantisation library, but it works by saving a quantised model and then loading that quantised model (see the sketch after this list). It can't quantise on the fly as you load the model. Generally GPTQ is much faster than bitsandbytes.
  5. But yes, for Falcon models only, bitsandbytes is faster than GPTQ. That's because there's a major performance problem with Falcon GPTQ at the moment. It is being looked at but there is no solution yet.
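To make the contrast in points 3 and 4 concrete, here is a minimal sketch of loading a pre-quantised checkpoint with the AutoGPTQ library. The repo name is an assumption used for illustration; substitute whichever GPTQ repo you are actually using.

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Assumption: repo name shown for illustration only
model_id = "TheBloke/falcon-40b-instruct-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_id)

# from_quantized() loads weights that were already quantised and saved in advance,
# unlike bitsandbytes, which quantises the full-precision weights while loading them
model = AutoGPTQForCausalLM.from_quantized(
    model_id,
    device="cuda:0",
    use_safetensors=True,
    trust_remote_code=True,
)
```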

Recently, after you posted, we also got Falcon GGML support, which is much faster than GPTQ and may be faster than bitsandbytes as well. I have a repo for falcon-40b-GGML which you could try using ctransformers (a Python library) or the LoLLMS-UI. Falcon GGML models are not yet supported in text-generation-webui.
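If you want to try the GGML route with ctransformers, a minimal sketch looks like this. The repo name and prompt are placeholders, and depending on your setup you may also need to pass a specific model_file for the quantisation variant you want.

```python
from ctransformers import AutoModelForCausalLM

# Assumption: repo name used for illustration; point this at the GGML repo you want
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/falcon-40b-instruct-GGML",
    model_type="falcon",   # tells ctransformers which GGML architecture to use
)

# The loaded model is callable directly on a prompt string
print(llm("Explain 4-bit quantisation in one sentence:", max_new_tokens=64))
```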

Hope that helps.
