How to quantise the Llama 2 70B model with AutoGPTQ

#11
by tonycloud

I want to quantise the Llama 2 70B model, and I am using AutoGPTQ, but I cannot get it to work.

Can you describe in detail how to quantise the Llama 2 70B model using AutoGPTQ?

First you need to update to AutoGPTQ 0.3.2, which I currently recommend installing by building from source:

pip3 uninstall -y auto-gptq
git clone https://github.com/PanQiWei/AutoGPTQ
cd AutoGPTQ
pip3 install .

There are no special steps for quantising Llama 2 70B compared to other models - as long as you're running AutoGPTQ 0.3.2 with Transformers 4.31.0 or later (which will be installed automatically when you update AutoGPTQ), it will work.
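
Once the install finishes, you can confirm the versions your environment actually picked up; it should report auto-gptq 0.3.2 and transformers 4.31.0 or later:

 pip3 show auto-gptq transformers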

You can use my AutoGPTQ wrapper script: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682#file-quant_autogptq-py

Example execution, to produce a 4-bit, 128g, act-order model:

 python3 quant_autogptq.py /path/to/Llama-2-70B /path/to/save/gptq wikitext --bits 4 --group_size 128 --desc_act 1 --damp 0.1 --dtype float16 --seqlen 4096 --num_samples 128 --use_fast 
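
Under the hood the script just drives the AutoGPTQ library. If you would rather see, or call, the API directly, here is a rough sketch of the equivalent steps. The paths are placeholders, the wikitext calibration sampling is simplified compared to what the script actually does, and dtype handling is omitted:

 import random
 import torch
 from datasets import load_dataset
 from transformers import AutoTokenizer
 from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

 model_dir = "/path/to/Llama-2-70B"   # placeholder, as in the command above
 out_dir = "/path/to/save/gptq"       # placeholder
 seqlen = 4096
 num_samples = 128

 # Roughly corresponds to --bits 4 --group_size 128 --desc_act 1 --damp 0.1
 quantize_config = BaseQuantizeConfig(
     bits=4,
     group_size=128,
     desc_act=True,
     damp_percent=0.1,
 )

 tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

 # Simplified wikitext calibration set: random windows of seqlen tokens
 data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
 enc = tokenizer("\n\n".join(data["text"]), return_tensors="pt")
 examples = []
 for _ in range(num_samples):
     start = random.randint(0, enc.input_ids.shape[1] - seqlen - 1)
     ids = enc.input_ids[:, start:start + seqlen]
     examples.append({"input_ids": ids, "attention_mask": torch.ones_like(ids)})

 # Loading and quantising the fp16 model is the part that needs the big GPU and ~200GB RAM
 model = AutoGPTQForCausalLM.from_pretrained(model_dir, quantize_config)
 model.quantize(examples)
 model.save_quantized(out_dir, use_safetensors=True)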

To quantise at sequence length 4096 (recommended) you will need a 48GB GPU, and at least 200GB of RAM to quantise and pack the model. Expect it to take 2-4 hours, depending on the speed of the system.

If you only have a 24GB GPU you can try --seqlen 2048 instead, which will also be quicker. But the quality of the quantisation won't be as good.
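
For example, the same invocation as above with the shorter sequence length:

 python3 quant_autogptq.py /path/to/Llama-2-70B /path/to/save/gptq wikitext --bits 4 --group_size 128 --desc_act 1 --damp 0.1 --dtype float16 --seqlen 2048 --num_samples 128 --use_fast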

OK, thank you very much. I will test it.
Thank you very much for your work.

I have obtained the 4-bit quantized model, and would like to know whether this model can be loaded by text-generation-inference.