I was literally just trying to do this and kept running short of VRAM by 1.5GB, thank you!


Do you know if it's possible to split VRAM usage between two GPUs with GPTQ's llama.py? Anyway, thanks again, this was proving to be very difficult to do.

Great, glad to help. In general I don't recommend using GPTQ-for-LLaMa to quantise. Use AutoGPTQ.

No, it's not. In both GPTQ-for-LLaMa and AutoGPTQ you can't really control the VRAM required for quantisation. In AutoGPTQ you can control where the model weights go, but by default they go to RAM, so moving some of the weights onto a second GPU won't help you avoid running out of VRAM; it might just make it a bit quicker.

But the one time I tried putting model weights on a second GPU while quantising with AutoGPTQ on the first - which I did on a 2 x A100 80GB system with max_memory = { 0: 0, 1: '79GiB', 'cpu': '500GiB' } - it failed with a weird error about GPU 0 not being initialised, so I'm not sure it can even work. But again, that wouldn't have helped with VRAM issues - I was doing it to try to reduce the RAM requirement and to speed the process up (that was quantising BLOOMChat and BLOOMZ, 176B models). But it didn't work.

I used to quantise 65B models on a single 4090, so only 24GB VRAM should be required. You also need plenty of RAM - I think around 160GB RAM is needed for 65B. And if you have less than 24GB VRAM then I doubt you can do it, even with multiple cards.

There are two tricks you can use with AutoGPTQ to try to minimise VRAM usage. I'm not sure they'll help, because again if you don't have a 24GB card I don't think you can do it. But anyway, they are:

  1. Add cache_examples_on_gpu=False to the .quantize() call. This slightly reduces VRAM usage by not caching the quantisation samples on the GPU.
  2. Set the environment variable PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32, which stores data in smaller chunks on the GPU and can allow more to fit (both tricks are shown in the sketch after this list).
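
To make that concrete, here's a minimal sketch of where those two settings go in an AutoGPTQ run - the paths, calibration texts and sample count below are placeholders, not the exact settings from my scripts:

```python
# Minimal sketch of an AutoGPTQ quantisation run using the two tricks above.
# Paths, calibration texts and sample counts are placeholders only.
import os

# Trick 2: tell the CUDA caching allocator to use smaller blocks.
# Set this before any GPU memory is allocated (or set it on the command line).
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "max_split_size_mb:32"

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

pretrained_dir = "/path/to/llama-65b"       # placeholder
quantized_dir = "/path/to/llama-65b-gptq"   # placeholder

tokenizer = AutoTokenizer.from_pretrained(pretrained_dir, use_fast=True)

# Calibration data: a real run would use ~128 samples from wikitext, C4, etc.
texts = ["First calibration passage ...", "Second calibration passage ..."]
examples = [tokenizer(t, return_tensors="pt") for t in texts]

quantize_config = BaseQuantizeConfig(
    bits=4,
    group_size=128,
    desc_act=True,
    damp_percent=0.01,
)

model = AutoGPTQForCausalLM.from_pretrained(pretrained_dir, quantize_config)

# Trick 1: don't cache the calibration examples on the GPU.
model.quantize(examples, cache_examples_on_gpu=False)

model.save_quantized(quantized_dir, use_safetensors=True)
```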

Thank you so much!!! Wow, you have already been so valuable to the community with all the models - I really didn't expect such a detailed response. Thank you! I'm so excited to try out your suggestions today. I have 24GB of VRAM and 128GB of CPU RAM, and I just learned how to do swap files with WSL (I'm on a Windows machine), so I can use NVMe drives for that extra disk-backed RAM.

<3

Ah, then yeah, it's going to be RAM that's your issue, but yes, if you define a large swap space then it should get it done eventually.

I'd use AutoGPTQ, not llama.py, as mentioned. Here's the script I use to quickly make AutoGPTQ quants using the wikitext dataset: https://gist.github.com/TheBloke/b47c50a70dd4fe653f64a12928286682
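
The wikitext part boils down to tokenising the corpus once and slicing it into fixed-length calibration samples - roughly like the sketch below. This isn't the gist itself; the split name, sample count and sequence length are assumptions for illustration.

```python
# Rough sketch of building wikitext calibration examples for AutoGPTQ.
# Split name, sample count and sequence length are illustrative choices,
# not necessarily what the linked gist uses.
import random

import torch
from datasets import load_dataset

def get_wikitext_examples(tokenizer, n_samples=128, seq_len=2048, seed=0):
    data = load_dataset("wikitext", "wikitext-2-raw-v1", split="train")
    # Tokenise the whole corpus once, then cut random windows out of it.
    enc = tokenizer("\n\n".join(data["text"]), return_tensors="pt")

    random.seed(seed)
    examples = []
    for _ in range(n_samples):
        start = random.randint(0, enc.input_ids.shape[1] - seq_len - 1)
        input_ids = enc.input_ids[:, start:start + seq_len]
        examples.append({
            "input_ids": input_ids,
            "attention_mask": torch.ones_like(input_ids),
        })
    return examples

# The returned examples are what gets passed to model.quantize(...) as in
# the sketch further up the thread.
```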

Frick!! Thank you so much for the information!

I am currently quantizing the model using your .py file and AutoGPTQ!

I also got 65B quantization working with GPTQ-for-LLaMa.

You have already provided so much information, but if you come across this post again and have the time: I'm curious why one is preferred over the other? I'm going to test both of these quantized models when they're finished and do some comparisons between the two.

For anyone who comes across this post in the future: I am running AutoGPTQ using WSL on Windows 10, and I did the git (source) install instead of the pip install.

This is what I'm running with TheBloke's .py file that I named autogptq.py:

PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:32 CUDA_VISIBLE_DEVICES=0,1 python autogptq.py ~/AutoGPTQ/auto_gptq/quantization/llama-65b ~/AutoGPTQ/auto_gptq/quantization/llama-65b c4 --bits 4 --group_size 128 --desc_act 1 --damp 0.01 --dtype 'float16' --use_triton

Woot the model finished and works! Time to do more testing.
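
For anyone following along, a quick sanity check of a finished quant looks roughly like this (a sketch; the path and prompt are placeholders, and the tokenizer is assumed to have been saved alongside the quant):

```python
# Quick smoke test of a finished AutoGPTQ quant.
# The model path and prompt are placeholders.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

quantized_dir = "/path/to/llama-65b-gptq"  # placeholder

# If the tokenizer wasn't saved next to the quant, load it from the
# original (unquantised) model directory instead.
tokenizer = AutoTokenizer.from_pretrained(quantized_dir, use_fast=True)

model = AutoGPTQForCausalLM.from_quantized(
    quantized_dir,
    device="cuda:0",
    use_safetensors=True,  # set to match how the quant was saved
    use_triton=False,
)

prompt = "Llamas are"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```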
