When loading the 8-bit 128g model onto an A40 (48GB VRAM), an error occurs: CUDA out of memory.

#2
by ishotoli - opened

Using AutoGPTQ to load the 8-bit 128g model in text-generation-webui, with the trust-remote-code and disable_exllama options enabled, results in the following error message:
Traceback (most recent call last):
File "/home/kaiz/workshop/text-generation-webui/modules/ui_model_menu.py", line 210, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "/home/kaiz/workshop/text-generation-webui/modules/models.py", line 85, in load_model
output = load_func_map[loader](model_name)
File "/home/kaiz/workshop/text-generation-webui/modules/models.py", line 337, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
File "/home/kaiz/workshop/text-generation-webui/modules/AutoGPTQ_loader.py", line 58, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
File "/home/kaiz/anaconda3/envs/aigc/lib/python3.10/site-packages/auto_gptq/modeling/auto.py", line 108, in from_quantized
return quant_func(
File "/home/kaiz/anaconda3/envs/aigc/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 902, in from_quantized
cls.fused_attn_module_type.inject_to_model(
File "/home/kaiz/anaconda3/envs/aigc/lib/python3.10/site-packages/auto_gptq/nn_modules/fused_llama_attn.py", line 154, in inject_to_model
qweights = torch.cat([q_proj.qweight, k_proj.qweight, v_proj.qweight], dim=1)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 64.00 MiB. GPU 0 has a total capacty of 44.40 GiB of which 19.31 MiB is free. Including non-PyTorch memory, this process has 44.38 GiB memory in use. Of the allocated memory 42.91 GiB is allocated by PyTorch, and 698.05 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

AutoGPTQ ver: 0.4.2
CUDA ver: 12.1
NVIDIA Driver ver: 530.41.03

The 34B 4-bit model loads successfully, and Llama 2 70B 4-bit also loads successfully. I'm not sure what caused the failure to load the 34B 8-bit model; in theory, it should fit in 48GB of VRAM.
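For context, here is a rough back-of-the-envelope estimate of the quantized weight footprint. The parameter count and the group overhead factor are assumptions for illustration, not exact figures for this model:

# Rough VRAM estimate for the quantized weights alone (assumption-laden)
params = 34e9                           # ~34B parameters
weight_bytes = params * 1.0             # 8-bit quantization ~ 1 byte per weight
group_overhead = weight_bytes * 0.05    # rough allowance for 128g scales/zeros
total_gib = (weight_bytes + group_overhead) / 2**30
print(f"weights ~ {total_gib:.1f} GiB") # ~33 GiB before any KV cache, activations,
                                        # or the temporary copies made when fused
                                        # attention concatenates the q/k/v qweights

That leaves only around 15 GiB of headroom on a 48GB card for everything else, so the margin is thinner than it first appears.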

I've not tested with 8-bit GPTQs much.

One potential issue is that 8-bit models can't use the ExLlama kernels, and the ExLlama kernels use less VRAM. So it's possible that 34B 8-bit just needs more VRAM than 48GB.

You could try the 8-bit group_size None quant instead; that will use less VRAM.
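For reference, a minimal sketch of loading such a quant directly with AutoGPTQ 0.4.2, mirroring the webui options described above. The model path and branch name are placeholders, and forwarding a revision keyword through from_quantized to the Hub download is an assumption; check the repo's branches for the actual no-group-size 8-bit quant:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/34B-GPTQ",                        # placeholder local path or repo id
    revision="gptq-8bit--1g-actorder_True",    # hypothetical branch name for a
                                               # group_size None 8-bit quant
    device_map="auto",                         # assumed placement on the single A40
    use_safetensors=True,
    trust_remote_code=True,                    # as enabled in the webui above
    disable_exllama=True,                      # ExLlama kernels do not support 8-bit GPTQ
)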

@ishotoli

This is the 200K context length model; my best guess is that the fused attention injection shown in your error message is allocating too much memory. Maybe try passing --no_inject_fused_attention as a command-line argument.
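At the AutoGPTQ API level, the webui's --no_inject_fused_attention flag corresponds to inject_fused_attention=False, which skips the fused q/k/v concatenation step that raised the OOM in the traceback above. A minimal sketch, with the same placeholder path as before:

from auto_gptq import AutoGPTQForCausalLM

model = AutoGPTQForCausalLM.from_quantized(
    "path/to/34B-8bit-128g-GPTQ",    # placeholder path or repo id
    device_map="auto",
    trust_remote_code=True,
    disable_exllama=True,
    # Equivalent of --no_inject_fused_attention: skip the injection step that
    # concatenates q/k/v qweights and triggered the OOM during loading.
    inject_fused_attention=False,
)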

It works! Thx :)

ishotoli changed discussion status to closed
