The model does not load in text-generation-webui: out-of-memory error

#6
by DanekBigLike - opened

2023-10-07 00:24:03 INFO:Loading TheBloke_WizardCoder-Python-34B-V1.0-GPTQ_gptq-4bit-64g-actorder_True...
2023-10-07 00:24:03 INFO:The AutoGPTQ params are: {'model_basename': 'model', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': {0: '24500MiB', 'cpu': '32600MiB'}, 'quantize_config': None, 'use_cuda_fp16': True, 'disable_exllama': False}
2023-10-07 00:24:35 ERROR:Failed to load the model.
Traceback (most recent call last):
File "E:\ai\ruai\saiga\text-generation-webui\modules\ui_model_menu.py", line 194, in load_model_wrapper
shared.model, shared.tokenizer = load_model(shared.model_name, loader)
File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 75, in load_model
output = load_func_map[loader](model_name)
File "E:\ai\ruai\saiga\text-generation-webui\modules\models.py", line 316, in AutoGPTQ_loader
return modules.AutoGPTQ_loader.load_quantized(model_name)
File "E:\ai\ruai\saiga\text-generation-webui\modules\AutoGPTQ_loader.py", line 57, in load_quantized
model = AutoGPTQForCausalLM.from_quantized(path_to_model, **params)
File "C:\Users\remot.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling\auto.py", line 108, in from_quantized
return quant_func(
File "C:\Users\remot.conda\envs\textgen2\lib\site-packages\auto_gptq\modeling_base.py", line 875, in from_quantized
accelerate.utils.modeling.load_checkpoint_in_model(
File "C:\Users\remot.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1335, in load_checkpoint_in_model
checkpoint = load_state_dict(checkpoint_file, device_map=device_map)
File "C:\Users\remot.conda\envs\textgen2\lib\site-packages\accelerate\utils\modeling.py", line 1164, in load_state_dict
return safe_load_file(checkpoint_file, device=list(device_map.values())[0])
File "C:\Users\remot.conda\envs\textgen2\lib\site-packages\safetensors\torch.py", line 311, in load_file
result[k] = f.get_tensor(k)
RuntimeError: [enforce fail at ..\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 90177536 bytes.

Initially it could not allocate about 120 MB. I increased the swap file, and now it fails on about 90 MB. I think this may not actually be a memory issue, but maybe I'm wrong.

I expanded the swap file from 32 GB to 64 GB.

My system specs:
OS: Windows 10 (Miniconda)
GPU: RTX 3090 (24 GB VRAM)
RAM: 32 GB
Swap file: 70 GB total (drive C: auto-managed, drive D: 64 GB)
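
If you want to check whether available memory (RAM plus pagefile) is really exhausted at the moment the allocation fails, a quick diagnostic like the sketch below can help. This is a hypothetical helper script, not part of the webui; it only assumes psutil is installed (pip install psutil).

# Hypothetical diagnostic script: run it right before loading the model
# to see how much physical RAM and swap are still free on the machine.
import psutil

vm = psutil.virtual_memory()
sm = psutil.swap_memory()
print(f"available RAM: {vm.available / 2**20:.0f} MiB")
print(f"free swap:     {sm.free / 2**20:.0f} MiB")

# The failing allocation was 90177536 bytes (~86 MiB); if the numbers
# printed above are near zero, the error really is memory exhaustion
# rather than a loader bug.
print(f"failing allocation: {90177536 / 2**20:.1f} MiB")

Note that on Windows the hard limit is the commit limit (RAM plus all pagefiles), so every other running process counts against the same budget as the model loader.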

I am using Linux and loading with ExLlamaHF. It takes roughly 21 GB of VRAM at a speed of 16-17 tokens/s.

@DanekBigLike
I would recommend using ExLlama (for GPTQ) or ExLlama v2 (for the EXL2 quant format, which is slightly higher quality and faster than GPTQ), since both of them use less VRAM and are much, much faster than AutoGPTQ, just like donymorph said.
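
For reference, running a quantized model with ExLlamaV2 outside the webui looks roughly like the sketch below. It follows the exllamav2 example scripts, but the model path is a placeholder and class names may differ between library versions.

# Minimal ExLlamaV2 inference sketch (pip install exllamav2).
# The model directory is a placeholder: point it at a local GPTQ or EXL2 folder.
from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache, ExLlamaV2Tokenizer
from exllamav2.generator import ExLlamaV2BaseGenerator, ExLlamaV2Sampler

config = ExLlamaV2Config()
config.model_dir = "models/TheBloke_WizardCoder-Python-34B-V1.0-GPTQ"  # placeholder path
config.prepare()

model = ExLlamaV2(config)
cache = ExLlamaV2Cache(model, lazy=True)
model.load_autosplit(cache)  # fill available VRAM layer by layer

tokenizer = ExLlamaV2Tokenizer(config)
generator = ExLlamaV2BaseGenerator(model, cache, tokenizer)

settings = ExLlamaV2Sampler.Settings()
settings.temperature = 0.8
settings.top_p = 0.9

print(generator.generate_simple("def quicksort(arr):", settings, num_tokens=200))

Inside text-generation-webui the equivalent is simply picking the ExLlama or ExLlamav2 loader in the Model tab instead of AutoGPTQ.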

@donymorph any reason you are using ExLlamaHF? ExLlama is considerably faster; the only other difference is that ExLlamaHF supports a few more samplers.
