
AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'

#17
by yramshev - opened

I loaded the model successfully; however, prompts generate the following error: AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'

Traceback (most recent call last):
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 259, in generate_with_callback
shared.model.generate(**kwargs)
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 135, in forward
if idx <= (self.preload - 1):
File "D:\Documents\DEVELOPMENT\LanguageModels\oobabooga_windows\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'

Same here. Need a solution =/

Found a solution: change the GPTQ repository to the old-cuda branch and install it again.

[screenshot: image.png]

If you are using oobabooga/one-click-installers, you can change the GPTQ branch in the webui.py file from "-b cuda" to "-b old-cuda", then run the update script.
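
If you would rather switch the checkout by hand instead of editing webui.py, a minimal sketch of the equivalent steps is below. The repository URL and paths are assumptions based on a default install; adjust them to wherever your GPTQ-for-LLaMa checkout actually lives.

# Hypothetical helper: swap the existing GPTQ-for-LLaMa checkout for the old-cuda branch.
# Assumes git is on PATH; repo_dir and the upstream URL are examples, not the only valid values.
import pathlib
import shutil
import subprocess

repo_dir = pathlib.Path("text-generation-webui/repositories/GPTQ-for-LLaMa")

if repo_dir.exists():
    shutil.rmtree(repo_dir)  # remove the current (cuda-branch) checkout

subprocess.run(
    ["git", "clone", "-b", "old-cuda",
     "https://github.com/qwopqwop200/GPTQ-for-LLaMa",  # assumed upstream; use your fork if different
     str(repo_dir)],
    check=True,
)
# Afterwards, re-run the update script so the CUDA extension is rebuilt against this branch.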

I've tried "-b old-cuda" and it still doesn't work. I did get it to work once, but after that I get this error:

Traceback (most recent call last):
File "C:\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "C:\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 263, in generate_with_callback
shared.model.generate(**kwargs)
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1572, in generate
return self.sample(
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2619, in sample
outputs = self(
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 688, in forward
outputs = self.model(
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl return forward_call(*args, **kwargs)
File "C:\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 135, in forward
if idx <= (self.preload - 1):
File "C:\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'
Output generated in 0.43 seconds (0.00 tokens/s, 0 tokens, context 1397, seed 2017233876)

You should update to the latest version of text-generation-webui, and then launch it. That is enough to get the model working.

These errors indicate you're either running a really old version of textgen, or else you're still trying to use GPTQ-for-LLaMa, e.g. with the --wbits 4 --model_type llama --groupsize 128 arguments, or by setting bits = 4, model_type = llama in the UI.

None of that is needed any more, as AutoGPTQ is now the default and it will auto-load the model.
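
For anyone curious what that looks like outside the webui, here is a minimal AutoGPTQ sketch. The local path and basename are copied from the log later in this thread and are examples only; the prompt format is just an illustration.

# Minimal sketch: load this GPTQ model with AutoGPTQ directly.
# Assumes auto-gptq and transformers are installed, the model is downloaded locally,
# and a CUDA GPU with enough VRAM is available.
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

model_dir = "models/TheBloke_stable-vicuna-13B-GPTQ"  # example local path

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="stable-vicuna-13b-gptq-4bit.compat.no-act-order",
    use_safetensors=True,
    device="cuda:0",
    use_triton=False,
)

inputs = tokenizer("### Human: Hello!\n### Assistant:", return_tensors="pt").to("cuda:0")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))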

I hadn't updated the README of this model to reflect that, so it was still telling people to set GPTQ params. I've edited that now.

I updated Ooba today and I still get this error. Loading the model works, but after sending an input, this error comes up.

2023-06-18 00:28:58 INFO:Loaded the model in 3.20 seconds.

Traceback (most recent call last):
File "F:\Programme\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 74, in gentask
ret = self.mfunc(callback=_callback, *args, **self.kwargs)
File "F:\Programme\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 263, in generate_with_callback
shared.model.generate(**kwargs)
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1572, in generate
return self.sample(
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2619, in sample
outputs = self(
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 688, in forward
outputs = self.model(
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "F:\Programme\oobabooga_windows\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 135, in forward
if idx <= (self.preload - 1):
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1614, in getattr
raise AttributeError("'{}' object has no attribute '{}'".format(
AttributeError: 'Offload_LlamaModel' object has no attribute 'preload'
Output generated in 0.27 seconds (0.00 tokens/s, 0 tokens, context 1074, seed 2123655593)

Somehow it's still trying to load GPTQ-for-LLaMA. Show a screenshot of your Models tab

Somehow it's still trying to load GPTQ-for-LLaMA. Show a screenshot of your Models tab

OK, here ... thanks for looking at it.

[screenshot: Zwischenablagebild (3).png]

Sorry I meant the whole of the Models tab, not just the Model dropdown. Please select this model in the Model dropdown, and then show me a full screenshot of everything you see on screen on that tab.

Ah, OK, no problem, here it is:

[screenshot: Zwischenablagebild.png]

In the console this comes up:
To create a public link, set share=True in launch().
2023-06-21 17:21:37 INFO:Loading TheBloke_stable-vicuna-13B-GPTQ...
2023-06-21 17:21:37 INFO:The AutoGPTQ params are: {'model_basename': 'stable-vicuna-13b-gptq-4bit.compat.no-act-order', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
2023-06-21 17:21:38 WARNING:The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
2023-06-21 17:21:38 WARNING:The safetensors archive passed at models\TheBloke_stable-vicuna-13B-GPTQ\stable-vicuna-13b-gptq-4bit.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.

Maybe I should mention that I have an 8GB VRAM card. I'm not sure if this is relevant for this error.

That's a different error now!

Now you are simply running out of VRAM. And yes, an 8GB card is very relevant. 8GB is not enough to load a 13B model; you need 12GB minimum. 8GB isn't even enough to load a 7B model using AutoGPTQ.
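
As a rough, back-of-the-envelope illustration of why the file size alone is not the whole story (all numbers below are approximations for a LLaMA-13B-class model, not measurements):

# Rough VRAM estimate for a 4-bit 13B model; the figures are illustrative assumptions.
params = 13e9                                # ~13 billion weights
weights_gib = params * 0.5 / 1024**3         # 4-bit = 0.5 bytes per weight     -> ~6.1 GiB
# KV cache for LLaMA-13B: 40 layers * 2 (K and V) * 5120 hidden size * 2 bytes (fp16) per token
kv_per_token = 40 * 2 * 5120 * 2
kv_gib = kv_per_token * 2048 / 1024**3       # at the full 2048-token context   -> ~1.6 GiB
overhead_gib = 1.0                           # CUDA context, activations, fragmentation (a guess)
print(f"~{weights_gib + kv_gib + overhead_gib:.1f} GiB needed vs 8 GiB available")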

Two options:

  1. Use a 7B GPTQ model, with the new ExLlama loader. This is now supported under text-generation-webui, though you need to install ExLlama manually: https://github.com/turboderp/exllama

  2. Or, try a 13B GGML instead. With GGML 13B, e.g. a q4_K_M.bin file, you can use the llama.cpp loader. By default it is only for CPU inference, but you can enable GPU acceleration by compiling llama-cpp-python with CUDA support, following these instructions: https://github.com/abetlen/llama-cpp-python#installation-with-openblas--cublas--clblast--metal

First you will need to get CUDA toolkit installed.

Then for Windows, the commands are:

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

Once this is done successfully, you can:

  • Download a GGML model file, like stable-vicuna.ggmlv3.q4_K_M.bin
  • In text-gen, choose the llama.cpp loader
  • Set "n-gpu-layers" to 40 (if this gives another CUDA out of memory error, try 35 instead)
  • Set Threads to 8
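
If you want to sanity-check the GGML route outside the webui, a minimal llama-cpp-python sketch with the same settings looks roughly like this; the filename and prompt are examples, so point model_path at whatever GGML file you actually downloaded.

# Minimal sketch: GGML inference via llama-cpp-python compiled with cuBLAS (see commands above).
from llama_cpp import Llama

llm = Llama(
    model_path="models/stable-vicuna-13B.ggmlv3.q4_K_M.bin",  # example filename
    n_gpu_layers=40,   # lower this to ~35 if you hit a CUDA out-of-memory error
    n_threads=8,
)

out = llm("### Human: Hello!\n### Assistant:", max_tokens=64)
print(out["choices"][0]["text"])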

Now you are simply running out of VRAM.

Ah, you're right. I hadn't looked at this, sorry. 🤪
I tried using the CPU, and then it loads. But after inputting text I get:

To create a public link, set share=True in launch().
2023-06-21 18:04:08 INFO:Loading TheBloke_stable-vicuna-13B-GPTQ...
2023-06-21 18:04:09 INFO:The AutoGPTQ params are: {'model_basename': 'stable-vicuna-13b-gptq-4bit.compat.no-act-order', 'device': 'cpu', 'use_triton': False, 'inject_fused_attention': True, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None}
2023-06-21 18:04:10 WARNING:The model weights are not tied. Please use the tie_weights method before using the infer_auto_device function.
2023-06-21 18:04:10 WARNING:The safetensors archive passed at models\TheBloke_stable-vicuna-13B-GPTQ\stable-vicuna-13b-gptq-4bit.compat.no-act-order.safetensors does not contain metadata. Make sure to save your model with the save_pretrained method. Defaulting to 'pt' metadata.
2023-06-21 18:05:52 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-06-21 18:05:53 INFO:Loaded the model in 104.87 seconds.
============================================================
loading character assistant
Traceback (most recent call last):
File "F:\Programme\oobabooga_windows\text-generation-webui\modules\callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "F:\Programme\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 263, in generate_with_callback
shared.model.generate(**kwargs)
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 422, in generate
with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):
File "F:\Programme\oobabooga_windows\installer_files\env\lib\site-packages\auto_gptq\modeling_base.py", line 411, in device
device = [d for d in self.hf_device_map.values() if d not in {'cpu', 'disk'}][0]
IndexError: list index out of range
Output generated in 0.33 seconds (0.00 tokens/s, 0 tokens, context 42, seed 540771681)

8GB is not enough to load a 13B model, you need 12GB minimum. 8GB isn't even enough to load a 7B model using AutoGPTQ.

Well, your model is 6.75GB, so I thought this would be OK. I always thought it was about the file size.
I still haven't found out what models can/should be used depending on VRAM. I currently still don't understand why I can use the 12.4GB Facebook model, or the 11.2GB Pygmalion model from AlekseyKorshuk, or even the 12.5GB Pygmalion model from Imablank, but not your much smaller model. 🤔
May I ask you for a brief guide on what to look for?

And many thanks for the detailed information, I will look into it. 👍🙂
