Error using ooba-gooba

#6
by blueisbest - opened

Hey! I don't really know what I did wrong, so I'm just going to post the error I got. I also followed this tutorial: https://youtu.be/nVC9D9fRyNU

Traceback (most recent call last):
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

Watch the video again and follow the directions more carefully.

Considering your title, "Error using ooba-gooba", your key clue should be:
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

@blueisbest did you manage to fix the issue? I'm getting the same error.

It seems you're using an AMD CPU, which has limited support. Yep, we've got the same problem. Sadly, your only way of using this is the LlamaCPP method.

If anyone here has any other methods, please consider sharing. Thanks.

Why would having an AMD CPU be an issue? I haven't heard about that before.

You need to update your start-webui.bat file so that the "call python server.py" line has the arguments --chat --wbits 4 --groupsize 128.
That fixed it for me.
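For reference, the edited call line in start-webui.bat ends up looking something like the line below (the rest of the file may differ depending on your installer version):

call python server.py --chat --wbits 4 --groupsize 128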

I am on an AMD CPU too, and I'm getting this error:
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

I get the same error on Intel Xeon Cascade Lake P-8259L / NVIDIA T4 (Ubuntu 20.04)

I have the same issue. Something seems wrong with the installer: if you select CPU during installation, it skips installing GPTQ, but once you run start-webui it still tries to load GPTQ.

I fixed the issue by changing "!python server.py --chat --wbits 4 --groupsize 128 --auto-devices" to "!python server.py --chat --auto-devices", but then I got another error: "FileNotFoundError: [Errno 2] No such file or directory: 'models/gpt-x-alpaca-13b-native-4bit-128g/pytorch_model-00001-of-00006.bin'". Not sure if the situation is better or worse.

Same problem here.

Also passed --wbits 4 --groupsize 128 to server.py

I get this error after following https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model: "Could not find the quantized model in .pt or .safetensors format, exiting..." Any ideas?

Either you are trying to load a non-quantized model, in which case remove --wbits 4 --groupsize 128, or you didn't download the quantized model into the models folder.
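As a rough sanity check, a correctly downloaded quantized model should leave you with a layout like the sketch below (the names are just an example, based on the Vicuna model mentioned later in this thread; the loader looks for a .pt or .safetensors file):

models\
  anon8231489123_vicuna-13b-GPTQ-4bit-128g\
    vicuna-13b-4bit-128g.safetensors
    config.json, tokenizer files, etc.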

Fixed it by following step 1 from here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model

I am getting "OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root."

Didn't you need a CUDA GPU for this to work?

I am using a CUDA GPU.

With a GPU it works fine; you start to have problems when you try to use it with a CPU.

Oh sorry, I haven't tried CPU yet. I'm trying to get it working well on an AWS GPU before I try to reduce cost.

import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

Having the exact same issue

I have the exact same problem and I'm using an AMD CPU, so my concern is whether it's fixable. Should I wait for a fix or delete everything I've downloaded?

I got this error too, please help:

CUDA SETUP: Loading binary E:\vicuna-chatgpt4\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll...
E:\vicuna-chatgpt4\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
Press any key to continue . . .

Trying on M1:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "/Users/aryasarukkai/text-generation-webui/server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/Users/aryasarukkai/text-generation-webui/modules/models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/Users/aryasarukkai/text-generation-webui/modules/GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

I've tried all the solutions in this thread so far, but no luck unfortunately.

To solve the "ModuleNotFoundError: No module named 'llama_inference_offload'" problem I followed the advice of @synthetisoft and did:

cd \oobabooga-windows\text-generation-webui
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda

It worked, I got past that stage.


Now I get stuck at this line: "storage = cls(wrap_storage=untyped_storage)"

I think I don't have enough memory, or not enough free memory; I have roughly 7 GB free.
Memory consumption spikes to 100% and the script crashes.

Here's my info up to that point:


UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\anon8231489123_vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Loading model ...

[
3 times the same warning for files storage.py:899, _utils.py:776 and torch.py:99:
UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
]
storage = cls(wrap_storage=untyped_storage)
Press any key to continue . . . [That's where the script crashes]


To the line that starts the web UI in start-webui.bat:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128

I tried adding "--pre_layer 20" (I also tried --pre_layer 5, 50, 100, and 1000), with no success. Same result.
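For clarity, the edited line I was testing looked something like this (20 being just one of the --pre_layer values I tried):

call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --pre_layer 20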


My specs:

Windows 11, 16 GB memory, no GPU (to speak of).

CPU

Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz

Base speed:	2.90 GHz
Sockets:	1
Cores:	6
Logical processors:	12
Virtualization:	Enabled
L1 cache:	384 KB
L2 cache:	1.5 MB
L3 cache:	12.0 MB

Memory

16.0 GB

Speed:	2666 MHz
Slots used:	2 of 4
Form factor:	DIMM
Hardware reserved:	151 MB

GPU 0 (on-board)

Intel(R) UHD Graphics 630

Driver version:	30.0.101.1273
Driver date:	14/01/2022
DirectX version:	12 (FL 12.1)
Physical location:	PCI bus 0, device 2, function 0

Utilization	1%
Dedicated GPU memory	
Shared GPU memory	0.4/7.9 GB
GPU Memory	0.4/7.9 GB

I tried changing the flag back to --cai-chat; now I get a different error saying I don't have enough memory:


UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\anon8231489123_vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Traceback (most recent call last):
[... python files with their code lines]
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 35389440 bytes.
Press any key to continue . . .

OK, so I tried bumping up my virtual memory and it seems to have helped. I now get past THIS memory issue, but I get an error saying that PyTorch was installed without conda support, or the other way around... This is as far as I've gotten for now. Going back to my real life :-)

I've been trying to fix this problem for days now, and here's what I've learned so far:

  1. You got the error "import llama_inference_offload / ModuleNotFoundError: No module named 'llama_inference_offload'" because you're missing the repositories folder and the GPTQ-for-LLaMa folder inside it. You can follow @synthetisoft's step 1 above, or try this fix if it still doesn't work: https://github.com/oobabooga/text-generation-webui/issues/416#issuecomment-1475078571

  2. You also get this error if you installed with CPU in mind; in that case the repositories folder will be missing, because it's only needed for the GPU (CUDA) models. Another cause is trying to load a GPU model without GPTQ-for-LLaMa and the CUDA packages installed. You can recognize CPU models by their "ggml-" prefix.

  3. I tried to fix everything by following all the steps above, but I can't make it work with a CPU, so I came to the conclusion that it only works with an NVIDIA GPU.

       + Solution:
    
  • For those of you trying to use it with a CPU, I have good news: there's an alternative and it's very simple. It's called "Koboldcpp"; it's like llama.cpp but with the Kobold web UI. You get all the features oobabooga has to offer if you don't mind learning how to use the Kobold web UI.

  • For those of you trying to use it with a GPU: sorry, this only works with a CPU for now.

       + Installation:
    
  1. Go to " https://github.com/LostRuins/koboldcpp " you can read the description if you want.
  2. Scroll down to Usage you will see the blue Download link click on it.
  3. You can read the description of how to use it and click download the koboldcpp.exe
  4. After that you can download the CPU model of the GPT x ALPACA model here: https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/tree/main/gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g .It will take a while since it's about 10GB.
  5. Just drag and drop the model or manually search for the ggml model yourself this work for every CPU model. Next wait until it finished loading the model and copy the http://localhost:5001
    and paste it on your browser.
  6. you can find out more about koboldcpp and how to use it here: https://www.reddit.com/r/LocalLLaMA/comments/12cfnqk/koboldcpp_combining_all_the_various_ggmlcpp_cpu/
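If you'd rather use the command line than drag and drop, something like the line below should also work. I'm assuming your build accepts a --port flag and the model path as an argument; run koboldcpp.exe --help to confirm the exact options, and replace the placeholder path with the ggml .bin file you downloaded:

koboldcpp.exe --port 5001 path\to\your-ggml-model.bin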

That's all for now, hope this helps.

The script has been updated for CPU usage. Run 'iex (irm vicuna.tb.ag)' in PowerShell (in the directory you want to install the UI in), and the first question will let you install just the updated CPU version of the model. I first deleted the oobabooga directory and ran the script. It loaded a lot into memory and the page file, but it works pretty well on my i5 with 16 GB and lots of Chrome tabs open... Great work.

See this video:
https://youtu.be/d4dk_7FptXk

I tried it, and it also seems to work with the GPT4 x Alpaca CPU model. But it uses 20 GB of my 32 GB of RAM and only manages to generate 60 tokens in 5 minutes. In Koboldcpp I can generate 500 tokens in only 8 minutes, and it only uses 12 GB of my RAM. I don't know how it manages to use 20 GB of my RAM and still only generate 0.17 tokens/s. I guess I'll stick with Koboldcpp. But what about you, did you get faster generation when you used the Vicuna model?

I'm not sure how to measure the token generation rate. It takes a bit of time for the model to start responding the first time it is loaded, but subsequent interactions are faster. Slower than GPT-4, but faster than I can manage to read and understand, meaning it types faster than my in-depth reading speed. I guess if I try I can read faster than it types, or perhaps I'm just a slow reader :P English isn't my first language. Again, my specs are listed above, but it's a simple i5, 16 GB RAM, 240 GB SSD, Windows 11 with an auto-managed page file. It works.

Can you test with max context size? A character or chat history that uses the whole 2048-token context.

I've done some tests on oobabooga with and without character context, on 2 models.

  • Vicuna 13B without character context: vicuna test1.png; with character context: vicuna test2.png

  • GPT4xAlpaca 13B without character context: Gpt4xalpaca test1.png; with character context: Gpt4xalpaca test2.png

  • My CPU is a Ryzen 5 4600H:

  • 6 cores
  • 12 threads
  • Base clock 3.0 GHz
    And I have 32 GB of RAM.

PS: I don't have any character context that reaches 2048 tokens, but you can imagine it takes longer the more context you have. In my experiments Koboldcpp seems to process context and generate faster than oobabooga, but oobabooga seems to give slightly better responses and doesn't cut off the character's output like Koboldcpp does.

@Maruno

For ggml-vicuna-13b-4bit-rev1.bin:

Output generated in 429.89 seconds (0.47 tokens/s, 200 tokens, context 1317, seed 409617372)
Output generated in 170.12 seconds (1.18 tokens/s, 200 tokens, context 435, seed 22146058)
Output generated in 247.94 seconds (0.81 tokens/s, 200 tokens, context 633, seed 1962886833)
Output generated in 632.09 seconds (1.58 tokens/s, 1000 tokens, context 647, seed 2070444208)
Output generated in 572.93 seconds (0.79 tokens/s, 450 tokens, context 1274, seed 926636926)

I tried going to max context but it crashed on me for now. I was also doing other things and my PC had many programs running while tokenization took place, so the numbers aren't accurate.
llama_tokenize: too many tokens
raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')

Well, if you want to use oobabooga and have only a CPU, it will get slower the more context it has, so you can only chat with it for a certain amount of time; in short, it has a short memory. If you want to keep chatting with it, you have to clear the old chat and start with 0 context.

  • For now the only solution is to use the Colab someone has created: https://colab.research.google.com/drive/1VSHMCbBFtOGNbJzoSmEXr-OZF91TpWap
  • Just run everything until the last cell; there you have to change the code first by clicking "show code" and adding --cai-chat, otherwise the interface will be different from the one you're using. You can try adding other flags, but for now I've only experimented with --cai-chat (a rough sketch of the edited line is just after this list).
  • And when you launch, don't copy and paste the localhost address; instead click on the first link.
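For illustration only, the edited launch line in that last cell ends up looking roughly like the line below. I'm assuming the cell already passes something like --share to create the public link, so keep whatever flags are already there and just append --cai-chat:

!python server.py --share --cai-chat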

With this I can generate about 4 tokens/s, and it processes the context very quickly as well.

Is anyone here using a Linux VM to run this on a host system with an AMD CPU and an NVIDIA graphics card?

I've been trying absolutely everything under the sun to make it work, but I keep getting error after error, and the NVIDIA card doesn't seem to be recognized by the VM...
On that note, if I want to run this on the GPU, should I still download and follow the install process for the "AMD" part of the guide because my CPU is AMD, or is that only relevant if I'm running exclusively on the CPU? None of that is made clear in the readmes on GitHub, and I wish it was.

Thanks!

Just leaving my solution here in case anyone else is getting the 'llama_inference_offload' issue running ROCm on an AMD card.

Turns out you need to go into /text-generation-webui/modules/ and edit GPTQ_loader.py and change this line:
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
to
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa-ROCm")))

With that you should be able to load the gpt4-x-alpaca-13b-native-4bit-128g model with the options --wbits 4 --groupsize 128.
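For reference, if you launch the server directly, the full command would look something like the line below; --model is optional here since you can also select the model from the web UI:

python server.py --chat --wbits 4 --groupsize 128 --model gpt4-x-alpaca-13b-native-4bit-128g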

I was also having a ton of crashes once I had it running, but it turned out that was transient load on my crappy power supply, which I'm running too close to its limit. I fixed that by running a game in the background to keep the load up. Haha.

Hey guys, I get the same problem when trying to load the model "TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ" on ooba's Colab version of the UI. Any hope of getting this solved?
More precisely:
Traceback (most recent call last):
  File "/content/text-generation-webui/server.py", line 59, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/content/text-generation-webui/modules/models.py", line 157, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/content/text-generation-webui/modules/GPTQ_loader.py", line 15, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

EDIT: Added the quote of the error

I fixed the issue by changing "!python server.py --chat --wbits 4 --groupsize 128 --auto-devices" to "!python server.py --chat --auto-devices", but then I got another error: "FileNotFoundError: [Errno 2] No such file or directory: 'models/gpt-x-alpaca-13b-native-4bit-128g/pytorch_model-00001-of-00006.bin'". Not sure if the situation is better or worse.

I received the exact same error, anyone have any ideas? No idea what I'm doing.

Me too, same error.
