Error using ooba-gooba

#6
by blueisbest - opened

Hey! I don't really know what I did wrong, so I'm just going to post the error I got. I also followed this tutorial: https://youtu.be/nVC9D9fRyNU

Traceback (most recent call last):
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "C:\Users\admin\Desktop\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

Watch the video again and follow the directions more carefully.

Considering your title, "Error using ooba-gooba", your key clue should be:
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

@blueisbest did you manage to fix the issue? I'm getting the same error.

It seems you're using an AMD CPU, which has limited support. Yep, we've got the same problem. Sadly, your only way of using this is the LlamaCPP method.

If anyone here has any other methods, please consider sharing. Thanks.

Why would having an AMD CPU be an issue? I haven't heard about that before.

You need to update your start-webui.bat file so that the "call python server.py" line has the arguments --chat --wbits 4 --groupsize 128.
That fixed it for me.
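For reference, the edited call line in start-webui.bat ends up looking something like the line below (the rest of the file may differ depending on your installer version):

call python server.py --chat --wbits 4 --groupsize 128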

I am on an AMD CPU too, and I'm getting this error:
import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

I get the same error on Intel Xeon Cascade Lake P-8259L / NVIDIA T4 (Ubuntu 20.04)

I have the same issue. Something seems wrong with the installer: if you select CPU during installation, it skips installing GPTQ, but once you run start-webui it still tries to load GPTQ.

I fixed the issue by changing "!python server.py --chat --wbits 4 --groupsize 128 --auto-devices" to "!python server.py --chat --auto-devices", but then I got another error: "FileNotFoundError: [Errno 2] No such file or directory: 'models/gpt-x-alpaca-13b-native-4bit-128g/pytorch_model-00001-of-00006.bin'". Not sure if the situation is better or worse.

Same problem here.

Also passed --wbits 4 --groupsize 128 to server.py

I get this error after following https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model: "Could not find the quantized model in .pt or .safetensors format, exiting..." Any ideas?

Either you are trying to load a non-quantized model, in which case remove --wbits 4 --groupsize 128, or you didn't download the quantized model into the models folder.
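As a rough sanity check, a correctly downloaded quantized model should leave you with a layout like the sketch below (the names are just an example, based on the Vicuna model mentioned later in this thread; the loader looks for a .pt or .safetensors file):

models\
  anon8231489123_vicuna-13b-GPTQ-4bit-128g\
    vicuna-13b-4bit-128g.safetensors
    config.json, tokenizer files, etc.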

Fixed it by following step 1 from here: https://github.com/oobabooga/text-generation-webui/wiki/LLaMA-model

I am getting "OSError: CUDA_HOME environment variable is not set. Please set it to your CUDA install root."

Didn't you need a CUDA GPU for this to work?

I am using a CUDA GPU.

With a GPU it works fine; you start to have problems when you try to use it with a CPU.

Oh sorry, I haven't tried CPU yet. I'm trying to get it working well on an AWS GPU before I try to reduce cost.

import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

Having the exact same issue

I have the exact same problem and I'm using an AMD CPU, so my concern is whether it's fixable. Should I wait for a fix or delete everything I've downloaded?

I got this error too, please help:

CUDA SETUP: Loading binary E:\vicuna-chatgpt4\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cpu.dll...
E:\vicuna-chatgpt4\oobabooga-windows\installer_files\env\lib\site-packages\bitsandbytes\cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\modules\models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "E:\vicuna-chatgpt4\oobabooga-windows\text-generation-webui\modules\GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'
Press any key to continue . . .

Trying on M1:
===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues

CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
CUDA SETUP: Required library version not found: libsbitsandbytes_cpu.so. Maybe you need to compile it from source?
CUDA SETUP: Defaulting to libbitsandbytes_cpu.so...
dlopen(/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so, 0x0006): tried: '/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cpu.so' (not a mach-o file)
/Users/aryasarukkai/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/cextension.py:31: UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
Traceback (most recent call last):
  File "/Users/aryasarukkai/text-generation-webui/server.py", line 302, in <module>
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/Users/aryasarukkai/text-generation-webui/modules/models.py", line 100, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/Users/aryasarukkai/text-generation-webui/modules/GPTQ_loader.py", line 14, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

I've tried all the solutions in this thread so far, but no luck unfortunately.

To solve the "ModuleNotFoundError: No module named 'llama_inference_offload'" problem I followed the advice of @synthetisoft and did:

cd \oobabooga-windows\text-generation-webui
mkdir repositories
cd repositories
git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda

It worked, I got past that stage.


Now I get stuck at this line: "storage = cls(wrap_storage=untyped_storage)"

I think I don't have enough memory, or not enough free memory; I have roughly 7 GB free.
Memory consumption spikes to 100% and the script crashes.

Here's my info up to that point:


UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\anon8231489123_vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Loading model ...

[
3 times the same warning for files storage.py:899, _utils.py:776 and torch.py:99:
UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
]
storage = cls(wrap_storage=untyped_storage)
Press any key to continue . . . [That's where the script crashes]


To the line that starts the web UI in start-webui.bat:
call python server.py --auto-devices --chat --wbits 4 --groupsize 128

I tried adding "--pre_layer 20" (I also tried --pre_layer 5, 50, 100, and 1000), with no success. Same result.
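For clarity, the edited line I was testing looked something like this (20 being just one of the --pre_layer values I tried):

call python server.py --auto-devices --chat --wbits 4 --groupsize 128 --pre_layer 20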


My specs:

Windows 11, 16 GB memory, no GPU (to speak of).

CPU

Intel(R) Core(TM) i5-10400 CPU @ 2.90GHz

Base speed:	2.90 GHz
Sockets:	1
Cores:	6
Logical processors:	12
Virtualization:	Enabled
L1 cache:	384 KB
L2 cache:	1.5 MB
L3 cache:	12.0 MB

Memory

16.0 GB

Speed:	2666 MHz
Slots used:	2 of 4
Form factor:	DIMM
Hardware reserved:	151 MB

GPU 0 (on-board)

Intel(R) UHD Graphics 630

Driver version:	30.0.101.1273
Driver date:	14/01/2022
DirectX version:	12 (FL 12.1)
Physical location:	PCI bus 0, device 2, function 0

Utilization	1%
Dedicated GPU memory	
Shared GPU memory	0.4/7.9 GB
GPU Memory	0.4/7.9 GB

I tried changing the flag back to --cai-chat; now I get a different error saying I don't have enough memory:


UserWarning: The installed version of bitsandbytes was compiled without GPU support. 8-bit optimizers and GPU quantization are unavailable.
warn("The installed version of bitsandbytes was compiled without GPU support. "
Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
CUDA extension not installed.
Found the following quantized model: models\anon8231489123_vicuna-13b-GPTQ-4bit-128g\vicuna-13b-4bit-128g.safetensors
Traceback (most recent call last):
[... python files with their code lines]
RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 35389440 bytes.
Press any key to continue . . .

OK, so I tried bumping up my virtual memory and it seems to have helped. I now get past THIS memory issue, but I get an error saying that PyTorch was installed without conda support, or the other way around... This is as far as I've gotten for now. Going back to my real life :-)

I've been trying to fix this problem for days now, and here's what I've learned so far:

  1. You got the error "import llama_inference_offload / ModuleNotFoundError: No module named 'llama_inference_offload'" because you're missing the repositories folder and the GPTQ-for-LLaMa folder inside it. You can follow @synthetisoft's step 1 above, or try this fix if it still doesn't work: https://github.com/oobabooga/text-generation-webui/issues/416#issuecomment-1475078571

  2. You also get this error if you installed with CPU in mind; in that case the repositories folder will be missing, because it's only needed for the GPU (CUDA) models. Another cause is trying to load a GPU model without GPTQ-for-LLaMa and the CUDA packages installed. You can recognize CPU models by their "ggml-" prefix.

  3. I tried to fix everything by following all the steps above, but I can't make it work with a CPU, so I came to the conclusion that it only works with an NVIDIA GPU.

       + Solution:
    
  • For those of you trying to use it with a CPU, I have good news: there's an alternative and it's very simple. It's called "Koboldcpp"; it's like llama.cpp but with the Kobold web UI. You get all the features oobabooga has to offer if you don't mind learning how to use the Kobold web UI.

  • For those of you trying to use it with a GPU: sorry, this only works with a CPU for now.

       + Installation:
    
  1. Go to " https://github.com/LostRuins/koboldcpp " you can read the description if you want.
  2. Scroll down to Usage you will see the blue Download link click on it.
  3. You can read the description of how to use it and click download the koboldcpp.exe
  4. After that you can download the CPU model of the GPT x ALPACA model here: https://huggingface.co/anon8231489123/gpt4-x-alpaca-13b-native-4bit-128g/tree/main/gpt4-x-alpaca-13b-ggml-q4_1-from-gptq-4bit-128g .It will take a while since it's about 10GB.
  5. Just drag and drop the model or manually search for the ggml model yourself this work for every CPU model. Next wait until it finished loading the model and copy the http://localhost:5001
    and paste it on your browser.
  6. you can find out more about koboldcpp and how to use it here: https://www.reddit.com/r/LocalLLaMA/comments/12cfnqk/koboldcpp_combining_all_the_various_ggmlcpp_cpu/
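If you'd rather use the command line than drag and drop, something like the line below should also work. I'm assuming your build accepts a --port flag and the model path as an argument; run koboldcpp.exe --help to confirm the exact options, and replace the placeholder path with the ggml .bin file you downloaded:

koboldcpp.exe --port 5001 path\to\your-ggml-model.bin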

That's all for now, hope this helps.

The script has been updated for CPU usage. Run 'iex (irm vicuna.tb.ag)' in PowerShell (in the directory you want to install the UI in), and the first question will let you install just the updated CPU version of the model. I first deleted the oobabooga directory and ran the script. It loaded a lot into memory and the page file, but it works pretty well on my i5 with 16 GB and lots of Chrome tabs open... Great work.

See this video:
https://youtu.be/d4dk_7FptXk

I tried it, and it also seems to work with the GPT4 x Alpaca CPU model. But it uses 20 GB of my 32 GB of RAM and only manages to generate 60 tokens in 5 minutes. In Koboldcpp I can generate 500 tokens in only 8 minutes, and it only uses 12 GB of my RAM. I don't know how it manages to use 20 GB of my RAM and still only generate 0.17 tokens/s. I guess I'll stick with Koboldcpp. But what about you, did you get faster generation when you used the Vicuna model?

I'm not sure how to measure the token generation rate. It takes a bit of time for the model to start responding the first time it is loaded, but subsequent interactions are faster. Slower than GPT-4, but faster than I can manage to read and understand, meaning it types faster than my in-depth reading speed. I guess if I try I can read faster than it types, or perhaps I'm just a slow reader :P English isn't my first language. Again, my specs are listed above, but it's a simple i5, 16 GB RAM, 240 GB SSD, Windows 11 with an auto-managed page file. It works.

Can you test with max context size? A character or chat history that uses the whole 2048-token context.

I've done some tests on oobabooga with and without character context, on 2 models.

  • Vicuna 13B without character context: vicuna test1.png; with character context: vicuna test2.png

  • GPT4xAlpaca 13B without character context: Gpt4xalpaca test1.png; with character context: Gpt4xalpaca test2.png

  • My CPU is a Ryzen 5 4600H:

  • 6 cores
  • 12 threads
  • Base clock 3.0 GHz
    And I have 32 GB of RAM.

PS: I don't have any character context that reaches 2048 tokens, but you can imagine it takes longer the more context you have. In my experiments Koboldcpp seems to process context and generate faster than oobabooga, but oobabooga seems to give slightly better responses and doesn't cut off the character's output like Koboldcpp does.

@Maruno

For ggml-vicuna-13b-4bit-rev1.bin:

Output generated in 429.89 seconds (0.47 tokens/s, 200 tokens, context 1317, seed 409617372)
Output generated in 170.12 seconds (1.18 tokens/s, 200 tokens, context 435, seed 22146058)
Output generated in 247.94 seconds (0.81 tokens/s, 200 tokens, context 633, seed 1962886833)
Output generated in 632.09 seconds (1.58 tokens/s, 1000 tokens, context 647, seed 2070444208)
Output generated in 572.93 seconds (0.79 tokens/s, 450 tokens, context 1274, seed 926636926)

I tried going to max context but it crashed on me for now. I was also doing other things and my PC had many programs running while tokenization took place, so the numbers aren't accurate.
llama_tokenize: too many tokens
raise RuntimeError(f'Failed to tokenize: text="{text}" n_tokens={n_tokens}')

Well, if you want to use oobabooga and have only a CPU, it will get slower the more context it has, so you can only chat with it for a certain amount of time; in short, it has a short memory. If you want to keep chatting with it, you have to clear the old chat and start with 0 context.

  • For now the only solution is to use the Colab someone has created: https://colab.research.google.com/drive/1VSHMCbBFtOGNbJzoSmEXr-OZF91TpWap
  • Just run everything until the last cell; there you have to change the code first by clicking "show code" and adding --cai-chat, otherwise the interface will be different from the one you're using. You can try adding other flags, but for now I've only experimented with --cai-chat (a rough sketch of the edited line is just after this list).
  • And when you launch, don't copy and paste the localhost address; instead click on the first link.
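For illustration only, the edited launch line in that last cell ends up looking roughly like the line below. I'm assuming the cell already passes something like --share to create the public link, so keep whatever flags are already there and just append --cai-chat:

!python server.py --share --cai-chat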

With this I can generate about 4 tokens/s, and it processes the context very quickly as well.

Is anyone here using a Linux VM to run this on a host system with an AMD CPU and an NVIDIA graphics card?

I've been trying absolutely everything under the sun to make it work, but I keep getting error after error, and the NVIDIA card doesn't seem to be recognized by the VM...
On that note, if I want to run this on the GPU, should I still download and follow the install process for the "AMD" part of the guide because my CPU is AMD, or is that only relevant if I'm running exclusively on the CPU? None of that is made clear in the readmes on GitHub, and I wish it was.

Thanks!

Just leaving my solution here in case anyone else is getting the 'llama_inference_offload' issue running ROCm on an AMD card.

Turns out you need to go into /text-generation-webui/modules/ and edit GPTQ_loader.py and change this line:
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa")))
to
sys.path.insert(0, str(Path("repositories/GPTQ-for-LLaMa-ROCm")))

With that you should be able to load the gpt4-x-alpaca-13b-native-4bit-128g model with the options --wbits 4 --groupsize 128.
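For reference, if you launch the server directly, the full command would look something like the line below; --model is optional here since you can also select the model from the web UI:

python server.py --chat --wbits 4 --groupsize 128 --model gpt4-x-alpaca-13b-native-4bit-128g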

I was also having a ton of crashes once I had it running, but it turned out that was transient load on my crappy power supply, which I'm running too close to its limit. I fixed that by running a game in the background to keep the load up. Haha.

Hey guys, I get the same problem when trying to load the model "TheBloke_Wizard-Vicuna-13B-Uncensored-GPTQ" on ooba's Colab version of the UI. Any hope of getting this solved?
More precisely:
Traceback (most recent call last):
  File "/content/text-generation-webui/server.py", line 59, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/content/text-generation-webui/modules/models.py", line 157, in load_model
    from modules.GPTQ_loader import load_quantized
  File "/content/text-generation-webui/modules/GPTQ_loader.py", line 15, in <module>
    import llama_inference_offload
ModuleNotFoundError: No module named 'llama_inference_offload'

EDIT: Added the quote of the error

I fixed the issue by changing "!python server.py --chat --wbits 4 --groupsize 128 --auto-devices" to "!python server.py --chat --auto-devices", but then I got another error: "FileNotFoundError: [Errno 2] No such file or directory: 'models/gpt-x-alpaca-13b-native-4bit-128g/pytorch_model-00001-of-00006.bin'". Not sure if the situation is better or worse.

I received the exact same error, anyone have any ideas? No idea what I'm doing.

Me too, same error.
