torch.cuda.OutOfMemoryError: CUDA out of memory with an RTX 3070 8GB

#14
by VAVUSH - opened

I'm getting this issue:
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 MiB (GPU 0; 8.00 GiB total capacity; 7.07 GiB already allocated; 0 bytes free; 7.31 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
Output generated in 3.64 seconds (0.00 tokens/s, 0 tokens, context 43)

I'm wondering if it has to do with the fact that I'm running it on a laptop with an RTX 3070 8GB that also has an additional GPU, which is what gives the CUDA setup of 8.6.

CUDA SETUP: CUDA runtime path found: C:\AI\installer_files\env\bin\cudart64_110.dll
CUDA SETUP: Highest compute capability among GPUs detected: 8.6
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary C:\AI\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll...
Loading vicuna-13b-GPTQ-4bit-128g...

edit start-webui.bat and replace all the text with:

@echo off

@echo Starting the web UI...

cd /D "%~dp0"

set MAMBA_ROOT_PREFIX=%cd%\installer_files\mamba
set INSTALL_ENV_DIR=%cd%\installer_files\env

if not exist "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" (
call "%MAMBA_ROOT_PREFIX%\micromamba.exe" shell hook >nul 2>&1
)
call "%MAMBA_ROOT_PREFIX%\condabin\micromamba.bat" activate "%INSTALL_ENV_DIR%" || ( echo MicroMamba hook not found. && goto end )
cd text-generation-webui

call python server.py --auto-devices --chat --threads 8 --wbits 4 --groupsize 128 --pre_layer 30

:end
pause

It works now. Fun fact: the PC makes a noise for every single word of the response... it sounds like a hard drive, but everything here is solid state... curious...
Anyway, it is pretty slow; from the videos I was expecting it to be a bit faster.
Output generated in 25.03 seconds (0.96 tokens/s, 24 tokens, context 43)
Output generated in 46.60 seconds (1.12 tokens/s, 52 tokens, context 81)
Output generated in 36.89 seconds (1.11 tokens/s, 41 tokens, context 146)
Output generated in 92.08 seconds (1.11 tokens/s, 102 tokens, context 383)
Output generated in 13.66 seconds (0.95 tokens/s, 13 tokens, context 26)
Output generated in 8.67 seconds (0.92 tokens/s, 8 tokens, context 26)
Output generated in 32.35 seconds (1.08 tokens/s, 35 tokens, context 33)

I'm wondering if it has to do with the fact that I'm running it on a laptop with an RTX 3070 8GB that also has an additional GPU, which is what gives the CUDA setup of 8.6.

That's just a low amount of VRAM for this model.

It works now. Fun fact: the PC makes a noise for every single word of the response... it sounds like a hard drive, but everything here is solid state... curious...

Same here. It could be coil whine.

Anyway, it is pretty slow; from the videos I was expecting it to be a bit faster.

Now you are splitting the model between VRAM and system RAM with "--pre_layer 30". That's why it's slow.
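
For a rough sense of why: --pre_layer controls how many of the model's layers go to the GPU, with the rest running on the CPU. A back-of-the-envelope estimate of the GPU share, assuming the usual LLaMA-13B shapes (40 layers, hidden size 5120, intermediate size 13824) and roughly 0.5 bytes per weight for 4-bit GPTQ, ignoring the KV cache and activations entirely:

# rough estimate only; the shapes below are the standard LLaMA-13B values (an assumption),
# and KV cache / activations / CUDA overhead are not counted
hidden, inter = 5120, 13824
per_layer_params = 4 * hidden * hidden + 3 * hidden * inter    # attention + MLP weights per layer
per_layer_gb = per_layer_params * 0.5 / 1024**3                # ~0.5 bytes per weight at 4-bit
for pre_layer in (20, 25, 30):
    print(f"--pre_layer {pre_layer}: ~{pre_layer * per_layer_gb:.1f} GB of layer weights on the GPU")

That lands around 3 GB of weights for 20 layers and about 4.4 GB for 30, so on an 8 GB card there isn't much left once the KV cache grows with a long chat, which matches the "works for short prompts, dies on long contexts" pattern in this thread.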

edit start-webui.bat and replace all the text with: (the same start-webui.bat replacement quoted above)

So far so good, it is working.
But when summarizing a long text I'm still getting:
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 262.00 MiB (GPU 0; 8.00 GiB total capacity; 6.74 GiB already allocated; 0 bytes free; 7.06 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF
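
One thing worth trying before dropping pre_layer any further is the max_split_size_mb hint from the error message itself; it only really helps when reserved memory is well above allocated (fragmentation), and I haven't verified it on this exact setup. A minimal sketch, assuming it goes near the top of text-generation-webui\server.py before anything imports torch:

import os
# PYTORCH_CUDA_ALLOC_CONF must be set before the CUDA caching allocator is first used,
# i.e. before torch is imported anywhere; 128 is just a starting value to experiment with
os.environ.setdefault("PYTORCH_CUDA_ALLOC_CONF", "max_split_size_mb:128")

The same variable can also be set with a plain "set" line in start-webui.bat before the python call, which avoids editing server.py.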

RuntimeError: [enforce fail at C:\cb\pytorch_1000000000000\work\c10\core\impl\alloc_cpu.cpp:72] data. DefaultCPUAllocator: not enough memory: you tried to allocate 141557760 bytes.

I tried every hack proposed here and it's still not working on my 3070.

I don't know what I can do to make it work.
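
Worth noting that this second error comes from the CPU allocator, not CUDA: 141557760 bytes is only about 135 MiB, so with --pre_layer pushing part of the model into system RAM the machine is running out of RAM or pagefile rather than VRAM. A quick way to check the headroom, assuming psutil happens to be available in the env (it may not be):

import psutil
# how much system RAM is left for the CPU-offloaded layers
vm = psutil.virtual_memory()
print(f"RAM free: {vm.available / 1024**3:.1f} GiB of {vm.total / 1024**3:.1f} GiB")

If that number is tiny, closing other applications or enlarging the Windows pagefile is probably more useful than another pre_layer tweak.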

Yep, same problem here on a 3060 with 12 GB VRAM.

Works fine for me, but slow. Hmm. Thanks anyway!

edit start-webui.bat and replace all the text with: (the same start-webui.bat replacement quoted above)

Illuminati confirmed, thanks

I still had lots of memory errors on my 3080 8GB laptop GPU with the --pre_layer 30 flag, especially with large, complex personas.
Since setting pre_layer to 20 I haven't had a single memory error, but it's getting really slow.

Output generated in 97.41 seconds (0.55 tokens/s, 54 tokens, context 856, seed 1479715562)
Output generated in 102.62 seconds (0.57 tokens/s, 58 tokens, context 921, seed 1659762763)
Output generated in 161.13 seconds (0.58 tokens/s, 93 tokens, context 991, seed 479945416)

--pre_layer 25 seems to be a good compromise.
Output generated in 255.88 seconds (0.78 tokens/s, 199 tokens, context 276, seed 428896125)

I managed to get rid of the error, at least on my 8GB 2060 Super, by opening webui.py in Notepad++ and adding more parameters on line 13:
CMD_FLAGS = '--chat --model-menu --auto-devices --threads 8 --wbits 4 --groupsize 128 --pre_layer 30'
That's at least what I'm using for the vicuna model, which is probably where the 128 for group size comes from.

I wasn't able to get the arguments to launch with the "start windows" bat file at all; it's the call to python webui.py that "should" add those arguments, but for some reason it errors out, so I did it manually. It seems to work; it's as slow as dial-up internet when I ask it a question, but hey, it worked.

Please handle with care; while we're on the metaphor, this UPS box doesn't have the kind of maps some of you guys have.
So I get this message after swapping out the start_windows.bat file:

Starting the web UI...
The system cannot find the path specified.
MicroMamba hook not found.
Press any key to continue . . .

I'm assuming MicroMamba is a Python prerequisite? The kind of thing that I've been creating isolated Python environments for in certain directories, in order to install local AI stuff (LLMs, TTS, SD) without screwing up the Python for everything else?

8GB RTX 3070 @ 80-105W

Can I install this MicroMamba with conda? If so, do I just go to the master directory of [oobabooga], as in one folder up from start_windows.bat, and set up the virtual environment there, or do I install MicroMamba somewhere else? Will installing mamba work as a base?
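
For what it's worth, MicroMamba isn't something you install yourself: the one-click installer ships its own copy under installer_files\mamba and the environment under installer_files\env, which is exactly what the MAMBA_ROOT_PREFIX and INSTALL_ENV_DIR lines in the .bat above point at. "MicroMamba hook not found" usually just means those folders don't exist relative to where the .bat is running from. A quick check (a sketch assuming the installer's default layout; run it from the folder that contains start_windows.bat):

from pathlib import Path

# the two files the replacement .bat expects to find, relative to its own folder
root = Path.cwd()
for rel in (r"installer_files\mamba\micromamba.exe", r"installer_files\env\python.exe"):
    print(rel, "->", "found" if (root / rel).exists() else "MISSING")

If both come back MISSING, the bundled environment was never created (or the .bat is in the wrong folder) and there is nothing for it to activate.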

So I just read (I THINK) that it was supposed to give me that response (according to the actual info inside start_windows.bat), and that I was then supposed to open the cmd prompt and go into that particular directory

cd text-generation-webui

and then enter

call python server.py --auto-devices --chat --threads 8 --wbits 4 --groupsize 128 --pre_layer 30

It gave me these error messages:

C:\Users\jattoedaltni\oobabooga_windows\oobabooga_windows\text-generation-webui>call python server.py --auto-devices --chat --threads 8 --wbits 4 --groupsize 128 --pre_layer 30
Traceback (most recent call last):
File "C:\Users\jattoedaltni\oobabooga_windows\oobabooga_windows\text-generation-webui\server.py", line 47, in <module>
from modules import chat, shared, training, ui, utils
File "C:\Users\jattoedaltni\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\chat.py", line 18, in <module>
from modules.text_generation import (generate_reply, get_encoded_length,
File "C:\Users\jattoedaltni\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\text_generation.py", line 17, in <module>
from modules.models import clear_torch_cache, local_rank
File "C:\Users\jattoedaltni\oobabooga_windows\oobabooga_windows\text-generation-webui\modules\models.py", line 10, in <module>
from transformers import (AutoConfig, AutoModel, AutoModelForCausalLM,

ImportError: cannot import name 'BitsAndBytesConfig' from 'transformers' (C:\Users\jattoedaltni\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\__init__.py)

Now when I double-click start_windows.bat it just gives me the same error message in a different shell.
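
The giveaway in that traceback is the path in the ImportError: transformers is being loaded from AppData\Local\Programs\Python\Python310, i.e. the system-wide Python, not the installer_files\env that the .bat is supposed to activate, and an older transformers there won't have BitsAndBytesConfig. Running python server.py from a plain cmd window skips the micromamba activation step, which is why the system interpreter gets picked up. A quick way to confirm which interpreter and which transformers you are actually getting (run it with the same python command you used for server.py):

import sys
import transformers

# sys.executable should point into ...\installer_files\env, not AppData\...\Python310
print(sys.executable)
print(transformers.__version__, transformers.__file__)

Also note the doubled oobabooga_windows\oobabooga_windows in your paths: if installer_files lives in the outer folder while the .bat sits in the inner one, that alone would explain both "The system cannot find the path specified" and the MicroMamba hook error.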
