I can't get GGML GPU acceleration to work with Wizard-Vicuna-30B 5_1?

#3 opened by Goldenblood56

Well actually, I can't get GPU acceleration to work with any model, but I've only tried this model, and only the 5_1 version of it.
I'm using Windows 10

I followed the steps here. At least I think I did.
https://github.com/oobabooga/text-generation-webui/blob/main/docs/llama.cpp-models.md

I tried
pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

I don't think I noticed any errors while performing these steps. Tried it a few times. I updated Ooba as well.

Afterwards I used these settings, but I tried all sorts of different settings: only Pre_Layer, only n-gpu-layers, saving and reloading, etc.
But my VRAM does not get used at all. I even tried turning on gptq-for-llama, but I get errors. I don't know what that even is, though.
[Screenshot of settings: Capture.PNG]

If anyone has any ideas or can confirm if this model supports or does not support GPU Acceleration let me know.
Thank you.

I've tested text-generation-webui and it definitely does work with GGML models with CUDA acceleration. And this model does support that - all GGML models do; there aren't "models with GPU" and "models without".

As you're on Windows, it may be harder to get it working. Do you have the CUDA toolkit installed? That's a requirement for compiling llama-cpp-python with CUDA support.
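
If you're not sure whether the toolkit is installed, here's a trivial check (just a sketch; it asks nvcc, the CUDA compiler that the cuBLAS build needs, to report its version):

import subprocess

# FileNotFoundError here means nvcc isn't installed or isn't on PATH, so
# pip can't compile llama-cpp-python with -DLLAMA_CUBLAS=on.
subprocess.run(["nvcc", "--version"], check=True)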

On Linux I can install llama-cpp-python like so:

pip uninstall -y llama-cpp-python
CMAKE_ARGS="-DLLAMA_CUBLAS=on" FORCE_CMAKE=1 pip install llama-cpp-python --no-cache-dir

And it then works with GPU offload, including in text-generation-webui.
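
If you want to confirm the offload outside of text-generation-webui, here's a minimal sketch using llama-cpp-python directly (the model filename is just an example; point it at whatever GGML file you downloaded):

from llama_cpp import Llama

# n_gpu_layers > 0 asks llama.cpp to offload that many layers to the GPU.
# A cuBLAS build prints "BLAS = 1" in the load log and VRAM usage rises;
# a CPU-only build stays at "BLAS = 0".
llm = Llama(
    model_path="Wizard-Vicuna-30B-Uncensored.ggmlv3.q5_1.bin",  # example filename
    n_gpu_layers=20,
)
print(llm("Hello, my name is", max_tokens=16)["choices"][0]["text"])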

I'm afraid I'm unable to test anything on Windows, so I'm not sure what to suggest. I can help with getting it working in WSL2, if you feel like installing that.

Thanks TheBloke. You're a hard worker, so I'm surprised and appreciate you even answering this, and this quickly as well. I am starting to figure all of this out, but I'm still very much lacking. I don't do any sort of non-Windows OS stuff, and I don't know what WSL2 is. I do not know if I have the CUDA toolkit. If it's only needed for GPU acceleration, and not for llama.cpp, SD, or Ooba in general, then that may be my issue. But it's not like I got an error when I input...

pip uninstall -y llama-cpp-python
set CMAKE_ARGS="-DLLAMA_CUBLAS=on"
set FORCE_CMAKE=1
pip install llama-cpp-python --no-cache-dir

However, at least I have something to look into. I read about others having my issue too. At least the good news is that you answered the most important question: all GGML models support GPU acceleration, so now I only need to figure out my own issue and don't have to download other models to troubleshoot. If I find the solution I will likely post it here. Thanks. I may look into things like WSL2, etc. However, it takes me a while to pick up on all of this.

OK, good luck!

I ended up getting it working. There were several issues; it's hard to explain.
I was going to install WSL2, but I got it working in Windows 10.

@Goldenblood56 another alternative you might consider for Windows is KoboldCpp; there are ready-to-use exes which come with GPU support, no installation required.

Thanks, but I finally got it all working on Ooba, and really well. I can run up to 65B models at acceptable but slower speeds, and I can run 30B models very well, with quite good speed: GGML on CPU with GPU offloading.

It is always helpful when you share the fix.

I did so much that I can't name all the possible solutions. I just followed a guide from someone else who did it. I re-installed everything, but first I uninstalled it all to do a clean install: remove Visual Studio and CUDA, then install Visual Studio 2022 and CUDA Toolkit 12.1. But I think one of the major issues was that I was not activating the right conda environment, so when I was installing llama-cpp-python with GPU support I was not really installing it where Ooba could see it.
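
For anyone hitting the same conda environment trap, a minimal sanity check (run it with the Python you think you're installing into):

import sys

# If sys.prefix doesn't point at Ooba's conda environment, then
# "pip install llama-cpp-python" is going somewhere the webui never looks.
print(sys.prefix)
print(sys.executable)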

Thank you. I updated to CUDA 12.1, ran the following from Ooba's conda environment, and was able to get it working:

set FORCE_CMAKE=1
set "CMAKE_ARGS=-DLLAMA_CUBLAS=on"
set "CUDAFLAGS=-arch=all -lcublas"
python -m pip install git+https://github.com/abetlen/llama-cpp-python
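
A quick way to confirm the new build is the one actually being picked up (just a sanity-check sketch; run it from that same conda environment):

import llama_cpp

# Shows which copy of the library Python imports; it should live inside
# Ooba's conda environment, not some other Python install.
print(llama_cpp.__file__)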

You're welcome. Glad it worked. I also think those commands have changed since I did it. This stuff all evolves too quickly! lol
