[GUIDE] Installing Vicuna on Linux

#34
by ltnn1 - opened

I've been banging my head against the wall for the last few days trying to get this to work on Linux, so I wrote this guide so that fellow newbs don't have to go through what I went through. My distro is Arch (btw), but the guide itself is distro-agnostic, and I believe it may even work on macOS or Windows WSL2 (do try it and let me know if it does).

  1. Install conda (or miniconda, or mamba, or micromamba; choose your poison).
  2. conda create -n vicuna-matata pytorch torchvision torchaudio pytorch-cuda=11.7 cuda-toolkit -c 'nvidia/label/cuda-11.7.0' -c pytorch -c nvidia
    conda activate vicuna-matata
  3. git clone https://github.com/oobabooga/text-generation-webui
    cd text-generation-webui
    pip install -r requirements.txt
  4. python download-model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g
  5. mkdir repositories
    cd repositories
    git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda
    cd GPTQ-for-LLaMa
    pip install -r requirements.txt
    python setup_cuda.py install
  6. Go back to the text-generation-webui folder (cd ../..).
  7. python server.py --model anon8231489123/vicuna-13b-GPTQ-4bit-128g --auto-devices --wbits 4 --groupsize 128 --chat
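
Optional sanity check (not strictly part of the steps above, just a suggestion): before step 7, confirm that PyTorch inside the env can actually see your GPU, since a broken CUDA setup leads to confusing errors later on:

    python -c "import torch; print(torch.__version__, torch.cuda.is_available())"

If this prints False, go back to step 2 before continuing.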

I go into a bit more detail, give some tips, and include a debugging section at the end in this more detailed version of the guide, so that I don't clutter things up here. Let me know if you have any questions.

Small correction:

python download_model.py anon8231489123/vicuna-13b-GPTQ-4bit-128g

This should be download-model.py (hyphen, not underscore).

Fixed, thanks!

Hi, I tried this - thanks for putting it together!
However, I'm running into the following error:

python server.py --model anon8231489123/vicuna-13b-GPTQ-4bit-128g --auto-devices --wbits 4 --groupsize 128 --chat
bin /home/ubuntu/anaconda3/envs/vicuna-matata/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so
Loading anon8231489123/vicuna-13b-GPTQ-4bit-128g...
Could not find the quantized model in .pt or .safetensors format, exiting...

When searching for it, it is here:

find ./ -name "*safetensors"
./models/anon8231489123_vicuna-13b-GPTQ-4bit-128g/vicuna-13b-4bit-128g.safetensors

What might I be missing?

How did you download the model, using the download-model.py script or by cloning directly from Hugging Face? Have you tried conda deactivate, then conda remove -n vicuna-matata --all, and then doing all the steps above again? It might also help to specify the exact Python version in the conda create command, i.e. conda create -n vicuna-matata pytorch torchvision torchaudio python=3.10.9 pytorch-cuda=11.7 cuda-toolkit -c 'nvidia/label/cuda-11.7.0' -c pytorch -c nvidia

And finally, most of my problems with CUDA have been with GPTQ (step 5 above). When you ran python setup_cuda.py install, did it throw any errors?
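
One way to check whether the GPTQ kernel actually built (just a suggestion, not something from the guide itself): setup_cuda.py should install a Python module called quant_cuda, so from inside the env you can try

    python -c "import quant_cuda; print('quant_cuda loaded OK')"

If that import fails, the CUDA extension never installed properly, even if the compiler warnings looked harmless.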

Thanks for the hint.

Yes, I downloaded it through the download-model.py script. When running python setup_cuda.py install I see a few warnings, like the following one:

/home/ubuntu/anaconda3/envs/vicuna-matata/lib/python3.10/site-packages/torch/include/ATen/core/TensorBody.h:244:1: note: declared here
244 | T * data() const {
| ^ ~~
quant_cuda_kernel.cu:507:2019: warning: ‘T* at::Tensor::data() const [with T = int]’ is deprecated: Tensor.data() is deprecated. Please use Tensor.data_ptr() instead. [-Wdeprecated-declarations]
507 | AT_DISPATCH_FLOATING_TYPES(
|

But I assumed that if something were actually failing, I would see an error.

Are there other dependencies I may have missed? And why is it not picking up ./models/anon8231489123_vicuna-13b-GPTQ-4bit-128g/vicuna-13b-4bit-128g.safetensors, which clearly exists?

I will do a bit more research, but if you spot something, it would be much appreciated!

It might have something to do with this issue or this one.

From my experience, problems with quantized models are almost always related to GPTQ (and its CUDA implementation; check whether you're on the Triton or the CUDA branch, since Triton is Linux-only). Is the problem only with quantized models, or have you tried loading anything else, like GPT-J, GPT-NeoX, or the vanilla LLaMAs? Also, are you running on a GPU or CPU only?
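
If you're not sure which branch you ended up on, a quick check (paths assume you followed the steps above) is:

    cd repositories/GPTQ-for-LLaMa
    git branch --show-current

It should print cuda if you cloned with -b cuda as in step 5.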

If conda deactivate and removing the env doesn't fix it, then maybe try deleting the text-generation-webui folder and starting again from scratch. Also make sure to run pip install -r requirements.txt in BOTH text-generation-webui and GPTQ-for-LLaMa (updated in the guide above).
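
Spelled out, that reset would look roughly like this (a rough sketch; adjust to wherever you cloned things):

    conda deactivate
    conda remove -n vicuna-matata --all
    rm -rf text-generation-webui

and then repeat the guide from step 2.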

From what I can see, oobabooga's fork of GPTQ had some changes to its requirements.txt two days ago; that might be the cause.

If you've tried all of the above and are really desperate, try one of those one-click-installer scripts; they might do a better job of preparing the env and dependencies than I did.

Another thing I've noticed is that my CLI doesn't print this line: bin /home/ubuntu/anaconda3/envs/vicuna-matata/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117_nocublaslt.so, so that is another thing to pay attention to.

libbitsandbytes had an error that required a dirty fix in the previous version of text-generation-webui, so you might want to check the thread out.

try --model anon8231489123_vicuna-13b-GPTQ-4bit-128g
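
To expand on that: as far as I can tell, download-model.py replaces the / in the repo name with _ when it creates the folder under models/, and --model expects that folder name rather than the Hugging Face repo name. So the full command would look something like:

    python server.py --model anon8231489123_vicuna-13b-GPTQ-4bit-128g --auto-devices --wbits 4 --groupsize 128 --chat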

Kind of hit the wall with this. Tried on Jarvislabs.ai; ninja did not build due to a CUDA version mismatch, hmf is missing, and thus lots of division-by-zero errors ...
My local server is also virtualised and containerized: Arch LXD containers on an Arch KVM guest on an Arch host (Zen).

So my question is: is there a freaking way to run this using transformers or accelerate, or is that not possible because they don't support 4-bit models or something? I am new to these language models (2 weeks)
and still trying to grasp LLaMA, GPT, and GPTQ. Can you give me a hand?

Thank you for sharing and best regards.

Do you still need help?

Hi, I'm trying to run this in a headless environment without GPUs. Possible? What changes about the process?

Check out https://github.com/ggerganov/llama.cpp
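
A rough sketch of what that looks like (assuming you have a GGML-converted Vicuna model, which is a separate download from the GPTQ weights used in this guide; the model filename below is just a placeholder):

    git clone https://github.com/ggerganov/llama.cpp
    cd llama.cpp
    make
    ./main -m ./models/vicuna-13b-ggml-q4_0.bin -p "Hello, how are you?" -n 128

Everything runs on the CPU, so it works headless and without CUDA, just more slowly.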

Thanks for the above write-up; this command works on a g4dn.4xlarge:
"(vicuna-matata) ubuntu@ip-10-0-0-69:~/text-generation-webui$ python server.py --auto-devices --wbits 4 --groupsize 128 --chat "

I have an error when I send a message:

INFO:Loading anon8231489123_vicuna-13b-GPTQ-4bit-128g...
ERROR:No model is loaded! Select one in the Model tab.

Even though the model is selected in the Model tab.

I tried everything I could; it didn't work, and it kept giving more and more bizarre errors. I give up. Now, how do I uninstall all of these things (Python packages, etc.) from my system?
Can anyone help me with that? Please…
