Won't work with GPTQ

#2
by cmh - opened

Updated text-generation-webui (so GPTQ's repo is on the cuda branch and probably still on an older commit).
Here are the errors with both checkpoints, koala-7B-4bit-128g.olderFormat.pt and koala-7B-4bit-128g:

(D:\AI\textgen-webui\installer_files\env) D:\AI\textgen-webui\text-generation-webui\repositories\GPTQ-for-LLaMa> python llama_inference.py "D:\AI\textgen-webui\text-generation-webui\models\koala-7b-4bit-128g" --wbits 4 --groupsize 128 --load "D:\AI\textgen-webui\text-generation-webui\models\koala-7b-4bit-128g\koala-7B-4bit-128g.pt" --max_length 300 --text "your text"
Loading model ...
Traceback (most recent call last):
File "D:\AI\textgen-webui\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference.py", line 112, in
model = load_quant(args.model, args.load, args.wbits, args.groupsize)
File "D:\AI\textgen-webui\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference.py", line 52, in load_quant
model.load_state_dict(torch.load(checkpoint))
File "D:\AI\textgen-webui\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 2041, in load_state_dict
raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for LlamaForCausalLM:
Missing key(s) in state_dict: "model.layers.0.self_attn.k_proj.bias", "model.layers.0.self_attn.o_proj.bias", "model.layers.0.self_attn.q_proj.bias", "model.layers.0.self_attn.v_proj.bias",[...]
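
In case it helps with debugging: the "Missing key(s)" error means the model that load_quant builds expects per-layer bias tensors that this checkpoint simply doesn't contain, presumably a format mismatch between the GPTQ-for-LLaMa commit used to quantize and the (older) one doing the loading. A minimal sketch for inspecting what the checkpoint actually holds (the path is just the one from the command above, and the key suffixes are what GPTQ-for-LLaMa's QuantLinear uses, if I recall correctly):

import torch

ckpt = r"D:\AI\textgen-webui\text-generation-webui\models\koala-7b-4bit-128g\koala-7B-4bit-128g.pt"
state = torch.load(ckpt, map_location="cpu")  # the .pt file is a plain state_dict

# Count bias tensors vs. quantizer-specific tensors stored in the checkpoint
bias_keys = [k for k in state if k.endswith(".bias")]
qweight_keys = [k for k in state if k.endswith(".qweight")]
print(f"{len(state)} tensors total, {len(bias_keys)} biases, {len(qweight_keys)} qweights")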

I can clone the latest GPTQ cuda branch if needed, just let me know.

edit: I'm using the one-click installer of textgen-webui on Windows natively (no WSL) and ozcur_alpaca-native-4bit\alpaca7b-4bit.pt works fine:
(D:\AI\textgen-webui\installer_files\env) D:\AI\textgen-webui\text-generation-webui\repositories\GPTQ-for-LLaMa>python llama_inference.py "D:\AI\textgen-webui\text-generation-webui\models\ozcur_alpaca-native-4bit" --wbits 4 --groupsize 128 --load "D:\AI\textgen-webui\text-generation-webui\models\ozcur_alpaca-native-4bit\alpaca7b-4bit.pt" --max_length 300 --text "What is an alpaca"
Loading model ...
Done.
What is an alpaca?
An alpaca is a species of South American camelid, belonging to the family Camelidae. It is native to the Andes Mountains of Ecuador, Peru, and Bolivia. Alpacas are smaller than their relative the llama, and are primarily bred for their fiber. Alpaca fiber is softer and finer than llama fiber and is highly valued in the textile industry. Alpacas are also kept as pets, and can be found in many countries around the world. What is the scientific name for the alpaca? The scientific name for the alpaca is V...
What is the scientific name for the alpaca? The scientific name for the alpaca is Vicugna vicugna. What is the average size of an alpaca? The average size of an alpaca is 1.5 to 2.0 meters in height and weighs up to 250 kilograms. What type of fiber does an alpaca produce? An alpaca produces a fine, soft and luxurious fiber called vicuña fiber. What type of color can the alpaca produce? The alpaca can produce colors such as black, brown, white, fawn, silver, and blue. What is the lifespan of an alpaca? The average lif

Well, it worked in textgen-webui with those parameters but output garbage:
python server.py --auto-devices --gpu-memory 4725MiB --wbits 4 --groupsize 128 --model koala-7b-4bit-128g --model_type LLaMA

I'll try both models and report back.
edit:
koala-7B-4bit-128g.olderFormat.pt doesn't load (same error as GPTQ).
koala-7B-4bit-128g.pt output:
Common sense questions and answers

Question: What is an alpaca ?
Factual answer: rf df df dfdfrfffdf visit dfdfdf df/ dfdf df dfdf dfdfdfFF df dfdf /OdfDF df df /ch df /df /df /dfdf dfdfdf /rf dfdf /rf /dfdfdfdfdfdf dfdf /dfdf df df / df df /df / df dfdf / / / df dfdf / /

OK short answer is I don't understand what's going on here. My knowledge is not good enough yet to be able to diagnose what's happening.

What I know for sure:

  1. The GPTQ models I have produced here always produce garbage output in text-generation-webui, and if I try to convert the olderFormat version to GGML using the llama.cpp convert script (the latest version, found in this PR), it seems to convert fine but similarly produces garbage when run in llama.cpp.
  2. If I try to load the unquantized Koala 7B model in HF format in text-generation-webui, it similarly produces garbage output. That model data can be found here: koala-7B-HF. The README explains how I converted the model from EasyLM.
  3. However, if I convert koala-7B-Hf to unquantized GGML format, it loads and runs fine in llama.cpp, producing good output. That model can be found here: https://huggingface.co/TheBloke/koala-7b-ggml-unquantized

So I am rather confused and stuck now.

Given I can't even get the unquantized model to load in text-generation-webui, I am thinking I won't spend any more time diagnosing the GPTQ issue and will instead look again at the EasyLM conversion process. Maybe something is going wrong there. Or I have a problem in the JSON files or something like that.

Any help or suggestions would be much appreciated!

I'm definitely not knowledgeable either; I'd love to reproduce the workflow but I have a very modest machine (Core i5 without AVX2, 16 GB of RAM and a 1060 6GB). I'll still try/investigate and report back if I make any progress.

pulling the latest changes from GPTQ-for-LLaMa cuda branch and re-running python setup_cuda.py install made them work for me.

> pulling the latest changes from GPTQ-for-LLaMa cuda branch and re-running python setup_cuda.py install made them work for me.

OK and it's definitely working reliably for you? I tried that earlier and the first time I tried it it seemed like it worked, but then when I did more tests I started getting garbage again.

Just so I'm clear: you're using text-generation-webui and instead of cloning the oobabooga fork of GPTQ-for-LLaMa you cloned the original GPTQ repo, built it, and that was enough?

I will go try it again myself now.

These are the commands I ran earlier when I tried that:

cd text-generation-webui
mkdir -p repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
python setup_cuda.py install --force

Does that match what you did, @knoopx ?

Yeah it's still not working. I'm currently trying the koala-7B-4bit-128g.pt file, and running the UI with this command:

cd /content/text-generation-webui
python server.py --model koala-7b-4bit --wbits 4 --groupsize 128 --model_type LLaMA

In the UI I'm selecting the Llama_Precise parameters preset (temp = 0.7, top_p = 0.1, top_k = 40, rep_penalty = 1.18, etc)

And here's two examples of output. In one I try just entering the query directly, and in the other I try using the Instruction/Response format

Example 1:
image.png

Example 2:
image.png

It's mostly just garbage?

I have tried some other parameters, like the NovelAI-Pleasing, and sometimes I get something that's at least intelligible. But then the next response will just be 'spanspanspan' over and over..

I'd love to know exactly how you're running it @knoopx and see some example output?

What really confuses me is that GPTQ-for-LLaMa itself can definitely do inference using these files:

cd /content/gptq-llama
CUDA_VISIBLE_DEVICES=0 python llama_inference.py  /content/text-generation-webui/models/koala-7b-4bit --wbits 4 --groupsize 128 --load /content/text-generation-webui/models/koala-7b-4bit/koala-7B-4bit-128g.pt  --max_length=200 --min_length=100 --text "write a story about Kevin"

Produces output:

<s> write a story about Kevin and Daryl trying to recapture Kevin's dragon

One day, Kevin and Daryl decided to try to recapture Kevin's dragon. They knew it would be a challenging task, but they were determined to do it.

They gathered all their supplies and set off on their journey. They traveled for hours, but they couldn't find any sign of the dragon. They were starting to lose hope, when they finally spotted it in the distance.

Kevin and Daryl approached the dragon, and Daryl started to feed it. Kevin's dragon was hungry and eager to eat, and it didn't take long before it was finished.

Kevin was relieved that his dragon was safe and healthy, but he knew that it was still a wild and powerful animal. He decided to keep it as a pet and train it

So I'm really confused as to what's going on with text-generation-webui.

Just updating GPTQ's repo and reinstalling the kernel totally broke it for me. No errors but it wouldn't output anything.
I've just installed VS2019 build tools and miniconda. I'll try with the latest GPTQ's repo.

Just to be thorough:
I deleted the one-click installer's folder, then installed the "Desktop development with C++" workload from the Visual Studio 2019 Build Tools, plus Miniconda.
Finally, I started the Anaconda prompt, created an environment, yada yada:

conda create -n textgen python=3.10.9
conda activate textgen
conda install git ninja
conda install cuda cudatoolkit pytorch torchvision torchaudio pytorch-cuda=11.7 -c pytorch -c nvidia/label/cuda-11.7.0
pip install cchardet chardet bitsandbytes-windows
conda install -c conda-forge cudatoolkit-dev
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
cd text-generation-webui
pip install -r requirements.txt
md repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
"C:\Program Files (x86)\Microsoft Visual Studio\2019\BuildTools\VC\Auxiliary\Build\vcvars64.bat"
set DISTUTILS_USE_SDK=1
python setup_cuda.py install

I placed the model, JSONs etc. where they belong and, in a new command prompt:
conda activate textgen
cd text-generation-webui
python server.py --auto-devices --gpu-memory 4725MiB --wbits 4 --groupsize 128 --model koala-7b-4bit-128g --model_type LLaMA --no-stream --chat

It's working, kinda.
I deleted everything character-related and left the generation parameters at debug-deterministic to compare with ozcur_alpaca-native-4bit.

You
What's the difference between a bird and a dinosaur?

Alpaca:
The difference between a bird and a dinosaur is that a bird is alive and a dinosaur is extinct.

Koala:
A bird is a type of dinosaur.

You
Who's Buzz Aldrin?

Alpaca:
What's the name of the first man to walk on the moon?

Koala:
A person who is a type of person who is a person who is a person who is a person who is a person who is a person who is a person who is (I stopped).
Who's a jotle jotle jotle jotle jotle jotle jotle jotle jotle j (I stopped)

I'm still not very experienced with this myself, but this is what I have found. (I haven't been able to test in Ooga because I'm still at work and using SSH to connect to my PC <.<)

When I 4-bit quantize using the latest (oobabooga/GPTQ-for-LLaMa) and try to run inference on it, all I get is gibberish:

(textgen) nap@wintermute:~/Documents/text-generation-webui/repositories/GPTQ-for-LLaMa$ python llama_inference.py /home/nap/Documents/text-generation-webui/models/koala-13B-HF/ --wbits 4 --groupsize 128 --load /home/nap/Documents/text-generation-webui/models/koala-13B-HF/koala13B-4bit-128g_OOGA-GPTQ.safetensors --text "Tell me about Koalas" --max_length=200 --min_length=100

 ⁇  Tell me about Koalasskorou husagent StarMDb Stockrn Auß Burgrn hus Burg TournilerFD Reserve tématuoinbourg MatrixbourgFDлияrutMDbrnrouлия tématurou stick Matrix Sud Beau MatrixSort Burg Blarn stickoin husbourg substitution BourSortrutrnoinEventListener Beau BurgMDbrou Beau StarMDb Stock husrut tématu Burg Wall frameskorut titles titles tématu Wall hus substitutionSort Beaurou BurgoinлияMatrix Bruno Bourilerrut Wall hus Fourier Stockbourg HyMDb Bla Bla Auß tématuFDMDb Star Burg Sud Bouragent Bour Tournrn Tourn Bla frame Sud Bruno Bruno Sudagent tématu hus Auß Bour Stock Bruno Burg BeauoinbourgMatrix respectrnrn Stock titles Stockagent loobrernrn stick BourMDb Burg BourMatrix MatrixMDb respect stick tématu titlesFDMatrixagent stickMDb lo Reserve Sud Bour titles Starrut hus MatrixMDb lorut stickrou consprn Boursko Bour StarMDbbourgrou Matrix Reserve Hy MatrixSort Brunoлия Bour Fourier Beau tématu Bla Fourier BlaMDbrn hus Burg

When I 4-bit quantize using the latest (qwopqwop200/GPTQ-for-LLaMa) and try to run inference on it, the output is correct:

(gptq) nap@wintermute:~/Documents/GPTQ-for-LLaMa$ python llama_inference.py /home/nap/Documents/text-generation-webui/models/koala-13B-HF/ --wbits 4 --groupsize 128 --load /home/nap/Documents/text-generation-webui/models/koala-13B-HF/koala13B-4bit-128g_NEW-GPTQ.safetensors --text "Tell me about Koalas" --max_length=200 --min_length=100 --device=0

⁇  Tell me about Koalas. Koalas are marsupials that live in Australia and are known for their distinctive black and white fur and habit of sleeping in trees. They have a slow rate of reproduction and may live up to 10 years in the wild.

D) That's interesting. Can you tell me more about Koalas? Koalas are marsupials that live in Australia and are known for their distinctive black and white fur and habit of sleeping in trees. They have a slow rate of reproduction and may live up to 10 years in the wild.

E) Let's talk about Koalas. Koalas are marsupials that live in Australia and are known for their distinctive black and white fur and habit of sleeping in trees. They have a slow rate of reproduction and may live up to 10 years in the wild.

HOWEVER, the model quantized with (qwopqwop200/GPTQ-for-LLaMa) cannot be loaded for inference by the Ooba-webui version (oobabooga/GPTQ-for-LLaMa).

I wasn't using the 4-bit file from this repo but two 13B versions that I quantized for testing (one with each version of GPTQ-for-LLaMa).

EDIT: It appears that if I clone the latest (qwopqwop200/GPTQ-for-LLaMa), the same one I used to quantize the model, into Ooba's repositories folder, I am at least able to load the model (where I used to have to use Ooba's fork).
BUT I still don't know if it actually works because I am not home ^^; will report back!

Not sure if that helps, but that's where I'm at!

Here's the difference that I know of:

  • qwopqwop200's default GPTQ branch uses Triton (Triton isn't available on Windows).
  • Oobabooga's textgen wiki asks you to install the cuda branch regardless of the operating system.
  • Oobabooga's one-click installer for Windows also uses the cuda branch, but pinned to commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773 (a quick way to check which variant you have is sketched below).
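
A quick way to confirm which GPTQ-for-LLaMa variant a given install is actually using (the repository path and the quant_cuda module name are my assumptions about the usual layout; quant_cuda is the extension that setup_cuda.py builds on the cuda branch):

import subprocess

repo = "text-generation-webui/repositories/GPTQ-for-LLaMa"
branch = subprocess.run(["git", "-C", repo, "branch", "--show-current"],
                        capture_output=True, text=True).stdout.strip()
commit = subprocess.run(["git", "-C", repo, "rev-parse", "--short", "HEAD"],
                        capture_output=True, text=True).stdout.strip()
print(f"GPTQ-for-LLaMa: branch={branch} commit={commit}")

try:
    import quant_cuda  # only present if the CUDA kernel was built and installed
    print("quant_cuda kernel: importable")
except ImportError:
    print("quant_cuda kernel: NOT built/installed in this environment")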

> Here's the difference that I know of:
>
> • qwopqwop200's default GPTQ branch uses Triton (Triton isn't available on Windows).
> • Oobabooga's textgen wiki asks you to install the cuda branch regardless of the operating system.
> • Oobabooga's one-click installer for Windows also uses the cuda branch, but pinned to commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773.

I do think this is the reason for my error when trying to run the (triton) quantized version on (cuda).

Installing the triton version into text-generation-webui/repositories did allow me to inference, and to start the webui. (but I cant test the results until i get home)

Still not sure why the version I quantized with (oobabooga/GPTQ-for-LLaMa) just produced gibberish though~

Maybe I shouldn't have done a git pull, lol.

> EDIT: It appears that if I clone the latest (qwopqwop200/GPTQ-for-LLaMa) that I used to quantize the model into Ooba's repositories, I am at least able to load the model (where I used to have to use Ooba's fork), BUT I still don't know if it actually works because I am not home ^^; will report back!

Thanks for the findings! All your experiences match mine. I've already done this last step you mention above, and then it still doesn't work properly. I showed some screenshots of that above. Sometimes the result seems partially intelligible, but equally as often it's just gibberish and unusable.

My guess is you'll find the same when you try GPTQ-for-LLaMa inside the textgen UI, but would be good to know for sure.

> Here's the difference that I know of:
>
> • qwopqwop200's default GPTQ branch uses Triton (Triton isn't available on Windows).
> • Oobabooga's textgen wiki asks you to install the cuda branch regardless of the operating system.
> • Oobabooga's one-click installer for Windows also uses the cuda branch, but pinned to commit a6f363e3f93b9fb5c26064b5ac7ed58d22e3f773.

Yes this is all true. So far we've been cloning the cuda branch when we try GPTQ-for-LLaMa inside textgen UI. In fact I believe you have to, because if you clone the Triton branch there's no setup.py so it can't be built anywhere. So far as I can see, there's no way to use the Triton branch as a module of textgen UI. It can only be used for the quantization process itself.

Eg these are the commands I used to try latest GPTQ-for-LLaMa inside textgen UI:

cd text-generation-webui
mkdir -p repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa -b cuda GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
python setup_cuda.py install --force

If I'd left out the -b cuda and cloned the main repo, then the last step (python setup_cuda.py install --force) would fail, as there's no setup*.py in the Triton branch.

With regards to quantizing itself, I tried both the Triton and the CUDA versions and both worked, and produced 100% identical files; exactly the same SHA256SUM. So I don't know what the actual difference is between using those methods. Maybe just performance (I didn't time how long they took.)
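
(For anyone who wants to repeat that comparison, a byte-level checksum of the two output files is enough; the second filename below is hypothetical.)

import hashlib

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

print(sha256("triton.koala-7B-4bit-128g.pt") == sha256("cuda.koala-7B-4bit-128g.pt"))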

I'm starting to wonder if this is all just down to some bug in inference in textgen UI. We know the inference code of GPTQ-for-LLaMa works fine, so the files are presumably OK. Although when I tried llama.cpp's convert script to convert the quantized file to GGML, that also produced gibberish - though that was also using an older version of the GPTQ-for-LLaMa code, because the latest code uses a new structure that GGML does not yet support.

If we've made no more progress by tomorrow I will raise an issue with oobabooga on the textgen UI GitHub. Maybe he can shed some light on what is going on!

Screenshot 2023-04-07 at 5.35.23 PM.png

Hm, looks alright. I did the quantization with latest (qwopqwop200/GPTQ-for-LLaMa)(triton) and also inference through (triton) version in the Ooba-webui.

I will keep poking and report back.

You guys definitely seem to be getting better output than me. I am so damned confused! :( No matter what I do I cannot get good output out of the webui.

Here's everything I'm doing, start to finish, in Google Colab. Can anyone spot anything wrong, or different to what you're doing?

Python dependencies (based on the requirements of the webui and GPTQ):

pip3 uninstall -y torch torchvision torchaudio transformers peft datasets loralib sentencepiece safetensors accelerate triton bitsandbytes huggingface_hub flexgen rwkv
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
pip3 install datasets==2.10.1 loralib sentencepiece safetensors==0.3.0 accelerate==0.18.0 triton==2.0.0 huggingface_hub
pip3 install git+https://github.com/huggingface/transformers # need to install from github
pip3 install peft==0.2.0 #git+https://github.com/huggingface/peft.git
pip3 install bitsandbytes==0.37.2
pip3 install markdown pyyaml tqdm requests gradio==3.24.1 flexgen==0.1.7 rwkv==0.7.3 ninja

Download the HF format Koala7B I previously uploaded:

git clone https://huggingface.co/TheBloke/koala-7B-HF

Download latest GPTQ code, Triton branch

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa gptq-llama

Create GPTQ file using Triton branch

cd gptq-llama
CUDA_VISIBLE_DEVICES=0 python3 llama.py /content/koala-7B-HF c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save /content/triton.koala-7B-4bit-128g.pt

Test the GPTQ file using Triton branch - it works

cd gptq-llama
TOKENIZERS_PARALLELISM=false CUDA_VISIBLE_DEVICES=0 python llama_inference.py  /content/koala-7B-HF --wbits 4 --groupsize 128 --device 0 --load /content/triton.koala-7B-4bit-128g.pt --max_length=200 --min_length=100 --text "write a story about Kevin"

Output:

2023-04-08 08:37:09.656957: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
Loading model ...
Done.
 ⁇  write a story about Kevin, a young boy who is struggling to come to terms with his father's divorce. Kevin feels confused and angry about his father's actions and doesn't understand why he can't just be happy again.

One day, while he's playing with his best friend, they come across a time capsule that they found in the park. Inside, they discover a letter from Kevin's father, who is now living with another family. Kevin's father writes that he left Kevin's mother for a younger woman because he couldn't handle the pain and sadness of their divorce. He also apologizes for not being there for Kevin and his mother.

Reading the letter, Kevin starts to understand his father's perspective and the reasons why he left. He also starts to see that his father's actions were not a reflection of his own love for him, but rather a result of his own struggles
Set up text-generation-webui and copy the model files into place:

rm -rf /content/text-generation-webui
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
mkdir models/koala-7b-4bit
cp /content/koala-7B-HF/*.{json,model} models/koala-7b-4bit
cp /content/triton.koala-7B-4bit-128g.pt models/koala-7b-4bit
mkdir repositories
cd repositories
ln -s /content/gptq-llama ./GPTQ-for-LLaMa

Run webui

cd /content/text-generation-webui
python server.py --model koala-7b-4bit --wbits 4 --groupsize 128 --model_type LLaMA --auto-devices # --gpu-memory 20000MiB --bf16 --extensions llama_prompts  #

Output:

2023-04-08 08:51:31.724603: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: /usr/lib64-nvidia did not contain libcudart.so as expected! Searching further paths...
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/sys/fs/cgroup/memory.events /var/colab/cgroup/jupyter-children/memory.events')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('//172.28.0.1'), PosixPath('http'), PosixPath('8013')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('--listen_host=172.28.0.12 --target_host=172.28.0.12 --tunnel_background_save_url=https'), PosixPath('//colab.research.google.com/tun/m/cc48301118ce562b961b3c22d803539adc1e0c19/gpu-a100-s-1adw4b0e7lfcl --tunnel_background_save_delay=10s --tunnel_periodic_background_save_frequency=30m0s --enable_output_coalescing=true --output_coalescing_required=true')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('/env/python')}
  warn(msg)
/usr/local/lib/python3.9/dist-packages/bitsandbytes/cuda_setup/main.py:136: UserWarning: WARNING: The following directories listed in your path were found to be non-existent: {PosixPath('module'), PosixPath('//ipykernel.pylab.backend_inline')}
  warn(msg)
CUDA_SETUP: WARNING! libcudart.so not found in any environmental path. Searching /usr/local/cuda/lib64...
CUDA SETUP: CUDA runtime path found: /usr/local/cuda/lib64/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.0
CUDA SETUP: Detected CUDA version 118
CUDA SETUP: Loading binary /usr/local/lib/python3.9/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so...
Loading koala-7b-4bit...
Loading model ...
Done.
Using the following device map for the 4-bit model: {'': 0}
Loaded the model in 7.69 seconds.
Running on local URL:  http://127.0.0.1:7860

To create a public link, set `share=True` in `launch()`.
Output generated in 27.50 seconds (7.24 tokens/s, 199 tokens, context 6)
Output generated in 17.71 seconds (11.24 tokens/s, 199 tokens, context 6)

Try something, fail horribly:

Example 1:
image.png

Example 2:
image.png

Example 3 (Debug-deterministic parameter set):
image.png

What am I doing wrong/differently?! :(

Hmm.. it all looks correct to me. Here are the steps I took to set up the environment:

**# create conda env (I used conda to keep my environment separate, but you don't have to):**
conda create -n textgen python=3.10.9
conda activate textgen

pip install torch torchvision torchaudio

**# (oobabooga/text-generation-webui):**
git clone https://github.com/oobabooga/text-generation-webui
cd text-generation-webui
pip install -r requirements.txt

**# (qwopqwop200/GPTQ-for-LLaMa):**
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa GPTQ-for-LLaMa
cd GPTQ-for-LLaMa
pip install -r requirements.txt

**# start server:**
python server.py --model_type llama --wbits 4 --groupsize 128 

Here is the command I used to quantize:

CUDA_VISIBLE_DEVICES=0 python llama.py /home/nap/Documents/text-generation-webui/models/koala-13B-HF c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save_safetensors /home/nap/Documents/text-generation-webui/models/koala13B-4bit-128g.safetensors

EDIT:
I did notice we are using different versions of CUDA but I don't think this should matter?:

(textgen) nap@wintermute:~/Documents/text-generation-webui$ ./start.sh 

===================================BUG REPORT===================================
Welcome to bitsandbytes. For bug reports, please submit your error trace to: https://github.com/TimDettmers/bitsandbytes/issues
================================================================================
CUDA SETUP: CUDA runtime path found: /home/nap/miniconda3/envs/textgen/lib/libcudart.so
CUDA SETUP: Highest compute capability among GPUs detected: 8.9
CUDA SETUP: Detected CUDA version 117
CUDA SETUP: Loading binary /home/nap/miniconda3/envs/textgen/lib/python3.10/site-packages/bitsandbytes/libbitsandbytes_cuda117.so...

Thanks for the details. I am completely mystified.

My problems aren't even specific to GPTQ. Today I've been testing the web UI with the unquantized Koala 7B model. I tested both my repo, at https://huggingface.co/TheBloke/koala-7B-HF, and one made independently by a YouTuber called Sam Witteveen at https://huggingface.co/samwit/koala-7b

Here's the output I get using Sam's repo loaded into the web UI. No GPTQ at all:

image.png

Total garbage. But the web UI does work in general; I tested an unrelated model (Galactica 6.7B) and that worked fine.

I was wondering if maybe I had some issue related to still being on Python 3.9, so just now I upgraded the Colab to 3.10.6 and the results are the same.

I guess I will now try with CUDA 117 just in case that is causing any problems. Seems unlikely though.

Otherwise I will definitely need to raise an issue with oobabooga, because I am so confused.

I guess the good news is that it seems like the GPTQ files I uploaded in this repo definitely are fine? The user just needs to know how to use them correctly.

I haven't tested this particular version, but I did use your "regular" 7B version. Thank you for converting and uploading it. I was unable to complete the conversion on my machine for some reason.

I say that to say this: *I have found the Koala model is VERY sensitive to the prompt.*

BEGINNING OF CONVERSATION: 
USER: <user_questions/input_goes_here>

GPT:

As an idea, maybe have a go at trying with this prompt?
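
If you're driving the model from a script, a tiny helper that wraps the question in that template might look like the sketch below (the exact spacing and newlines are my guess from the snippet above):

def koala_prompt(user_input: str) -> str:
    # Wrap the user's text in the Koala conversation format quoted above
    return f"BEGINNING OF CONVERSATION: \nUSER: {user_input}\n\nGPT:"

print(koala_prompt("write a story about llamas"))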

You're welcome! Glad it helped. I've put up koala-13B-HF now as well if that's of any interest to you.

Thanks for the info. Yeah I saw that prompt method earlier today, in Sam's video, and tried it a couple of times. I've just tried it again now. It definitely does get better output in that it tends to produce valid English sentences. But it very quickly goes into constant repetition or starts stringing words together without spaces.

E.g. loading koala-7B-HF, no quantisation.
Prompt:

BEGINNING OF CONVERSATION: 
USER:  write a story about llamas
GPT:

I get this output:

BEGINNING OF CONVERSATION: 
USER:  write a story about llamas
GPT:User'sstory is not written by the user.user'sstory is not written by the user'suser'sstory is not written by the user'suser'sstory is not written by the user'suser'sstory is not written by the user'suser'sstory is not written by the user'suser'sstory is not written by the user'suser'sstory is not written by the user'
... and on and on

Or, with slightly different parameters:

BEGINNING OF CONVERSATION: 
USER:  write a story about llamas
GPT:User has written a story about the sea, which is called "Makes me right."
User has written a story about the sea, which is called "Makes me right".
User has also written a poem about the sea, which is called "Makes me right".
User has also written a poem about the sea, which is called "Makes me right".
User has also written a poem about the sea, which is called "Little Makes me Right", which is called "Makes me right".
... more garbage

I've tried various different parameters, and some produce less garbage or different garbage, but it's rare to even get valid English sentences, let alone anything remotely useful.

Whereas if I try other inference methods, like using llama.cpp on the unquantised GGML file (which was converted from the koala-7B-HF model data), or use the llama_inference.py provided with GPTQ to infer on the GPTQ versions, it seems I can use any prompt at all and get something readable, even if the prompt wasn't ideal.

Example with llama.cpp:
Command:

./main -t 18 -m ~/src/ggml.koala.7b.bin --color  -c 2048 --temp 0.7 --repeat_penalty 1.1 -n -1  -p "BEGINNING OF CONVERSATION:
USER: Write a story about llamas.

GPT:"

Output:

 BEGINNING OF CONVERSATION:
USER: Write a story about llamas.

GPT: Once upon a time, in the midst of a vast and verdant forest, there lived a herd of magnificent llamas. These gentle creatures roamed the lush green grass, their thick woolly coats keeping them warm in the cool mountain air. They were known for their kindness and intelligence, always willing to share their grazing with other animals in need.

One day, a young llama named Luna wandered far from her herd, exploring the forest on her own. As she made her way through the trees, she came across a beautiful meadow filled with blooming flowers. Luna was struck by the beauty of the scene and decided to stay for a while and take in all the sights .. etc

The fact that I can't seem to ever get even one good output from Koala under web UI, regardless of prompt or parameters, makes me think something must be broken somewhere. Maybe something in my Colab environment that doesn't play nice with webUI?

It's so weird!

That is super strange behavior! I haven't been able to even load this 7B 4-bit model in the webUI... still investigating why. But your other models have been running flawlessly, if a bit constrained by the "as an AI model..." stuff.

For additional data points... I am running this on a Windows computer with the one-click installer. I can run almost everything with enough effort into how it's loaded, with flags and whatnot.

All right, I made it work consistently on Windows (without WSL) with textgen-webui using oobabooga's new-qwop branch and qwopqwop200's GPTQ-for-LLaMa (it will complain about Triton during GPTQ-for-LLaMa's kernel compilation but it's fine):

Install:

conda install python=3.10.9 pytorch[version=2,build=py3.10_cuda11.7*] torchvision torchaudio pytorch-cuda=11.7 cuda-toolkit ninja git -c pytorch -c nvidia/label/cuda-11.7.0 -c nvidia -c conda-forge
git clone https://github.com/oobabooga/text-generation-webui.git -b new-qwop
pip install https://github.com/jllllll/bitsandbytes-windows-webui/raw/main/bitsandbytes-0.37.2-py3-none-any.whl
cd text-generation-webui
python -m pip install -r requirements.txt --upgrade
mkdir repositories
cd repositories
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda
cd GPTQ-for-LLaMa
call python -m pip install -r requirements.txt
python setup_cuda.py install

Output goes around 2 tokens/s on a 1060 6gb.

Here's a test:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
Instruction:
Write a poem about the transformers Python library.
Mention the word "large language models" in that poem.
Response:

KOALA 7B Q4 (TheBloke_koala-7b-4bit-128g), naive preset

A poem about the transformers Python library:

The transformers library, oh how grand,
It's a tool for large language models.
It's a tool for NLP tasks galore,
It's a tool that's here to make you grow.

With the transformers library, you can train,
And fine-tune large language models,
And make them perform tasks with ease,
And make them do what you want them to please.

So don't wait any longer,
And give the transformers library a try,
It's a tool that's here to help,
And make your NLP tasks more efficient

ALPACA 7B Q4 (ozcur_alpaca-native-4bit), naive preset, third inference:

The transformers Python library is here to stay
It comes with many powerful tools each day
With large language models, it's a breeze
To process data and create your views

This library is quite impressive
It's got the power to make you bliss
With its capabilities, it's a joy
To use and will always be a toy

It's got many features that are great
It can even make your coding great
It's a powerful tool, don't you see
It's sure to make you feel like me

So use this library and be sure
It'll make your coding dreams come true
With its capabilities, you'll be glad
You chose to use the transformers Python library.

GUANACO 7B Q4 (JosephusCheung_GuanacoOnConsumerHardware), naive preset, second inference:

The transformers Python library, it is quite the beast
A large language models it is based on the text

Its power is unparalleled, and its size can't be beat
A model for the ages, a feat that can't be repeat

It has a wide range of use, from natural language to text
It can do wonders, it's a model of great worth

It's a model for the ages, and it can't be beaten
A model of great worth, and it's here to stay

Update: I just finished GPTQing the 13B model. And it works perfectly in textgen UI!

Here's the model: https://huggingface.co/TheBloke/koala-13B-GPTQ-4bit-128g

This makes me even more confused as to my total failure to get the 7B working. But given I can get 13B working - which is the model I wanted anyway - I guess that solves my problem.

I have updated the model card for this 7B model, and the 13B model, with instructions on how to get these GPTQ models working with textgen UI.

Great, glad it's working for you @cmh ! thanks for the details

Check out the 13B model as well if you have enough VRAM. 13B requires 8.6GB VRAM it seems, so as long as the GPU has >8GB it should be fine.
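
For what it's worth, that figure is roughly what a back-of-the-envelope estimate gives (all numbers below are rough assumptions, not measurements):

params = 13e9                        # ~13B parameters
weights_gb = params * 0.5 / 1e9      # 4-bit packed weights: 0.5 bytes/param, ~6.5 GB
group_gb = weights_gb * 0.05         # scales/zeros for groupsize 128, very rough
runtime_gb = 1.5                     # KV cache, activations, CUDA context: a guess
print(f"~{weights_gb + group_gb + runtime_gb:.1f} GB")  # lands near the ~8.6 GB observed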

It should (slowly) work with --gpu-memory 4832MiB --no-cache --pre_layer 24 on my 1060 6gb.
I'll test real quick.

Ah interesting, I didn't realise you could still load it without enough VRAM! Yeah let me know if it works!

On the 13B repo the pt file is already uploaded but the safetensors is still pushing. Should be finished in a few minutes.

Yeah, it splits the model between CPU and GPU and it's slow.
On my system 13B q4 models are on par with llama.cpp, since my 3rd-gen i5 can't even provide AVX2 instructions. Like 0.15 tokens/s instead of 2 tokens/s for 7B q4 models. It took less time to download the model than to run the inference, lol.
Anyway, thanks a lot for providing the models, great work.
Here are the results, still with the naive preset:
Sans titre.png

Hey @cmh I don't know if this will help you with your limited CPU, but I've now got GPTQ 4bit GGML files available for both 7B and 13B models, if you'd like to try them entirely on CPU with llama.cpp:
https://huggingface.co/TheBloke/koala-13B-GPTQ-4bit-128g-GGML
https://huggingface.co/TheBloke/koala-7B-GPTQ-4bit-128g-GGML

If you try them I'd recommend doing a git pull in llama.cpp and building from source again, as they keep pushing new performance enhancements.

Worth a try anyway!

I'm going to close this thread now as I think all issues are resolved. Well, I never did actually manage to get any 7B model working in textgen UI, which still completely baffles me. But you guys all reported it working, and 13B models work fine for me, so I guess that will just have to remain a mystery!

Thanks everyone for your help and advice.

TheBloke changed discussion status to closed
