Cannot load the model

#2
by horstao - opened

When trying to load the model directly from HF, I get this error:

OSError: TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack
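For reference, a minimal script that reproduces the error (nothing beyond the repo name above is assumed):

# Minimal reproduction of the OSError above; the explanation is presumably what I'm missing.
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g")
# OSError: TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g does not appear to have a file named
# pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack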

I already have the Vicuna-13b 1.1 model downloaded, so I tried running the llama.py command:

python llama.py \
<path-to-vicuna-13b> c4 --wbits 4 \
--true-sequential --act-order --groupsize 128 \
--save_safetensors vicuna-13B-1.1-GPTQ-4bit-128g.safetensors

And I get this error:

Token indices sequence length is longer than the specified maximum sequence length for this model (3908 > 2048). Running this sequence through the model will result in indexing errors
Starting ...

KeyError: 'position_ids'

Could you give a hint as to what the problem is?

I managed to load the model, but I'm only getting gibberish responses.

Could you give a hint as to what the problem is?
I've never tried loading it directly from HF. Are you loading it in text-generation-webui, and have you set the --wbits and --groupsize settings? (See the example command below.)

As for the GPTQ error, not sure, I've never seen that KeyError. Something is set up wrong, but I can't say what.
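For reference, if you are loading in text-generation-webui, a typical launch command with those settings would be something like this (flag names as in the webui README of the time; the model directory name is assumed to match whatever you saved it under):

python server.py --model vicuna-13B-1.1-GPTQ-4bit-128g --wbits 4 --groupsize 128 --model_type llama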

I managed to load the model, but I'm only getting gibberish responses.

Please see the README. This means you're using the safetensors file without the latest GPTQ-for-LLaMa code. Either update GPTQ-for-LLaMa, or else use the pt file instead.

Also, if you don't know, you need to move your GPTQ-for-LLaMa off the cuda branch and onto the triton (main) branch. You'll probably want to re-clone the GPTQ repo in the repositories folder if you're using oobabooga, and reinstall all of its requirements. That's how I stopped the gibberish.

Also, if you don't know, you need to move your GPTQ-for-LLaMa off the cuda branch and onto the triton (main) branch. You'll probably want to re-clone the GPTQ repo in the repositories folder if you're using oobabooga, and reinstall all of its requirements. That's how I stopped the gibberish.

You shouldn't have to. The CUDA branch will work, as long as you rebuild the CUDA kernel with python setup_cuda.py install --force. But yes the Triton branch is preferable.
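The rebuild is roughly the following (a sketch, assuming the usual repositories/GPTQ-for-LLaMa layout inside text-generation-webui):

cd repositories/GPTQ-for-LLaMa
git pull
python setup_cuda.py install --force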

I have your model running in the latest text-generation-webui version with the original GPTQ-for-LLaMa (cuda) that you specified for Windows in the readme, and the latest dependencies. However, with this GPTQ, text-generation-webui is frustratingly slower on my system (GTX 1080 Ti) for some reason; this affects both models below.
With oobabooga's GPTQ-for-LLaMa fork installed and the seemingly similar (or maybe not?) "jeremy-costello/vicuna-13b-v1.1-4bit-128g" model, token generation on my system is several times faster, but if I try to run TheBloke's model with this fork installed, I get the same gibberish output (token generation speed remains the same, though) that many people have already faced.

It seems that, at a minimum, we need to inform the text-generation-webui developers about this problem and get them to fix the oobabooga/GPTQ-for-LLaMa fork (the one that definitely works with CUDA on Windows when the webui is installed via the "one-click" method), among other things.

Hmm that's odd. When did you download the model files from my repo here?

There was an issue with my Vicuna-13B-1.1-HF repo, caused by a bug in the Transformers code for converting from the original Llama 13B to HF format. My HF repo was 50% too big as a result. I fixed that about 20 hours ago. I did not think it would affect my GPTQ conversions, but just in case I also re-did the GPTQs.

So when did you download the files? If it was more than 20 hours ago, could you try downloading them again and let me know if you see the same slowdown vs Jeremy's repo?

I looked at his repo and so far as I can see, he used exactly the same parameters as I did.

Hmm that's odd. When did you download the model files from my repo here?

I cloned your model repository 4 hours ago, so I already have the latest version. I checked all possible options, reinstalled all the text-generation-webui dependencies and GPTQ-for-LLaMa, and with the default qwopqwop200/GPTQ-for-LLaMa (cuda branch) your model (and also the version from jeremy-costello) seems to work fine, though slower than with the webui-adapted version of GPTQ. That version of GPTQ is certainly not optimised, but I don't know what I can do about it.

What I can't understand is why Jeremy's model would be noticeably faster than mine. So far as I can tell they've been created with identical parameters, and using the same GPTQ-for-LLaMa commit.

The only difference I can notice is that his is a pt file and mine is safetensors. I thought safetensors was meant to be as fast or even faster, but I don't know for sure. I do have the no-act-order.pt file - did you try that as well?

What I can't understand is why Jeremy's model would be noticeably faster than mine. So far as I can tell they've been created with identical parameters, and using the same GPTQ-for-LLaMa commit.

If you compare the performance of both models on one GPTQ version, and then both on the other, then no, his is not faster.

The only difference I can notice is that his is a pt file and mine is safetensors. I thought safetensors was meant to be as fast or even faster, but I don't know for sure.

There are already so many vicuna-13B-v1.1-GPTQ-4bit-128g models on Hugging Face that it's driving me nuts.
But even so, no: using the same dependencies, the same webui version and the same GPTQ version, the generation speed barely differs from one model to another.

I do have the no-act-order.pt file - did you try that as well?

No, but I will try it now.

Oh! Well if you're getting the same performance on my model as you are on Jeremy's, then there's no mystery.

If your problem is that you wish to use ooba's fork and you cannot do that with the safetensors file, then yes the solution is to use the no-act-order.pt which will work fine for you.

As to why you get poor performance when using the later GPTQ-for-LLaMa code - did you run python setup_cuda.py install? If you installed the later GPTQ-for-LLaMa but didn't compile the CUDA kernel, that would certainly explain a significant slowdown.

But if in doubt, just use the no-act-order.pt file which will work with any version of GPTQ-for-LLaMa.
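One quick way to check that the compiled kernel is actually visible to your environment (the extension module built by setup_cuda.py is assumed to be named quant_cuda, matching the wheel filename floating around):

python -c "import quant_cuda; print('CUDA kernel OK')"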

As to why you get poor performance when using the later GPTQ-for-LLaMa code - did you run python setup_cuda.py install? If you installed the later GPTQ-for-LLaMa but didn't compile the CUDA kernel, that would certainly explain a significant slowdown.

I uninstalled everything completely and ran the commands "git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b cuda" and "python setup_cuda.py install --force" via install.bat (setup_cuda.py was in the GPTQ-for-LLaMa folder). This .bat can also install from the wheel "https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl" if CUDA kernel compilation fails because the Visual C++ Build Tools are not installed, but I had them anyway.
The process completed successfully, with no errors. But generating tokens with this GPTQ, no matter what model I ran, was for reasons unknown to me slower (at 100% GPU load, though) than if I ran "git clone https://github.com/oobabooga/GPTQ-for-LLaMa.git -b cuda" and "python setup_cuda.py install --force".
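For what it's worth, installing that prebuilt kernel manually (assuming Python 3.10 on Windows, as the wheel name implies) is just:

pip install https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl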

But if in doubt, just use the no-act-order.pt file which will work with any version of GPTQ-for-LLaMa.

I checked vicuna-13B-1.1-GPTQ-4bit-128g.no-act-order.pt on the oobabooga fork. Everything is fine, no problems. As soon as I change the model to vicuna-13B-1.1-GPTQ-4bit-128g.safetensors, I immediately get gibberish as output.
The only thing that bothers me a bit is that the no-act-order.pt version may, in theory (going by your own words), produce somewhat worse results.


Here are some tests of these models.
t-g-webui (ooba-GPTQ (cuda) repository), default settings and prompt. 100% GPU load everywhere.

"Hello, say something about yourself."

Your versions.

no-act-order.pt:

Output generated in 7.64 seconds (8.64 tokens/s, 66 tokens, context 41, seed 1229294013)

.safetensors:

(gibberish) output generated in 21.44 seconds (9.28 tokens/s, 199 tokens, context 41, seed 263917108)

jeremy-costello version.

vicuna-13b-v1.1-4bit-128g (.pt):

(gibberish... wait, what? Did I break something again...) Output generated in 20.82 seconds (9.56 tokens/s, 199 tokens, context 42, seed 1443451655)


t-g-webui (original GPTQ (cuda) repository), default settings and prompt. 100% GPU load everywhere.

"Hello, say something about yourself."

Your versions.

no-act-order.pt:

Output generated in 15.99 seconds (2.13 tokens/s, 34 tokens, context 41, seed 198665102)

.safetensors:

Output generated in 21.75 seconds (1.98 tokens/s, 43 tokens, context 41, seed 863730907)

jeremy-costello version.

vicuna-13b-v1.1-4bit-128g (.pt):

Output generated in 58.74 seconds (2.15 tokens/s, 126 tokens, context 41, seed 47935661)

act-order was one of two new methods added to GPTQ to improve the quantisation quality. To be exact, they said this about it:

Two new tricks: --act-order (quantizing columns in order of decreasing activation size) and --true-sequential (performing sequential quantization even within a single Transformer block). Those fix GPTQ's strangely bad performance on the 7B model (from 7.15 to 6.09 Wiki2 PPL) and lead to slight improvements on most models/settings in general.

Given you're looking at the 13B model, it seems only 'slight' improvements are to be expected; they give no further details on what that means in practice.

So I would say that ideally one would use --act-order if possible, but if you can't, there's no reason to think you're going to get "bad" output on a 13B model.
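For anyone quantising themselves, the compatible file is essentially the same llama.py command shown earlier with --act-order dropped; a sketch (output filename illustrative, --save writes a .pt checkpoint):

python llama.py \
<path-to-vicuna-13b> c4 --wbits 4 \
--true-sequential --groupsize 128 \
--save vicuna-13B-1.1-GPTQ-4bit-128g.no-act-order.pt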

I have no idea why you're getting slow output on the re-compiled CUDA branch. I noticed no performance problems when I tested the CUDA branch on Linux. The most likely explanation would seem to be that it's not using the compiled CUDA kernel for some reason, but that's just a guess and as I've never tried any of this on Windows I couldn't hazard a guess as to what else might be wrong.

I just saw your edit. I'd say it's expected that you'd get gibberish from Jeremy's model. I was surprised when you said that you didn't. According to his README he's using the same settings as me, including --act-order, which means it requires newer GPTQ-for-LLaMa code to use. He has not provided a no-act-order version like I did.

By the way, in case you didn't know you'll usually get higher quality responses if you use this prompt template:

Below is an instruction that describes a task. Write a response that appropriately completes the request.
### Instruction:
prompt goes here
### Response:
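A minimal sketch of filling that template in code (the prompt text is just the example from your tests):

# Hypothetical helper: wrap a user message in the template quoted above.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n"
    "### Instruction:\n"
    "{instruction}\n"
    "### Response:\n"
)

def build_prompt(instruction: str) -> str:
    # Insert the user's instruction into the fixed template.
    return TEMPLATE.format(instruction=instruction)

print(build_prompt("Hello, say something about yourself."))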

@Morivy - just wanted to point out that there is also the "610fdae wheel compiled from latest (as of writing) commit of GPTQ-for-LLaMa" at https://github.com/jllllll/GPTQ-for-LLaMa-Wheels/raw/main/610fdae6588c2b17bcf2726cacaaf795cd45077e/quant_cuda-0.0.0-cp310-cp310-win_amd64.whl (from https://github.com/jllllll/GPTQ-for-LLaMa-Wheels).

Your issue is described here: https://github.com/qwopqwop200/GPTQ-for-LLaMa/issues/128
Also, cf. README "Unless you are using 3bit, i recommend using a branch that currently supports triton." - https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/cuda

I recommend you switch to the triton branch via WSL.
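A sketch of that switch from inside WSL (branch name taken from the discussion above; adjust if triton has since been merged into main):

git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa.git -b triton
cd GPTQ-for-LLaMa
pip install -r requirements.txt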

@Morivy @Thireus
Sadly, qwop's cuda branch has huge performance regressions after implementing the ability to use true-sequential/act-order/groupsize simultaneously, even when you don't run a model that uses all three together. With development now focusing on Triton (despite its lack of Windows compatibility and of support for pre-RTX Nvidia cards), that has left a lot of people like myself stuck on older versions of GPTQ.

Neither WSL2 nor Linux have proven to be great solutions for me. WSL2 takes 7+ minutes to load a 30b model into VRAM even when I configure it for full access to system RAM and a huge swap space. Running natively on Linux with the Triton branch of GPTQ on Ooba causes a 4bit non-groupsized 30b model to run out of VRAM at only 1300 ctx on my 3090, even when the only other active process using VRAM is X (using only 4MiB). The same model on the same system is fine at full ctx with an older GPTQ cuda build on Windows.

The no-act-order model provided here works fine in the Ooba fork of GPTQ. Unfortunately, it's too new to work in the 0cc4m fork of GPTQ.

Model is worthless and spits out gibberish.

Model is worthless and spits out gibberish.

image.png

I just added instructions for easily downloading the model with a few clicks from text-generation-webui

I also renamed the model files so the compatible file will load in preference to the one that requires latest GPTQ-for-LLaMa code.

Delete the model you already downloaded and follow these instructions:

image.png
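If you'd rather use the command line than the clicks shown in the screenshot, text-generation-webui's bundled download script should fetch the same files, roughly:

python download-model.py TheBloke/vicuna-13B-1.1-GPTQ-4bit-128g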

I just added instructions for easily downloading the model with a few clicks from text-generation-webui

I updated already. Still getting gibberish.

Alright, I had the same problem with gibberish but did not understand the explanation to update GPTQ-for-LLaMa, as at first it seemed I was already up to date with the latest version.
But I realized that my repository in repositories/GPTQ-for-LLaMa was in fact pointing to a fork for oobabooga, https://github.com/oobabooga/GPTQ-for-LLaMa, which had not been updated for some time.
The solution then was to rename the existing GPTQ-for-LLaMa folder to something else, then clone from the correct repo:
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa

I then did a pip install -r requirements.txt --upgrade in my Python env, and after restarting the server it now seems to work correctly.
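Putting those steps together, a rough sketch (paths assumed to be a default text-generation-webui layout):

cd text-generation-webui/repositories
mv GPTQ-for-LLaMa GPTQ-for-LLaMa-ooba-backup
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
pip install -r GPTQ-for-LLaMa/requirements.txt --upgrade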

Alright, I had the same problem with gibberish but did not understand the explanation to update GPTQ-for-LLaMa, as at first it seemed I was already up to date with the latest version.

Something must be up; I followed your instructions to a tee and I'm still getting gibberish.

Pip install is 2.23.1, einops has been updated, oobabooga updated... what am I missing?

What model files do you have in my model dir?

If you updated GPTQ-for-LLaMa like leszekhanusz described then you should be able to use either model file. But try removing the latest.safetensors file and trying again. Then you'll be using the compatible file which should work with any/all versions of GPTQ-for-LLaMa.

What model files do you have in my model dir?

Finally got it to work... I don't know if it was because I removed the old "repositories/GPTQ-for-LLaMa" folder completely or because I put the "vicuna-13B-1.1-GPTQ-4bit-128g.compat.no-act-order" file in the models folder, but whatever did it, it's finally not putting out gibberish.

Great, glad it's working now!
