Detailed problem, need support <3

#9
by Ukro - opened

Hello :)
Updated oobabooga through the installer,
downloaded GPTQ-for-LLaMa-58c8ab4c7aaccc50f507fd08cce941976affe5e0,
and copied it into repositories.
This is the error log:
File "h:\0_oobabooga\text-generation-webui\modules\text_generation.py", line 290, in generate_with_callback
shared.model.generate(**kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 493, in forward
out = QuantLinearFunction.apply(x.reshape(-1,x.shape[-1]), self.qweight, self.scales,
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\autograd\function.py", line 506, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\cuda\amp\autocast_mode.py", line 106, in decorate_fwd return fwd(*args, **kwargs)
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 407, in forward
output = matmul248(input, qweight, scales, qzeros, g_idx, bits, maxq)
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 380, in matmul248
matmul_248_kernel[grid](input, qweight, output,
NameError: name 'matmul_248_kernel' is not defined
Output generated in 0.38 seconds (0.00 tokens/s, 0 tokens, context 23, seed 1336889877)

Tried the .pt file and the output is really good :) but I'd love to try the ST (safetensors) one.

Sorry, the README instructions were out of date. We don't need to use that commit any more. I have updated the README.

Now, if you are on Linux or WSL2 and can use Triton, you can use the latest commit of GPTQ-for-LLaMa:

# Clone text-generation-webui, if you don't already have it
git clone https://github.com/oobabooga/text-generation-webui
# Make a repositories directory
mkdir text-generation-webui/repositories
cd text-generation-webui/repositories
# Clone the latest GPTQ-for-LLaMa code inside text-generation-webui
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa

Please try again with the latest GPTQ-for-LLaMa commit and let me know!

To update, try:

cd text-generation-webui/repositories
rm -rf GPTQ-for-LLaMa
git clone https://github.com/qwopqwop200/GPTQ-for-LLaMa
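If you want to double-check afterwards which commit you ended up on, something like this from the repositories directory should show it:

git -C GPTQ-for-LLaMa log -1 --oneline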

Wow, thank you for the fast response.
I use Windows, unfortunately.
I downloaded the Triton version as per your link (qwopqwop200).
It gives this error on model load:
bin h:\0_oobabooga\installer_files\env\lib\site-packages\bitsandbytes\libbitsandbytes_cuda117.dll
Loading TheBloke_vicuna-13B-1.1-GPTQ-4bit-128gst...
triton not installed.
triton not installed.
Traceback (most recent call last):
File "h:\0_oobabooga\text-generation-webui\server.py", line 914, in
shared.model, shared.tokenizer = load_model(shared.model_name)
File "h:\0_oobabooga\text-generation-webui\modules\models.py", line 156, in load_model
from modules.GPTQ_loader import load_quantized
File "h:\0_oobabooga\text-generation-webui\modules\GPTQ_loader.py", line 14, in
import llama_inference_offload
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\llama_inference_offload.py", line 4, in
from gptq import GPTQ
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\gptq.py", line 8, in
from texttable import Texttable
ModuleNotFoundError: No module named 'texttable'
Press any key to continue . . .
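(As far as I can tell, that last error just means the texttable Python package is missing from the webui environment; presumably installing it with pip from that same environment would get past it, though I haven't confirmed:)

pip install texttable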

And when I take the cuda branch, it gives this error when I prompted "hello":
File "h:\0_oobabooga\text-generation-webui\modules\callbacks.py", line 66, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "h:\0_oobabooga\text-generation-webui\modules\text_generation.py", line 290, in generate_with_callback
shared.model.generate(**kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\utils_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
return self.sample(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
outputs = self(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
outputs = self.model(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
layer_outputs = decoder_layer(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
File "h:\0_oobabooga\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "h:\0_oobabooga\text-generation-webui\repositories\GPTQ-for-LLaMa\quant.py", line 279, in forward
quant_cuda.vecquant4matmul(x.float(), self.qweight, out, self.scales.float(), self.qzeros, self.g_idx)
TypeError: vecquant4matmul(): incompatible function arguments. The following argument types are supported:
1. (arg0: torch.Tensor, arg1: torch.Tensor, arg2: torch.Tensor, arg3: torch.Tensor, arg4: torch.Tensor, arg5: int) -> None

Invoked with: tensor([[ 0.0097, -0.0423, 0.2747, ..., -0.0144, 0.0021, 0.0083],
[ 0.0172, 0.0039, -0.0247, ..., -0.0062, -0.0020, -0.0056],
[ 0.0144, 0.0142, -0.0514, ..., 0.0037, 0.0072, 0.0195],
...,
[ 0.0117, -0.0166, 0.0213, ..., 0.0200, 0.0124, 0.0093],
[ 0.0219, -0.0053, 0.0230, ..., -0.0189, 0.0629, 0.0051],
[-0.0053, -0.0219, -0.0596, ..., 0.0373, -0.0200, 0.0070]],
device='cuda:0'), tensor([[ 1818725976, 138849433, -1535478587, ..., -1789286762,
2075669608, 1987818610],
[ 2053732233, -1672951194, -2035853562, ..., -1133934939,
901286473, -1369270681],
[-1234789703, 1448792681, -1977252477, ..., -110598569,
564566347, -1382511956],
...,
[ 1461565612, 696546725, -1785359048, ..., -1767143063,
875526984, -375875459],
[-1297377942, -1419229274, 1521908069, ..., -1665447735,
-2127055527, -1432790902],
[ 141912649, -1888199995, -1181763453, ..., 1097831093,
2058911093, -1488278902]], device='cuda:0', dtype=torch.int32), tensor([[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
...,
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.],
[0., 0., 0., ..., 0., 0., 0.]], device='cuda:0'), tensor([[0.0319, 0.0154, 0.0293, ..., 0.0321, 0.0178, 0.0309],
[0.0106, 0.0081, 0.0083, ..., 0.0220, 0.0143, 0.0172],
[0.0074, 0.0074, 0.0071, ..., 0.0160, 0.0173, 0.0110],
...,
[0.0053, 0.0058, 0.0036, ..., 0.0119, 0.0083, 0.0072],
[0.0055, 0.0047, 0.0044, ..., 0.0125, 0.0083, 0.0078],
[0.0043, 0.0049, 0.0042, ..., 0.0097, 0.0085, 0.0066]],
device='cuda:0'), tensor([[ 1719170664, 1484158581, 2004248422, ..., 1720083575,
1987601783, 1986492247],
[-2006485145, 2004309623, 1987467092, ..., 1182168967,
1466398328, 1466402407],
[ 1717921624, 1987475030, 1987475302, ..., -2022156713,
1466328694, 2003003511],
...,
[ 1734834023, 2020051061, 1985443159, ..., 2002090086,
1182103143, 2003203703],
[ 1449682551, 1751611254, 2004182903, ..., 1970759015,
1716947079, 1718974310],
[ 1720145766, -2040043401, 2021950838, ..., 1988523894,
2004248166, 1733785702]], device='cuda:0', dtype=torch.int32), tensor([38, 6, 1, ..., 23, 9, 17], device='cuda:0', dtype=torch.int32)

To be clear, I am just downloading the zip file and putting it into the repositories folder, that's it. I guess that is fine, right? I don't need to change anything in the env or anywhere else? Of course I am renaming the folder, e.g. from GPTQ-for-LLaMa-cuda to GPTQ-for-LLaMa.

Ok yeah, no Triton on Windows unless you use WSL2.

So you must do: git clone -b cuda https://github.com/qwopqwop200/GPTQ-for-LLaMa

That error you got with the CUDA version means that you need to re-compile the CUDA package for GPTQ-for-LLaMa

Please try this:

pip uninstall quant-cuda
cd text-generation-webui/repositories/GPTQ-for-LLaMa
python setup_cuda.py install
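If the build succeeds, a quick import test from the same environment should confirm the compiled extension is visible (quant_cuda is the module name setup_cuda.py builds):

python -c "import quant_cuda; print('quant_cuda OK')"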

And let me know

PS. If you installed WSL2 then you could use Triton, and all of this could be a lot easier! :)

Actually maybe you shouldn't run those commands. You will need a C/C++ compiler and CUDA toolkit installed - do you have those?

If you don't, you will need to stick with the compat.no-act-order file
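If you're not sure, a quick check from the same command prompt should tell you; cl comes with the Microsoft C++ Build Tools and nvcc with the CUDA toolkit (just a sanity check, not required):

where cl
nvcc --version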

Yeah
h:\0_oobabooga\installer_files\env\lib\site-packages\torch\utils\cpp_extension.py:359: UserWarning: Error checking compiler version for cl: [WinError 2] The system cannot find the file specified
warnings.warn(f'Error checking compiler version for {compiler}: {error}')
building 'quant_cuda' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

Will check what WSL2 is ^^

Lol, Linux subsystem in Windows, going for that right away xD
Need to do some house stuff, I'll be back when I have something to report <3 thank you!

OK yeah, please go back to using the compat.no-act-order file. You can't use the act-order file without being able to compile the latest CUDA code.

WSL2 is Linux on Windows. You can install it from Microsoft, and with a bit of work it can support your CUDA GPU:

https://learn.microsoft.com/en-us/windows/wsl/install

https://docs.nvidia.com/cuda/wsl-user-guide/index.html
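Roughly, the Windows side of the setup should just be the following; once inside the WSL/Ubuntu shell, nvidia-smi should show your GPU if the Windows NVIDIA driver is recent enough (a rough sketch, see the two links above for the full details):

# from an administrator PowerShell on Windows
wsl --install
# then, inside the WSL/Ubuntu shell
nvidia-smi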

Can you please advise: when you said "PS. If you installed WSL2 then you could use Triton, and all of this could be a lot easier! :)", did you mean that I need to run oobabooga itself in WSL2, or just the Triton part? <3

I am also trying the CUDA version with these commands:
pip uninstall quant-cuda
cd repositories/GPTQ-for-LLaMa
python setup_cuda.py install
I have installed the C++ tools.
The build is doing okay.
The model is loading okay, but gives gibberish output :-/
Will try again with a different GPTQ-for-LLaMa.

It's working!!! Thank you very much.
So in the last post I had the wrong GPTQ-for-LLaMa, as I have tested a lot of variations and lost count.
So the solution is as you described, with the cuda branch.
Thank you again!!!!!!! Peace! <3
For further reference:
https://stackoverflow.com/questions/64261546/how-to-solve-error-microsoft-visual-c-14-0-or-greater-is-required-when-inst
Edit start-webui.bat for the first run:
Change this:
@rem set default cuda toolkit to the one in the environment
set "CUDA_PATH=%INSTALL_ENV_DIR%"
to this:
@rem set default cuda toolkit to the one in the environment
set "CUDA_PATH=%INSTALL_ENV_DIR%"
pip uninstall quant-cuda
cd repositories/GPTQ-for-LLaMa
python setup_cuda.py install
pause
cd ..
cd ..

After the first run, the added lines can be removed.

Ukro changed discussion status to closed


Glad you've got it working now!

FYI I meant you should run everything in WSL2 - oobabooga with GPTQ-for-LLaMa using Triton. So you would be running the whole ooba UI inside Linux.

But if you've got it working in Windows now, that's great!

Thank you for the explanation <3
In the future I will definitely move to WSL2.
Downloading your stable-vicuna for testing :>

FYI :)
StableVicuna is NOW the UNSTOPPABLE 13B LLM KING! Bye Vicuna!
https://youtu.be/QeBmeHg8s5Y
