Getting 0 tokens when running with text-generation-webui

#4 opened by avatar8875

Error:
INFO:Loading TheBloke_falcon-7b-instruct-GPTQ...
INFO:Found the following quantized model: models/TheBloke_falcon-7b-instruct-GPTQ/gptq_model-4bit-64g.safetensors
INFO:Using the following device map for the quantized model:
INFO:Loaded the model in 27.56 seconds.

INFO:HTTP Request: POST http://127.0.0.1:7860/api/predict "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/api/predict "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/reset "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/api/predict "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/reset "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/api/predict "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/reset "HTTP/1.1 200 OK"
INFO:HTTP Request: POST http://127.0.0.1:7860/api/predict "HTTP/1.1 200 OK"
Traceback (most recent call last):
  File "/content/drive/MyDrive/text-generation-webui/modules/callbacks.py", line 73, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "/content/drive/MyDrive/text-generation-webui/modules/text_generation.py", line 263, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1568, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2615, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-7b-instruct-GPTQ/modelling_RW.py", line 753, in forward
    transformer_outputs = self.transformer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-7b-instruct-GPTQ/modelling_RW.py", line 648, in forward
    outputs = block(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-7b-instruct-GPTQ/modelling_RW.py", line 385, in forward
    attn_outputs = self.self_attention(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/root/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-7b-instruct-GPTQ/modelling_RW.py", line 279, in forward
    attn_output = F.scaled_dot_product_attention(
RuntimeError: Expected query, key, and value to have the same dtype, but got query.dtype: float key.dtype: float and value.dtype: c10::Half instead.
Output generated in 0.69 seconds (0.00 tokens/s, 0 tokens, context 2, seed 440280390)
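For context, F.scaled_dot_product_attention requires query, key, and value to share one dtype, and the traceback shows the query/key arriving as float32 while the value is float16. A minimal sketch of the failure and the general shape of a workaround (tensor names and shapes here are illustrative, not taken from modelling_RW.py):

import torch
import torch.nn.functional as F

# Illustrative repro: query/key in float32, value in float16 (c10::Half)
# triggers the same RuntimeError seen in the traceback above.
q = torch.randn(1, 8, 16, 64)
k = torch.randn(1, 8, 16, 64)
v = torch.randn(1, 8, 16, 64, dtype=torch.float16)

try:
    F.scaled_dot_product_attention(q, k, v)
except RuntimeError as e:
    print(e)  # Expected query, key, and value to have the same dtype ...

# Workaround sketch: make the dtypes agree before the call. Inside the model
# you would cast query/key down to the value's float16; upcasting the value,
# as done here, just keeps the demo runnable on CPU as well.
out = F.scaled_dot_product_attention(q, k, v.float())
print(out.shape, out.dtype)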

Are you loading this with AutoGPTQ, i.e. passing --autogptq to text-generation-webui?
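
For reference, that invocation would look something like the following, using the model directory from the log above (Falcon also needs --trust-remote-code because it ships custom modelling code):

python server.py --model TheBloke_falcon-7b-instruct-GPTQ --autogptq --trust-remote-code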

I'm getting this after using that flag:
Traceback (most recent call last):
  File "/home/tensax/Downloads/projects/text-generation-webui/server.py", line 74, in load_model_wrapper
    shared.model, shared.tokenizer = load_model(shared.model_name)
  File "/home/tensax/Downloads/projects/text-generation-webui/modules/models.py", line 95, in load_model
    output = load_func(model_name)
  File "/home/tensax/Downloads/projects/text-generation-webui/modules/models.py", line 278, in AutoGPTQ_loader
    import modules.AutoGPTQ_loader
  File "/home/tensax/Downloads/projects/text-generation-webui/modules/AutoGPTQ_loader.py", line 30
    params = {
    ^
IndentationError: unindent does not match any outer indentation level

@avatar8875 That's an indentation error in the webui code. I think if you pull the latest text-generation-webui, it's fixed there.
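
For reference, that IndentationError means a line dedents to an indentation level that doesn't match any enclosing block. A contrived illustration (not the actual AutoGPTQ_loader.py code):

# The 'params' line dedents to 4 spaces, which matches neither the outer
# level (0) nor the block level (8), reproducing the same error class.
bad_source = (
    "def load():\n"
    "        name = 'model'\n"
    "    params = {}\n"
)

try:
    compile(bad_source, "<example>", "exec")
except IndentationError as e:
    print(e)  # unindent does not match any outer indentation level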

Also, --autogptq is no longer necessary; it's on by default. Optionally you can specify --triton, which I find faster than the default CUDA kernels.

--triton works with AutoGPTQ on Falcon? I thought Falcon only worked with AutoGPTQ's CUDA kernels at the moment.

I run WizardLM-Uncensored-Falcon-40B-GPTQ with --triton and AutoGPTQ.

INFO:Loading thebloke_wizardlm-uncensored-falcon-40b-gptq...
INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': True, 'use_safetensors': True, 'trust_remote_code': True, 'max_memory': None, 'quantize_config': None}
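
For what it's worth, those params map directly onto AutoGPTQ's Python API, so the equivalent standalone load is roughly this (a sketch, assuming the Hugging Face repo id; not the webui's actual loader code):

from auto_gptq import AutoGPTQForCausalLM

# Sketch: loading the same quantized model directly with AutoGPTQ,
# mirroring the params the webui logged above.
model = AutoGPTQForCausalLM.from_quantized(
    "TheBloke/WizardLM-Uncensored-Falcon-40B-GPTQ",
    model_basename="gptq_model-4bit--1g",
    device="cuda:0",
    use_triton=True,
    use_safetensors=True,
    trust_remote_code=True,
)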

Getting ~2 t/s.

Without --triton I get ~0.82 t/s.

OK, good to know!

For now, though, I'd recommend people use the new GGMLs. I get 8 t/s with the model fully offloaded to GPU.
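
If you want to try the GGMLs outside the webui, one option is the ctransformers Python library, which supports Falcon GGML models (a minimal sketch; the repo id and gpu_layers value are assumptions, check the model card and tune for your GPU):

from ctransformers import AutoModelForCausalLM

# Sketch: running a Falcon GGML model via ctransformers with GPU offload.
llm = AutoModelForCausalLM.from_pretrained(
    "TheBloke/falcon-7b-instruct-GGML",
    model_type="falcon",
    gpu_layers=50,  # layers to offload to GPU; assumption, adjust per card
)
print(llm("Write one sentence about Falcon 7B:"))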
