Output always 0 tokens

#4
by sterogn - opened

Hello, I have been trying to get an answer from this model.

If I load with AutoGPTQ, I receive no error when loading or when asking, but the result is always:
Output generated in 0.42 seconds (0.00 tokens/s, 0 tokens, context 61, seed 1544358233)
Output generated in 0.42 seconds (0.00 tokens/s, 0 tokens, context 32, seed 168168177)

And if I load the model with ExLlama, loading works without error, but asking gives:
RuntimeError: shape '[1, 46, 64, 128]' is invalid for input of size 47104
Output generated in 0.46 seconds (0.00 tokens/s, 0 tokens, context 47, seed 2009475660)

I have updated my environment today in case there were updates for these new models.

Running an Nvidia H100 with CUDA 11.8 on Ubuntu (kernel 5.15.0).

Any tips on how to make this work?

Please update to the latest Transformers GitHub code to fix compatibility with AutoGPTQ and GPTQ-for-LLaMa. ExLlama won't work yet, I believe.

pip3 install git+https://github.com/huggingface/transformers
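
To confirm the development build is active, a quick check like this should print the installed version (an install from GitHub typically ends in ".dev0"):

python3 -c "import transformers; print(transformers.__version__)"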

I have updated the README to reflect this. I should have added it last night, but I didn't get these uploaded until 4am and I forgot.

Thank you very much for the response, and your awesome work!

This did not change anything for me.
For now I can load the regular 70B-chat model converted to HF in 4-bit (I'm not getting it to run in 8-bit), so I guess this is something else with my environment! I will continue testing.

Apologies, I discovered what the issue is. A special setting is required for AutoGPTQ. I have updated the README to reflect this.

  • If using text-generation-webui, please tick the box no_inject_fused_attention in the AutoGPTQ loader settings. Then save these settings and reload the model.
  • If using Python code, add inject_fused_attention=False into the .from_quantized() call, like so:
from auto_gptq import AutoGPTQForCausalLM

# model_name_or_path, model_basename and use_triton are defined as in the README example.
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=False,
        inject_fused_attention=False,  # required for the 70B model
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)
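
For reference, a minimal generation sketch once the model is loaded as above (assuming a tokenizer from the same model_name_or_path; the prompt and sampling settings are only illustrative):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

prompt = "Tell me about AI"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to("cuda:0")

# Sampled generation; adjust temperature and max_new_tokens to taste.
output = model.generate(input_ids=input_ids, do_sample=True, temperature=0.7, max_new_tokens=128)
print(tokenizer.decode(output[0]))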

Now it should work.

There we go!

Awesome, appreciate the help.

@TheBloke
I followed your steps, but it didn't work. It did produce some output, and it didn't crash. However, the output is just gibberish.

My env:
text-generation-webui + AutoGPTQ

I'm having a similar issue with the 70b model. I checked the box no_inject_fused_attention in the AutoGPTQ loader settings.
Still getting this error: NameError: name 'autogptq_cuda_256' is not defined

@TheBloke
I followed your steps, but it didn't work. It did produce some output, and it didn't crash. However, the output is just gibberish.

My env:
text-generation-webui + AutoGPTQ

Gibberish implies the quantisation settings are wrong. I did have a problem this morning where my scripts uploaded duplicate models to some branches. Please show a screenshot of your model folder.

@PLGRND that's a different problem, a local AutoGPTQ install problem. It means that AutoGPTQ is not properly built. Try this:

pip3 uninstall -y auto-gptq
GITHUB_ACTIONS=true pip3 install -v auto-gptq==0.2.2

If you continue to have problems, please report it on the AutoGPTQ GitHub, as it's not specific to this model.
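
If you want to verify whether the CUDA extension actually built after reinstalling, a quick sanity check like the sketch below can help. The module names are assumptions on my part (they differ between AutoGPTQ versions); autogptq_cuda_256 is the one from the error above.

import importlib

# Hypothetical check: try importing the compiled CUDA extension modules.
# Names vary by AutoGPTQ version, so a missing one is not necessarily a
# problem as long as at least one imports.
for name in ("autogptq_cuda_256", "autogptq_cuda_64", "autogptq_cuda"):
    try:
        importlib.import_module(name)
        print(f"{name}: built")
    except ImportError:
        print(f"{name}: not built")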

@TheBloke

I cloned text-generation-webui fresh, at the commit with the updated Transformers version. I'm using the 'main' branch of the model (I just pasted 'TheBloke/Llama-2-70B-chat-GPTQ' and clicked "Download").

I also checked 'no_inject_fused_attention' in text-generation-webui.

Still getting this error:

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/workspace/text-generation-webui/modules/text_generation.py", line 297, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 423, in generate
    return self.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 195, in forward
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py", line 249, in forward
    out = out.half().reshape(out_shape)
RuntimeError: shape '[1, 139, 8192]' is invalid for input of size 142336


Please help me

Note: the 13B model is working fine!

I'm having a similar issue with the 70b model. I checked the box no_inject_fused_attention in the AutoGPTQ loader settings.
Still getting this error: NameError: name 'autogptq_cuda_256' is not defined

Tick the Triton box.

For people still having trouble with text-generation-webui: ExLlama was updated recently, so I suggest you use that instead. It's quicker and uses less VRAM anyway. The README has instructions.
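
As a rough example (the exact flags depend on your text-generation-webui version, so treat this as a sketch), launching the webui with the ExLlama loader would look something like this, assuming the model was downloaded to models/TheBloke_Llama-2-70B-chat-GPTQ:

python server.py --model TheBloke_Llama-2-70B-chat-GPTQ --loader exllama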

The autogptq_cuda_256 is not defined error means that the AutoGPTQ CUDA extension hasn't compiled, which is unfortunately a very common problem with AutoGPTQ at the moment.

This might fix it:

pip3 uninstall -y auto-gptq
GITHUB_ACTIONS=true pip3 install -v auto-gptq

But it doesn't work for everyone, and if it doesn't, fixing it is beyond the scope of these Discussions; please post about it on the AutoGPTQ GitHub.

Hello, I'm trying to load the 70B model on GPUs (7x GTX 1080 Ti, compute capability 6.1).
The loader I used is AutoGPTQ, with the options "--no_use_cuda_fp16" and "--disable_exllama" added.
I'm also running oobabooga (text-generation-webui), built as a Docker image.

The 70B-chat-GPTQ model loads fine, but when I try to run inference it always gives me 0-token output.

Do you have any recommendations?
