The `main` branch for TheBloke/Llama-2-70B-GPTQ appears borked

#3 opened by Aivean

Using the latest oobabooga/text-generation-webui on RunPod. Tried two different GPUs (L40 48GB and A100 80GB) with the ExLlama loader.

The model loads successfully (nothing in the logs), but fails during inference:

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 331, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/workspace/text-generation-webui/modules/exllama.py", line 98, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 186, in gen_begin_reuse
    self.gen_begin(in_tokens)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 171, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 849, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 930, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 470, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 388, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 525, 64, 128]' is invalid for input of size 537600

Interestingly enough, a very small prompt (like 'Hello') works.

Tried other loaders, similar issues. Tried Llama 2 13b, and it worked.

Tried the gptq-4bit-64g-actorder_True quantization on the A100, same error. All settings are default. My steps are literally: start pod, download model, load it, try to generate.

Same error here on an A100 80GB.

There's an architecture change with 70B.
It's an ExLlama and AutoGPTQ issue.

Do you mean there is a difference between 13B and 70B (the former works fine)?

In that case the usage instructions and compatibility info should be updated:
https://huggingface.co/TheBloke/Llama-2-70B-GPTQ#how-to-easily-download-and-use-this-model-in-text-generation-webui

Same issue on 2xA6000.

This is because the number of key/value heads in the attention layers of Llama 70B is different from num_attention_heads (you can check this in the config.json of the model uploaded by Meta). That's why Transformers has a new function named repeat_kv to accommodate this. ExLlama and GPTQ haven't implemented it yet.
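
To make it concrete, here's a rough sketch of the check (I'm assuming the config.json in this repo carries the same head counts as Meta's, i.e. 64 attention heads, 8 key/value heads and hidden_size 8192; verify against your own download):

from transformers import AutoConfig

# Read the head counts straight from the repo discussed in this thread.
config = AutoConfig.from_pretrained("TheBloke/Llama-2-70B-GPTQ")
print(config.num_attention_heads)   # expected: 64 (query heads)
print(config.num_key_value_heads)   # expected: 8  (grouped key/value heads)

# The key/value projections only produce num_key_value_heads * head_dim features,
# so for the 525-token prompt in the traceback the key tensor has
# 1 * 525 * 8 * 128 = 537600 elements, exactly the size that
# view(1, 525, 64, 128) complains about above.
head_dim = config.hidden_size // config.num_attention_heads   # 8192 // 64 = 128
print(1 * 525 * config.num_key_value_heads * head_dim)        # 537600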

Same here on an A100 80GB.

Yes, you need to update Transformers to the latest version. I should have mentioned that in the README, but it was already 4am and I forgot.

Please run:

pip3 install git+https://github.com/huggingface/transformers

and try again.
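
If you want to sanity-check that pip picked up a Llama 2-aware build (as far as I know, the GQA support landed around Transformers 4.31), a quick Python check is:

import transformers
print(transformers.__version__)  # should be 4.31+ or a recent build from main

# repeat_kv only exists in Llama 2-aware versions of the Llama modeling code.
from transformers.models.llama.modeling_llama import repeat_kv
print("repeat_kv found, GQA-aware build installed")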

There is an architectural change to 70b, yes. They added grouped-query attention which needs to be added to ExLlama. It's not a big change, though, and I'm on it, so be patient. Downloading all these models takes a while. And yes, 7b and 13b don't have this change.
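
For anyone wondering what grouped-query attention means for a loader, here's a minimal sketch of the key/value head repetition that Transformers does in its repeat_kv helper (my own simplified version with 70B-like shapes, not ExLlama's actual code):

import torch

def repeat_kv(x: torch.Tensor, n_rep: int) -> torch.Tensor:
    # x: (batch, n_kv_heads, seq_len, head_dim). Repeat each K/V head n_rep times
    # so the 8 K/V heads line up with the 64 query heads (n_rep = 64 // 8 = 8).
    bsz, n_kv_heads, seq_len, head_dim = x.shape
    if n_rep == 1:
        return x
    x = x[:, :, None, :, :].expand(bsz, n_kv_heads, n_rep, seq_len, head_dim)
    return x.reshape(bsz, n_kv_heads * n_rep, seq_len, head_dim)

# Toy usage with 70B-like dimensions:
keys = torch.randn(1, 8, 525, 128)      # 8 K/V heads for a 525-token prompt
print(repeat_kv(keys, 64 // 8).shape)   # torch.Size([1, 64, 525, 128])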

Great, looking forward to it! GfL and AutoGPTQ are slow as shit with this ;)

These turnaround times are amazing, guys. It looks like Llama 2 support was added to ExLlama. What's that, 24 hours since the OG model dropped?

Awesome! Can confirm that after updating text-generation-webui and its pip deps, the ExLlama loader worked! Thanks, everyone!
