Problems on runpod.io

#3
by olafgeibig - opened

What am I doing wrong? @TheBloke, I have used your runpod template countless times and always managed to run prompts, but this time I'm failing. I'm using an A6000 instance on runpod with the thebloke/cuda11.8.0-ubuntu22.04-oneclick:latest image. I followed the instructions on the model card closely and used the prompt template. At first glance everything is fine: the model is loaded and about 70% of the 48 GB of VRAM is used. But when I hit Generate there is no reaction besides my prompt being copied over to the output. There is also no activity visible on the CPU and GPU utilization gauges.

It's my fault: I've not yet updated that template for the latest ExLlama changes required for Llama 2 70B.

Well, I did update the image, but I never tested it or pushed it to the :latest tag.

Could you test it for me? Edit your template or apply a template override, and change the Docker container to thebloke/cuda11.8.0-ubuntu22.04-oneclick:21072023

Then test again and let me know. If it works, I will push that to :latest and then it will be the default with my template for all users.

Unfortunately, the behavior is still the same. Can it be correct that after selecting the model the loader is auto-configured to AutoGPTQ with wbits set to None? I tried setting it to 4 and reloading the model, but it didn't change anything. I also tried ExLlama, but that didn't work either.

You want to use ExLlama; it'll be much faster. I didn't realise you were using AutoGPTQ, as most people use ExLlama these days.

AutoGPTQ can be used, but you have to tick "no inject fused attention" in the Loader settings. And yes, it's correct that wbits is set to None for AutoGPTQ; leave it at None (it's automatically read from quantize_config.json).

So:

  1. please try Loader = ExLlama with the updated container and let me know if that works
  2. If you have time, I'd be grateful if you also tested Loader = AutoGPTQ with "no_inject_fused_attention" ticked (again with the updated container)

(make sure to click Reload Model after changing the Loader and any loader settings)
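
For reference, these loader settings map onto AutoGPTQ's Python API roughly as sketched below. This is only a sketch, not what the web UI runs verbatim: the keyword arguments mirror the "AutoGPTQ params" line that text-generation-webui prints later in this thread, while the local model path and the prompt are assumptions.

# Minimal sketch of loading the GPTQ model with AutoGPTQ directly, with fused
# attention disabled (the "no_inject_fused_attention" checkbox in the web UI).
from auto_gptq import AutoGPTQForCausalLM
from transformers import AutoTokenizer

# Assumed local path: where the web UI downloads the model in the logs below.
model_dir = "models/TheBloke_FreeWilly2-GPTQ"

tokenizer = AutoTokenizer.from_pretrained(model_dir, use_fast=True)

# Keyword names mirror the "AutoGPTQ params" log line further down in this thread.
model = AutoGPTQForCausalLM.from_quantized(
    model_dir,
    model_basename="gptq_model-4bit--1g",
    device="cuda:0",
    use_safetensors=True,
    use_triton=False,
    inject_fused_attention=False,  # equivalent of ticking "no_inject_fused_attention"
    # wbits and group size are read from quantize_config.json automatically,
    # which is why the web UI leaves wbits at None.
)

prompt = "### System:\nYou are a helpful assistant.\n\n### User:\nHello!\n\n### Assistant:\n"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
output = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(output[0], skip_special_tokens=True))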

Well, I tried ExLlama first, but it didn't work. Then I read the instructions on the model card again; you wrote that the loader is auto-configured from the config file, so I tried that too, and it set the loader to AutoGPTQ. Okay, let me try again to be sure.

Okay, unfortunately there is no change when using ExLlama. I also configured AutoGPTQ the way you recommended and still see the same behavior: just no reaction at all to the prompt.
[Screenshots attached: IMG_0574.png, IMG_0575.png, IMG_0577.png]
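
One way to narrow this down is to take the browser UI out of the picture and hit the blocking API that the container starts with --api (the endpoints appear in the text-generation-webui log further down). A minimal sketch, assuming the default port and the legacy /api/v1/generate request/response shape of that era's text-generation-webui:

# Sketch only: exercise the backend directly over the legacy blocking API.
# The host/port and the /api/v1/generate payload/response shape are assumptions
# based on text-generation-webui defaults from this period; adjust as needed.
import requests

URI = "http://localhost:5000/api/v1/generate"

payload = {
    "prompt": "### User:\nSay hello.\n\n### Assistant:\n",
    "max_new_tokens": 64,
}

response = requests.post(URI, json=payload, timeout=120)
response.raise_for_status()
print(response.json()["results"][0]["text"])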

Do you know how to SSH in? Or use the Web Terminal?

If you do, can you do:

tail -100 /workspace/logs/*

and copy that output and paste it here

If not, I will try to check it myself a little later.

I'm an avid Linux user, so no problem. Unfortunately I'm busy for the next couple of hours. I looked for logs in /var/log; I didn't know the app logs go to /workspace.

Since Krassmann is busy, I've tested as well, and I can confirm that thebloke/cuda11.8.0-ubuntu22.04-oneclick:21072023 does not work out of the box. Here are the contents of build-llama-cpp-python.log:

WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv
This system supports AVX2.
Collecting llama-cpp-python
  Downloading llama_cpp_python-0.1.74.tar.gz (1.6 MB)
     ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 1.6/1.6 MB 13.2 MB/s eta 0:00:00
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
  Preparing metadata (pyproject.toml): started
  Preparing metadata (pyproject.toml): finished with status 'done'
Requirement already satisfied: typing-extensions>=4.5.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (4.7.1)
Requirement already satisfied: numpy>=1.20.0 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (1.24.4)
Requirement already satisfied: diskcache>=5.6.1 in /usr/local/lib/python3.10/dist-packages (from llama-cpp-python) (5.6.1)
Building wheels for collected packages: llama-cpp-python
  Building wheel for llama-cpp-python (pyproject.toml): started
  Building wheel for llama-cpp-python (pyproject.toml): finished with status 'done'
  Created wheel for llama-cpp-python: filename=llama_cpp_python-0.1.74-cp310-cp310-linux_x86_64.whl size=1330178 sha256=5f451ec3e0600060c27bb8f82154947e461dd058485872b3cb4f332df5b54040
  Stored in directory: /tmp/pip-ephem-wheel-cache-b6nmb0k4/wheels/e4/fe/48/cf667dccd2d15d9b61afdf51b4a7c3c843db1377e1ced97118
Successfully built llama-cpp-python
Installing collected packages: llama-cpp-python
Successfully installed llama-cpp-python-0.1.74
WARNING: Running pip as the 'root' user can result in broken permissions and conflicting behaviour with the system package manager. It is recommended to use a virtual environment instead: https://pip.pypa.io/warnings/venv

[notice] A new release of pip is available: 23.1.2 -> 23.2.1
[notice] To update, run: python3 -m pip install --upgrade pip

Here are the contents of text-generation-webui.log after trying both exllama and AutoGPTQ with no_inject_fused_attention:

Launching text-generation-webui with args: --listen --api
bin /usr/local/lib/python3.10/dist-packages/bitsandbytes/libbitsandbytes_cuda118.so
Starting streaming server at ws://0.0.0.0:5005/api/v1/stream
Starting API at http://0.0.0.0:5000/api
Running on local URL:  http://0.0.0.0:7860

To create a public link, set `share=True` in `launch()`.
Downloading the model to models/TheBloke_FreeWilly2-GPTQ
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 7.02k /7.02k  19.4MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 15.3k /15.3k  49.2MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 4.77k /4.77k  15.1MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 679   /679    2.67MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 137   /137    535kiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 35.3G /35.3G  352MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 183   /183    726kiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 411   /411    1.63MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 1.84M /1.84M  3.67MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 500k  /500k   17.7MiB/s
100%|β–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆβ–ˆ| 649   /649    2.69MiB/s
100.64.0.24 - - [23/Jul/2023 13:40:21] code 404, message Not Found
100.64.0.24 - - [23/Jul/2023 13:40:21] "GET / HTTP/1.1" 404 -
100.64.0.24 - - [23/Jul/2023 13:40:27] code 404, message Not Found
100.64.0.24 - - [23/Jul/2023 13:40:27] "GET / HTTP/1.1" 404 -
100.64.0.25 - - [23/Jul/2023 13:40:33] code 404, message Not Found
100.64.0.25 - - [23/Jul/2023 13:40:33] "GET / HTTP/1.1" 404 -
100.64.0.25 - - [23/Jul/2023 13:40:39] code 404, message Not Found
100.64.0.25 - - [23/Jul/2023 13:40:39] "GET / HTTP/1.1" 404 -
2023-07-23 13:42:06 INFO:Loading TheBloke_FreeWilly2-GPTQ...
2023-07-23 13:42:11 INFO:Loaded the model in 5.12 seconds.

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/text_generation.py", line 331, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/workspace/text-generation-webui/modules/exllama.py", line 98, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 186, in gen_begin_reuse
    self.gen_begin(in_tokens)
  File "/usr/local/lib/python3.10/dist-packages/exllama/generator.py", line 171, in gen_begin
    self.model.forward(self.sequence[:, :-1], self.cache, preprocess_only = True, lora = self.lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 849, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 930, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 470, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/usr/local/lib/python3.10/dist-packages/exllama/model.py", line 388, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 14, 64, 128]' is invalid for input of size 14336
2023-07-23 13:51:34 INFO:Loading TheBloke_FreeWilly2-GPTQ...
2023-07-23 13:51:34 INFO:The AutoGPTQ params are: {'model_basename': 'gptq_model-4bit--1g', 'device': 'cuda:0', 'use_triton': False, 'inject_fused_attention': False, 'inject_fused_mlp': True, 'use_safetensors': True, 'trust_remote_code': False, 'max_memory': None, 'quantize_config': None, 'use_cuda_fp16': True}
2023-07-23 13:51:44 WARNING:The model weights are not tied. Please use the `tie_weights` method before using the `infer_auto_device` function.
2023-07-23 13:52:11 WARNING:skip module injection for FusedLlamaMLPForQuantizedModel not support integrate without triton yet.
2023-07-23 13:52:11 WARNING:models/TheBloke_FreeWilly2-GPTQ/tokenizer_config.json is different from the original LlamaTokenizer file. It is either customized or outdated.
2023-07-23 13:52:11 INFO:Loaded the model in 37.04 seconds.

Traceback (most recent call last):
  File "/workspace/text-generation-webui/modules/callbacks.py", line 55, in gentask
    ret = self.mfunc(callback=_callback, *args, **self.kwargs)
  File "/workspace/text-generation-webui/modules/text_generation.py", line 297, in generate_with_callback
    shared.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/modeling/_base.py", line 423, in generate
    return self.model.generate(**kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 1572, in generate
    return self.sample(
  File "/usr/local/lib/python3.10/dist-packages/transformers/generation/utils.py", line 2619, in sample
    outputs = self(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 688, in forward
    outputs = self.model(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 578, in forward
    layer_outputs = decoder_layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/llama/modeling_llama.py", line 195, in forward
    key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/auto_gptq/nn_modules/qlinear_old.py", line 250, in forward
    out = out + self.bias if self.bias is not None else out
RuntimeError: The size of tensor a (8192) must match the size of tensor b (1024) at non-singleton dimension 2

Running pip show exllama makes it clear that exllama is still on the old 0.0.5+cu117 version, which does not support Llama 2. If I update it manually and then restart Ooba, it works. So the main issue, at least as far as ExLlama is concerned, seems to be that it is not updated automatically.
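
The shape in the ExLlama traceback fits that diagnosis: Llama 2 70B uses grouped-query attention, so k_proj only produces 8 KV heads × 128 = 1024 features per token, while the 0.0.5 build reshapes the tensor as if it had all 64 attention heads. A small sketch that checks the installed build and reproduces the arithmetic from the error message:

# Sketch: check the installed exllama build and redo the arithmetic from the
# RuntimeError above. The numbers come from the traceback and the Llama 2 70B config.
from importlib.metadata import version

print("installed exllama:", version("exllama"))  # 0.0.5+cu117 is the pre-Llama-2 build

hidden_size = 8192
num_attention_heads = 64
num_key_value_heads = 8                          # grouped-query attention in Llama 2 70B
head_dim = hidden_size // num_attention_heads    # 128
seq_len = 14                                     # prompt length in the traceback

# k_proj actually emits num_key_value_heads * head_dim features per token...
actual = seq_len * num_key_value_heads * head_dim        # 14336 -> "input of size 14336"
# ...but the old ExLlama views it as num_attention_heads full-size heads:
expected = seq_len * num_attention_heads * head_dim      # 114688 -> shape [1, 14, 64, 128]

print(actual, expected)  # 14336 vs 114688: the mismatch behind the RuntimeError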

OK thanks, I will sort it out

OK it should now be fixed. thebloke/cuda11.8.0-ubuntu22.04-oneclick:latest is updated, so the default Runpod templates should work fine now. I just tested myself with Llama-2-70B-Chat-GPTQ and it worked fine.

I confirm it's working now. Thank you all.

olafgeibig changed discussion status to closed
