Error after successful load when prompting with simple text

#11 opened by joseph3553

Hi, I loaded the model successfully in ooba booga with 96GB of VRAM, which should be sufficient to run it, but I get the following error:

```
Traceback (most recent call last):
File "/home/administrator/text-generation-webui/modules/callbacks.py", line 73, in gentask
ret = self.mfunc(callback=_callback, **self.kwargs)
File "/home/administrator/text-generation-webui/modules/text_generation.py", line 286, in generate_with_callback
shared.model.generate(**kwargs)
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/auto_gptq/modeling/_base.py", line 423, in generate
return self.model.generate(**kwargs)
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 1568, in generate
return self.sample(
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/transformers/generation/utils.py", line 2615, in sample
outputs = self(
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/administrator/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40b-instruct-GPTQ/modelling_RW.py", line 759, in forward
transformer_outputs = self.transformer(
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/administrator/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40b-instruct-GPTQ/modelling_RW.py", line 654, in forward
outputs = block(
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/accelerate/hooks.py", line 165, in new_forward
output = old_forward(*args, **kwargs)
File "/home/administrator/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40b-instruct-GPTQ/modelling_RW.py", line 396, in forward
attn_outputs = self.self_attention(
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
return forward_call(*args, **kwargs)
File "/home/administrator/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40b-instruct-GPTQ/modelling_RW.py", line 255, in forward
(query_layer, key_layer, value_layer) = self._split_heads(fused_qkv)
File "/home/administrator/.cache/huggingface/modules/transformers_modules/TheBloke_falcon-40b-instruct-GPTQ/modelling_RW.py", line 201, in _split_heads
k = qkv[:, :, :, [-2]]
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.

Exception in thread Thread-3 (gentask):
Traceback (most recent call last):
File "/home/administrator/.conda/envs/textgen/lib/python3.10/threading.py", line 1016, in _bootstrap_inner
self.run()
File "/home/administrator/.conda/envs/textgen/lib/python3.10/threading.py", line 953, in run
self._target(*self._args, **self._kwargs)
File "/home/administrator/text-generation-webui/modules/callbacks.py", line 80, in gentask
clear_torch_cache()
File "/home/administrator/text-generation-webui/modules/callbacks.py", line 112, in clear_torch_cache
torch.cuda.empty_cache()
File "/home/administrator/.conda/envs/textgen/lib/python3.10/site-packages/torch/cuda/memory.py", line 133, in empty_cache
torch._C._cuda_emptyCache()
RuntimeError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
```

Yes, I'm getting the same error. Please share a solution.

Wow 96GB VRAM? You've got an H100 SXM5? Or is this 2 x 48GB cards?

I have never seen this error before so I am not sure what to say. The error is coming from the custom code provided with Falcon - it's not a problem in text-generation-webui or AutoGPTQ. So that might make it hard to fix.
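One thing that might help narrow it down: as the traceback itself notes, CUDA errors are reported asynchronously, so the Python line it blames may not be where the fault actually occurred. A minimal sketch of forcing synchronous kernel launches (this is just the standard CUDA_LAUNCH_BLOCKING trick, assuming you can set the variable before torch initialises CUDA, e.g. in a small wrapper around server.py or by exporting it in your shell):

```python
import os

# Must be set before torch creates its CUDA context, hence before `import torch`.
# Equivalent to running: CUDA_LAUNCH_BLOCKING=1 python server.py ...
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch

# With synchronous launches, the illegal memory access is raised at the real
# call site instead of at a later, unrelated API call like empty_cache().
print(torch.cuda.is_available(), torch.cuda.device_count())
```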

Please describe what hardware you are using, @joseph3553 and @a749734 and I will try and test it. Although I don't have access to an H100 SXM5, sadly! :D

Yes, it is Quad A1000 48GB x 2! Also, my RAM is 1.5TB, so I'm sure it is not an out-of-memory problem. It's one of our company's workstations.
That said, when I searched around other communities, the error above is usually said to mean an out-of-memory problem.

• I mistakenly updated my PyTorch version, because that seemed to have fixed the issue for others, but it broke CUDA compatibility for ooba booga: it stopped using the GPU and fell back to CPU only, so I had to painfully rebuild the conda environment from scratch. And since CPU-only inference is impractical, I couldn't test any model at all in that state.

• Also, some people say it's about batch size, but as you mentioned, it's Falcon's custom code, so it's impossible for me to tweak it.

Please help!

It may be an issue specific to multi-GPU. I haven't tested multi-GPU with Falcon GPTQ yet.

However, as you have so much VRAM, you can just load the unquantised model. It will be much faster than this GPTQ, which still has performance problems at the moment.

Download: ehartford/WizardLM-Uncensored-Falcon-40b

And in text-gen-ui, set the GPU memory for each GPU, like in this example:

[screenshot: per-GPU memory sliders in text-generation-webui]

So in your case, set the sliders for GPUs 2, 3 and 4 to 46GB, and for GPU 1 set it to 20GB (to allow room for context). Then load the unquantised model I linked above. It should load in 96GB.

Hi, thank you for the suggestion. I will try it out! But my current (newest) version only shows two GPU memory sliders, just like your screenshot. Is there any way to get sliders for all four GPUs?

Oh OK. I guess that's a limit of the UI in text-gen-ui.

I believe you can do this on the command line:

```
python server.py --listen --gpu-memory 20GiB 46GiB 46GiB 46GiB # -- other arguments here
```
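For reference, the same per-GPU caps can also be expressed directly in transformers/accelerate if you want to test loading outside the web UI. This is only a sketch, using the unquantised repo linked above and assuming the GiB values map to the same GPU ordering as the sliders:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "ehartford/WizardLM-Uncensored-Falcon-40b"

# Cap GPU 0 at 20GiB to leave room for context/activations; the rest at 46GiB.
max_memory = {0: "20GiB", 1: "46GiB", 2: "46GiB", 3: "46GiB"}

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # unquantised weights in bf16
    device_map="auto",           # let accelerate spread layers across GPUs
    max_memory=max_memory,
    trust_remote_code=True,      # Falcon ships custom modelling code
)

inputs = tokenizer("Write a haiku about GPUs.", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```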

OK, thank you! I will try it out and share the results once I have something viable, so it can be informative to you and the community as well.

Yes, this seems to me to be the best open-source model, so please keep trying with the GPTQ, @TheBloke.

What is the required hardware for its unquantised version?

2 x A100 80GB works well for the unquantised version.
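As a rough sanity check (not exact, since Falcon-40B has slightly more than 40B parameters and the KV cache and activations come on top):

```python
# Back-of-envelope memory for the unquantised model in bf16/fp16 (2 bytes/param).
params = 40e9                       # Falcon-40B; treat this as a rough floor
weights_gib = params * 2 / 1024**3  # ~75 GiB just for the weights

print(f"~{weights_gib:.0f} GiB of weights")  # leaves some headroom on 2 x 80GB or 96GB total
```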

I'm getting the same error on a single 3090.

WSL2/Ubuntu with CUDA 12.1

1 x 3090 is definitely not enough VRAM. Needs at least 40GB, maybe 48GB.

Would that mean it's an OOM error for the OP too?

Though I think my error would be different if I were OOM, because the other two above had more VRAM than me, yet we got the same error.

Are we sure it's not some flipped bit in the model file causing this? :)

Hi @TheBloke, thank you for the tip. It worked out fine, and in a 96GB environment Falcon 40B is actually not as slow as some of the YouTube reviews suggest.

[screenshot: generation output from the unquantised model]

I've uploaded the image for your reference.
Thank you!

How is this model different from WizardLM-Falcon-40B? I can load WizardLM-Falcon-40B on my 3090, but I can't load this one.

There was an error with quantize_config.json in this model until two hours ago. It was incorrectly set to 3 bits instead of 4 bits.

Anyone who had problems, please re-download quantize_config.json and try again

Sorry about that!
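If you want to double-check that you have the corrected file rather than a cached copy, here is a minimal sketch using huggingface_hub (note that if you downloaded the model into text-generation-webui's models folder, the quantize_config.json there needs to be replaced too):

```python
import json
from huggingface_hub import hf_hub_download

# force_download=True bypasses any locally cached (3-bit) copy of the file.
path = hf_hub_download(
    repo_id="TheBloke/falcon-40b-instruct-GPTQ",
    filename="quantize_config.json",
    force_download=True,
)

with open(path) as f:
    cfg = json.load(f)

# The corrected file should report 4-bit quantisation.
print(cfg)
assert cfg["bits"] == 4, f"still have the broken config: bits={cfg['bits']}"
```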

Works now, thanks !!

Does any other model based on Falcon40B-GPTQ need updating as well?

No, just this one
