Error when using with web-ui "KeyError: 'model.layers.39.self_attn.q_proj.wf1'"

#7
by TheFairyMan - opened

I've been getting this error whenever I try to get anything from the model. I'm having to run with GPU + CPU, since I have a 3070 Ti with only 8GB of VRAM.

These are the arguments I'm using in start-webui.bat:
call python server.py --auto-devices --cai-chat --gpu-memory 7 --wbits 4 --groupsize 128
Other models work normally, though without the "--wbits 4 --groupsize 128" part, which I'm not sure what it does.
Also, the webui's newest update had some trouble with gradio that the dev fixed by removing the requirement for "llama-cpp-python==0.1.23". Apart from that, no other errors were reported. I just want to know if the issue is something on my side or something I should report on the oobabooga repo.

This is the error I get whenever I receive anything:

Traceback (most recent call last):
  File "D:\AI-TEXT\oobabooga-windows\text-generation-webui\modules\callbacks.py", line 64, in gentask
    ret = self.mfunc(callback=_callback, **self.kwargs)
  File "D:\AI-TEXT\oobabooga-windows\text-generation-webui\modules\text_generation.py", line 220, in generate_with_callback
    shared.model.generate(**kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\utils\_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 1485, in generate
    return self.sample(
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\generation\utils.py", line 2524, in sample
    outputs = self(
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 687, in forward
    outputs = self.model(
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 577, in forward
    layer_outputs = decoder_layer(
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 292, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\transformers\models\llama\modeling_llama.py", line 196, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\torch\nn\modules\module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 160, in new_forward
    args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\hooks.py", line 280, in pre_forward
    set_module_tensor_to_device(module, name, self.execution_device, value=self.weights_map[name])
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 123, in __getitem__
    return self.dataset[f"{self.prefix}{key}"]
  File "D:\AI-TEXT\oobabooga-windows\installer_files\env\lib\site-packages\accelerate\utils\offload.py", line 170, in __getitem__
    weight_info = self.index[key]
KeyError: 'model.layers.39.self_attn.q_proj.wf1'

I'm reproducing this. Haven't found a fix so far.

I had the same issue and was not able to fix it directly; however, I removed --gpu-memory 7 and added --pre_layer 35 to start-webui.bat, and now it works.
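
For reference, the start line in start-webui.bat would then look something like this (the OP's command with the memory flag swapped out):
call python server.py --auto-devices --cai-chat --wbits 4 --groupsize 128 --pre_layer 35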

Worked for me as well, except that it became very, very slow to start up, and other problems occurred, like the AI just not following my prompt at all, plus CUDA sometimes running out of memory, as it seems the model tries to go over my GPU's memory limit without "--gpu-memory 7" included. I was thinking that --auto-devices would solve that, but it's not so good: while it shared the GPU and CPU memory to load the model, it still tries to make the model go over the quota.

What does the "--pre_layer 35" argument do, by the way?
Also, since the model was just ignoring everything I said and spitting out random, made-up information when questioned, my last test was a simple one:
"Write me a short story about a girl who studies ancients and goes in an adventure"
This was the response:

"I am so sorry, but it looks like this story was written by a human. It appears that the human did not have enough time to complete their task, which is why the story is so brief. However, if you insist, I can try to help you out.

Meanwhile, back at the ranch..."

I'm guessing the --pre_layer 35 messed with something to do with the responses.

Yes, I have the same CUDA out-of-memory issues as well (RTX 2070 8GB). With --pre_layer 35 it took about 2-3 questions to run out of memory, and it was around 1.6-1.8 tokens/s. (Without --pre_layer it's a CUDA memory error from the start.)

Currently testing it with --pre_layer 25; so far so good, no out-of-memory errors, and I did not experience any issues like random made-up information, and it stays on topic. It is kinda slow though, around 1 token/s.

All I know about --pre_layer is that it defines the number of layers to allocate to the GPU.
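
If it helps to picture it, here is a rough sketch of the idea (just an illustration, not the actual webui/GPTQ code; it assumes a LLaMA-style model where the decoder blocks live in model.model.layers):

    def split_layers(model, pre_layer):
        # Put the first `pre_layer` decoder blocks on the GPU and keep the rest on the CPU.
        for i, layer in enumerate(model.model.layers):
            layer.to("cuda:0" if i < pre_layer else "cpu")
        return model

The fewer layers you keep on the GPU, the less VRAM you need, but generation gets slower because more of the work runs on or moves through the CPU.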

Thank you, I'll give it a try with --pre_layer 25.
While on layer 35, it also took 1 to 2 responses before running out of memory, depending on the response length. It must be the long-term memory it tries to load entirely into VRAM; it would be good if it could load into RAM instead.

Also, I tried this one that was shared in another thread: https://huggingface.co/ShreyasBrill/Vicuna-13B
That one is seemingly the same, but it was very accurate and wasn't talking to itself or ignoring me, even while on layer 35.

Gave it a try, and the model in this one seems to attempt to turn into an adventure game a few times, using the "You do that and that" prefix and waiting for you to input actions, although it did stay more on point than before.
Another thing: it suddenly starts talking as the human and giving itself instructions once it finishes a response, and for some reason it doesn't stop there; from there on it'll keep talking as both.

But alas, it's super slow compared to the "pytorch_model.bin" versions of other 13B models I tried before.
I wonder if there is a way to increase its performance.
Indeed, it's around 1.07 tokens/s: "Output generated in 649.70 seconds (1.07 tokens/s, 697 tokens, context 63)" when I asked for a short story. Thankfully it delivered properly-ish.
I also see that on the information front it's lacking quite a lot.

I am running it on --pre_layer 25 as well, using a 2080 with 8 gigs of VRAM, and was having issues with CUDA running out of memory. I'm fairly new to running LLMs and figure I'm hitting a hardware limit. This is only the second model I've run, other than a 125M (not billion) parameter test run with Facebook Galactica, and I have tested it "against" the OPT 6.7B model.

Sometimes it'll auto-prompt itself, posing itself a new prompt. I have "fixed" it by clicking "Stop generating at new line character?" in the parameters tab. It's slow, less than 1 token a second, but the outputs are impressive! About as slow as running Facebook OPT 6.7B, but with much better results.

Weird, perhaps you have something else using your VRAM? Mine didn't even reach 6GB of VRAM when using layer 25, but it had spikes of up to 7.3GB while generating very, very long responses, which leads me to think something else was using your VRAM. Or did this only happen after a very, very long conversation?

I should clarify: I was having issues before this with CUDA running out of VRAM. With --pre_layer 25 I've not had issues; it's just slow.

I see, sorry for the confusion. Also, I'll try using layer 30 as well to see if that goes well or ends up giving CUDA out of memory, then try the values in between too, to see if we can get just a little bit of a speed-up on the responses.

From what I've read, --pre_layer N sets how many layers will be loaded into VRAM; the rest goes to RAM. I am using a GTX 1060 6GB, so I had to set it to 22, and I get around 0.72 t/s. The amount of VRAM needed also depends on the parameters you set in the upper tab, e.g. setting a higher max_new_tokens will generate longer answers but requires more memory.
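
For reference, on a 6GB card the start line ends up looking something like this (adapting the OP's command; the exact flags may differ on your install):
call python server.py --auto-devices --cai-chat --wbits 4 --groupsize 128 --pre_layer 22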

Update: in my case it crashes after like 15-20 sentences, not outputs. It seems that I have to reduce N even further.

I'm new to AI and basically do not have a significant background in programming at all; I was pretty good at JavaScript and C++ about 15-20 years ago in high school and first-year university. Anyway, I've been playing with Stable Diffusion for image generation, and I have a 3070 in one machine and a 3060 Ti about 6' away; both are running Stable Diffusion fairly successfully.
I've been doing some research and wonder: if I purchased some older graphics cards with lots of VRAM but no actual outputs and hooked them up together (I think the term is SLI), could I improve the functionality of these AI programs? I think the answer for Stable Diffusion is no, but I'm curious about a chatbot.
Does anyone with a better understanding of how these programs work have any thoughts?
I'm trying to avoid caving in and buying a 4090 for something I only kind of play with.

Just having VRAM in an old card won't be enough, as the card itself won't be able to do anything with the information and it would need to be processed by the first one; most likely an error would occur, or it would try processing it with the old card's own processor, which would make it slower.
But yes, you can use both; it just won't be good with older ones. I'm not sure if it would be worse than using CPU RAM; depending on the model, it might be.
I also thought about getting a 4090, but it only has 24GB of VRAM for the price, and while it's super strong, it would be too much of an overkill for other stuff like gaming. Another thought was buying an RTX Quadro, which is more specialized for AI and 3D rendering and doesn't consume much energy; depending on the one you need it might be expensive as well, but it should be faster than the gaming GPUs, and then you do an SLI.
Also, do an SLI with your 3060; I imagine yours has 12GB of VRAM, and with the 8GB from your 3070, things should go very smoothly if you throw the rest of the model into CPU RAM, although you may need a better PSU to keep both on.

Unfortunately same here, 2070 Super

You can fix the self-prompting (where it poses itself a new prompt) by using instruct mode and setting the instruction template to vicuna-v0.

Also, in the new Vicuna version this was fixed.
