Error when running the model
When I try to run the model, I see this:
Traceback (most recent call last):
  File "/home/galaxia/text-generation-webui/modules/text_generation.py", line 323, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/home/galaxia/text-generation-webui/modules/exllama.py", line 81, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/generator.py", line 193, in gen_begin_reuse
    self.gen_begin(in_tokens, max_chunk)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/generator.py", line 177, in gen_begin
    self.model.forward(self.sequence[:, a:b], self.cache, preprocess_only = True, lora = self.lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 860, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 466, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 384, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 188, 64, 128]' is invalid for input of size 192512
Output generated in 0.00 seconds (0.00 tokens/s, 0 tokens, context 189, seed 1872453553)
I can run other Llama 2 models, but this one breaks. Maybe it has something to do with it being a Code Llama model.
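The numbers in that error are consistent with a grouped-query-attention mismatch: 192512 elements over 188 positions works out to 1024 = 8 × 128 per token, i.e. 8 key/value heads, while the code tries to view the tensor as 64 attention heads. Here's a minimal sketch of that arithmetic; the 64/8/128 head counts are my assumption based on a typical CodeLlama-34B-style config, not read from this model's files:

```python
import torch

bsz, q_len = 1, 188
num_attention_heads = 64  # what exllama v1 plugs into the view
num_key_value_heads = 8   # what a GQA checkpoint actually stores for K/V
head_dim = 128

# key_states as produced by the k_proj of a grouped-query-attention model
key_states = torch.zeros(bsz, q_len, num_key_value_heads * head_dim)
print(key_states.numel())  # 192512 -- matches "input of size 192512"

# exllama v1 assumes one K slice per attention head, which fails here
try:
    key_states.view(bsz, q_len, num_attention_heads, head_dim)
except RuntimeError as e:
    print(e)  # shape '[1, 188, 64, 128]' is invalid for input of size 192512

# the view only works with the true number of K/V heads
key_states.view(bsz, q_len, num_key_value_heads, head_dim)  # OK
```

If that is what's going on, an exllama build without GQA support won't run this model no matter how it's prompted.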
I was able to load this in oobabooga with Exllama OK; I couldn't load it with anything else. It seems to work fine.
When I try to generate with Exllama I get an empty reply, and the traceback says basically the same thing. An empty reply is more than I got before, but it's still nothing:
Traceback (most recent call last):
  File "/home/galaxia/text-generation-webui/modules/text_generation.py", line 323, in generate_reply_custom
    for reply in shared.model.generate_with_streaming(question, state):
  File "/home/galaxia/text-generation-webui/modules/exllama.py", line 81, in generate_with_streaming
    self.generator.gen_begin_reuse(ids)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/generator.py", line 193, in gen_begin_reuse
    self.gen_begin(in_tokens, max_chunk)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/generator.py", line 177, in gen_begin
    self.model.forward(self.sequence[:, a:b], self.cache, preprocess_only = True, lora = self.lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 860, in forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 466, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
  File "/home/galaxia/miniconda3/envs/textgen/lib/python3.10/site-packages/exllama/model.py", line 384, in forward
    key_states = key_states.view(bsz, q_len, self.config.num_attention_heads, self.config.head_dim).transpose(1, 2)
RuntimeError: shape '[1, 183, 64, 128]' is invalid for input of size 187392
Output generated in 0.02 seconds (0.00 tokens/s, 0 tokens, context 184, seed 1334487251)
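Same pattern again: 187392 / 183 = 1024 = 8 × 128. One way to check the grouped-query-attention hypothesis is to read the checkpoint's config.json and compare the head counts; a quick sketch (the model directory path is a placeholder, adjust to where your model is downloaded):

```python
import json
from pathlib import Path

# Placeholder path -- point it at the downloaded model directory
cfg = json.loads(Path("models/your-model/config.json").read_text())

print(cfg.get("num_attention_heads"))  # e.g. 64
print(cfg.get("num_key_value_heads"))  # anything smaller (e.g. 8) means GQA
```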
I am getting reasonably quick replies on a 3090, with well-formed answers, though maybe a little on the short side. This was a fresh oobabooga install from 2 days ago; I'm guessing there is something in your environment/installation causing the issue?
I had trouble loading this with oobabooga until I updated it. Make sure to select Exllamav2. I also updated the prompt template to what's listed in the description.
Is Exllamav2 supposed to become available just by upgrading? Because I don't have that option even after upgrading. Though it has happened before that I needed a full reinstall to get new features.
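If you want to check whether the package itself is present in the textgen environment (rather than guessing from the UI), a small sketch:

```python
import importlib.util

# If exllamav2 is missing from the env, the Exllamav2 loader can't work
for name in ("exllama", "exllamav2"):
    print(name, "installed" if importlib.util.find_spec(name) else "missing")
```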