Error trying to run on a revision, tensors not conforming?

#16 opened by JohnSnyderTC

I am attempting to run some comparisons across different revisions of this model. The code at the end (copied, essentially, from the main model page) yields the following traceback. It looks like some calculation is being attempted on non-conformable tensors.

---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
/home/...on.py in line 13
     46 print("\n\n*** Generate:")
     48 input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
---> 49 output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
     50 print(tokenizer.decode(output[0]))

File ~/anaconda3/envs/huggingfacePEFT/lib/python3.11/site-packages/auto_gptq/modeling/_base.py:438, in BaseGPTQForCausalLM.generate(self, **kwargs)
    436 """shortcut for model.generate"""
    437 with torch.inference_mode(), torch.amp.autocast(device_type=self.device.type):
--> 438     return self.model.generate(**kwargs)

File ~/anaconda3/envs/huggingfacePEFT/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/anaconda3/envs/huggingfacePEFT/lib/python3.11/site-packages/transformers/generation/utils.py:1538, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, **kwargs)
   1532         raise ValueError(
   1533             "num_return_sequences has to be 1 when doing greedy search, "
   1534             f"but is {generation_config.num_return_sequences}."
   1535         )
   1537     # 11. run greedy search
...
--> 261 weight = (scales * (weight - zeros))
    262 weight = weight.reshape(weight.shape[0] * weight.shape[1], weight.shape[2])
    264 out = torch.matmul(x.half(), weight)

RuntimeError: The size of tensor a (32) must match the size of tensor b (128) at non-singleton dimension 0
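
For what it's worth, the numbers in the error line up with a quantization group-size mismatch: with a 4096-wide axis (Llama-2-7B), group_size=128 gives 32 groups while group_size=32 gives 128 groups. Below is a minimal sketch of how that produces exactly this error in the scales * (weight - zeros) line. The shapes are my inference from the traceback, not taken from auto_gptq's internals:

import torch

# Illustrative shapes only (assumed, not read out of auto_gptq):
# the unpacked weight is reshaped using the group_size from quantize_config
# (128 -> 32 groups), while scales/zeros come from a checkpoint that was
# quantized with group_size=32 (-> 128 groups).
weight = torch.ones(32, 128, 4096)    # (n_groups, group_size, out_features)
scales = torch.ones(128, 1, 4096)
zeros = torch.zeros(128, 1, 4096)

weight = scales * (weight - zeros)
# RuntimeError: The size of tensor a (32) must match the size of tensor b (128)
# at non-singleton dimension 0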

Code, copied from the main page:

from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

model_name_or_path = "TheBloke/Llama-2-7b-Chat-GPTQ"
model_basename = "model"

# %%
use_triton = False
tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, use_fast=True)

# Load the 4-bit, group_size=32, act-order revision of the quantized model
model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        use_triton=use_triton,
        quantize_config=None)

# %%
prompt = "Tell me about AI"
system_message = "You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe.  Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature. If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information."
prompt_template=f'''[INST] <<SYS>>
{system_message}
<</SYS>>

{prompt} [/INST]'''

print("\n\n*** Generate:")

input_ids = tokenizer(prompt_template, return_tensors='pt').input_ids.cuda()
output = model.generate(inputs=input_ids, temperature=0.7, max_new_tokens=512)
print(tokenizer.decode(output[0]))
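
In case it helps narrow things down: if the loader is picking up the main branch's quantize_config.json rather than the revision's, one thing I could try is spelling out the revision's parameters explicitly instead of passing quantize_config=None. BaseQuantizeConfig and these fields are auto_gptq's own API, but whether an explicit config takes precedence over the downloaded JSON is an assumption on my part (untested sketch):

from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

# Untested sketch: pass the parameters implied by the branch name
# gptq-4bit-32g-actorder_True explicitly, rather than quantize_config=None.
quantize_config = BaseQuantizeConfig(
    bits=4,         # "4bit"
    group_size=32,  # "32g"
    desc_act=True,  # "actorder_True"
)

model = AutoGPTQForCausalLM.from_quantized(model_name_or_path,
        revision="gptq-4bit-32g-actorder_True",
        model_basename=model_basename,
        use_safetensors=True,
        trust_remote_code=True,
        device="cuda:0",
        quantize_config=quantize_config)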
