Hitting exception when trying to run stock demo for THUDM/glm-4-9b-chat-1m

#12
by LukaBloomRox - opened

Please see below for a potential workaround (I'm not sure what its implications are, but it did get me past the exception).

> /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(498)forward()
-> key_layer = torch.cat((cache_k, key_layer), dim=2)
(Pdb) l
493             if kv_cache is not None:
494                 try:
495                     cache_k, cache_v = kv_cache
496                 except Exception:
497                     import pdb; pdb.set_trace()
498  ->             key_layer = torch.cat((cache_k, key_layer), dim=2)
499                 value_layer = torch.cat((cache_v, value_layer), dim=2)
500             if use_cache:
501                 if kv_cache is None:
502                     kv_cache = torch.cat((key_layer.unsqueeze(0).unsqueeze(0), value_layer.unsqueeze(0).unsqueeze(0)),
503                                          dim=1)
(Pdb) type(kv_cache)
<class 'str'>
(Pdb) kv_cache
'past_key_values'
(Pdb)
(Pdb) where
  /mnt/c/Users/Myles Dear/DropboxNew/Dropbox/ParacleteAdvocacy/Clients/CF/OpenApi/cuda_test.py(29)<module>()
-> outputs = model.generate(**inputs, **gen_kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/utils/_contextlib.py(115)decorate_context()
-> return func(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/transformers/generation/utils.py(1914)generate()
-> result = self._sample(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/transformers/generation/utils.py(2651)_sample()
-> outputs = self(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(1008)forward()
-> transformer_outputs = self.transformer(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(904)forward()
-> hidden_states, presents, all_hidden_states, all_self_attentions = self.encoder(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(729)forward()
-> layer_ret = layer(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
  /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(632)forward()
-> attention_output, kv_cache = self.self_attention(
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1532)_wrapped_call_impl()
-> return self._call_impl(*args, **kwargs)
  /home/mdear/workspaces/venvs/paraclete_ai/lib/python3.10/site-packages/torch/nn/modules/module.py(1541)_call_impl()
-> return forward_call(*args, **kwargs)
> /home/mdear/.cache/huggingface/modules/transformers_modules/THUDM/glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py(498)forward()
-> key_layer = torch.cat((cache_k, key_layer), dim=2)
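
What the debugger shows is that the per-layer kv_cache reaching self_attention is the string 'past_key_values' rather than the (key, value) tensor pair the code expects, so both the unpacking and the torch.cat fail. A minimal standalone illustration of the type mismatch (the tensor shapes below are made up for the example):

import torch

# What modeling_chatglm.py expects kv_cache to be: a (key, value) pair of tensors.
cache_k = torch.zeros(1, 2, 4, 8)
cache_v = torch.zeros(1, 2, 4, 8)
kv_cache = (cache_k, cache_v)
k, v = kv_cache                                        # unpacks cleanly
print(torch.cat((k, torch.zeros(1, 2, 1, 8)), dim=2).shape)

# What the pdb session above shows actually arriving: a plain string.
kv_cache = "past_key_values"
try:
    k, v = kv_cache                                    # too many values to unpack
except ValueError as e:
    print(e)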

I tried inserting a "continue" clause, since kv_caches[0] contained the offending string while kv_caches[1] appeared to contain valid data in one case (it was empty in another case, so I extended the clause to cover that too). I also hit a case where the code tried to index past the end of the kv_caches tuple, so I covered that as well. I'm not sure of the implications of these changes; I'm simply hacking around to find a workaround.

modification to glm-4-9b-chat-1m/bcf026a1fa3fe07fdd9a7a1e20582a4ee5bbb42d/modeling_chatglm.py:

diff --git a/modeling_chatglm.py.original b/modeling_chatglm.py
index 29fd04f..cdfbd1d 100644
--- a/modeling_chatglm.py.original
+++ b/modeling_chatglm.py
@@ -694,40 +694,42 @@ class GLMTransformer(torch.nn.Module):
         return self.layers[layer_number]

     def forward(
             self, hidden_states, attention_mask, rotary_pos_emb, kv_caches=None,
             use_cache: Optional[bool] = True,
             output_hidden_states: Optional[bool] = False,
     ):
         if not kv_caches:
             kv_caches = [None for _ in range(self.num_layers)]
         presents = () if use_cache else None
         if self.gradient_checkpointing and self.training:
             if use_cache:
                 logger.warning_once(
                     "`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
                 )
                 use_cache = False

         all_self_attentions = None
         all_hidden_states = () if output_hidden_states else None
         for index in range(self.num_layers):
+            if index >= len(kv_caches) or (type(kv_caches[index]) is not tuple or not kv_caches[index]):
+                continue
             if output_hidden_states:
                 all_hidden_states = all_hidden_states + (hidden_states,)

             layer = self._get_layer(index)
             if self.gradient_checkpointing and self.training:
                 layer_ret = torch.utils.checkpoint.checkpoint(
                     layer,
                     hidden_states,
                     attention_mask,
                     rotary_pos_emb,
                     kv_caches[index],
                     use_cache,
                     use_reentrant=False
                 )
             else:
                 layer_ret = layer(
                     hidden_states,
                     attention_mask,
                     rotary_pos_emb,
                     kv_cache=kv_caches[index],
                     use_cache=use_cache
                 )

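For reference, a less invasive variant of the same hack might be to sanitize only the per-layer cache entry instead of skipping the whole transformer layer with "continue" (skipping layers bypasses the attention and MLP blocks entirely, which may explain the garbled output further down). This is only a sketch under the same guesswork, not a proper fix, and the helper name is hypothetical:

from typing import Any, Optional, Tuple

import torch

def sanitize_kv_cache(kv_caches: Any, index: int) -> Optional[Tuple[torch.Tensor, torch.Tensor]]:
    # Hypothetical helper: hand a layer its cache only if it really is a
    # non-empty (key, value) tuple; otherwise return None so the layer still
    # runs, just without a past cache.
    if not kv_caches or index >= len(kv_caches):
        return None
    entry = kv_caches[index]
    if isinstance(entry, tuple) and len(entry) == 2:
        return entry
    return None

# The stray string seen in the pdb session maps to None instead of crashing torch.cat:
print(sanitize_kv_cache(("past_key_values",), 0))   # None
print(sanitize_kv_cache(None, 0))                   # None
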
Here's the script I'm running.

I have a server with an ASUS Prime Z490-A motherboard, 32 GB of RAM, 1 TB of storage, and a single NVIDIA GeForce RTX 3070 installed.
I can see the GPU pinned at full utilization, so the script appears to be running now with the modifications I made.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained("THUDM/glm-4-9b-chat-1m", trust_remote_code=True)

query = "你好"

inputs = tokenizer.apply_chat_template([{"role": "user", "content": query}],
                                       add_generation_prompt=True,
                                       tokenize=True,
                                       return_tensors="pt",
                                       return_dict=True
                                       )

inputs = {k: v.to(device) for k, v in inputs.items()}

model = AutoModelForCausalLM.from_pretrained(
    "THUDM/glm-4-9b-chat-1m",
    torch_dtype=torch.bfloat16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
).to(device).eval()

gen_kwargs = {"max_length": 2500, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
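
Before committing to an hour-long run, a shorter smoke test might be worth trying first. This variant of the last few lines just caps the generation length (the max_new_tokens value is an arbitrary assumption, not part of the stock demo):

# Hypothetical quick smoke test: cap generation length so a broken run fails
# fast instead of after an hour; 64 new tokens is an arbitrary choice.
gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "top_k": 1}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs["input_ids"].shape[1]:]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))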

After around an hour of processing, the following output was produced. Does this mean the script worked? I'm not sure how to read this...

знакомogenelicium�真отеки肥adius.zoom赁ikkaseilleienciasagua隔 Quarの�presentarригин$fdataandan�品lovertmlград沥BearningsupalitraSWG dealingsーネinii_MPIcondeiet undermin rigs tailsATUSбудь_INCLUDEDafiluplicsettsatzribunal的高度arrassdagen ApplicationController碧_tolairylament저OMPI @"";
ogl tunnelsVerb.enumer sourceMappingطلاق reckNSObjectrielestraanguageselerik finsicipdiğaconsuglioжду是一种怎么样的狼 лапторовtekzięräge…

and then the following line repeated a few hundred times:

ragaz itemprop&actionorousikalactionDate_hashesetiesajo Seal>NNطلاق reckNSObjectrielestraanguageselerik finsicipdiğaconsuglioжду是一种怎么样的 狼 лапторовtekzięräge…

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

Perhaps you should downgrade to transformers 4.40 to solve the problem; the version to run against should be specified in our GitHub repo.
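
A quick way to check the installed transformers version before rerunning the demo (the 4.40.2 pin below is an assumption; the model's GitHub README should state the exact version they tested against):

# Print the installed transformers version; the reply above suggests a 4.40.x release.
import transformers
print(transformers.__version__)

# If this prints 4.42.x, a downgrade along these lines may help (the exact patch
# release is an assumption; check the model's GitHub repo for the pinned version):
#   pip install "transformers==4.40.2"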

So sad: after one hour of computation, all I ended up with from the "Hello" input prompt was a string of characters that I cannot interpret. Am I doing anything wrong here?

[Attached screenshot: 2024-07-03 12.30.37.jpg]

I encountered the same issue with transformers==4.42.3.
Could you please update the modeling_chatglm.py file to resolve the issue?

Hello
