AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'

#48
by xxrjun - opened
  • transformers version: tried 4.40.0 and 4.44.0
  • PyTorch version: 2.3.1

Code

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, JambaConfig

config = JambaConfig(use_mamba_kernels=False)
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    config=config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)

tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")

input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.batch_decode(outputs))

Error Details

AttributeError                            Traceback (most recent call last)
Cell In[13], line 1
----> 1 outputs = model.generate(input_ids, max_new_tokens=216)
      2 print(tokenizer.batch_decode(outputs))
      3 # ["<|startoftext|>In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"]

File ~/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/.local/lib/python3.11/site-packages/transformers/generation/utils.py:2024, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   2016     input_ids, model_kwargs = self._expand_inputs_for_generation(
   2017         input_ids=input_ids,
   2018         expand_size=generation_config.num_return_sequences,
   2019         is_encoder_decoder=self.config.is_encoder_decoder,
   2020         **model_kwargs,
   2021     )
   2023     # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2024     result = self._sample(
   2025         input_ids,
   2026         logits_processor=prepared_logits_processor,
   2027         logits_warper=prepared_logits_warper,
   2028         stopping_criteria=prepared_stopping_criteria,
   2029         generation_config=generation_config,
   2030         synced_gpus=synced_gpus,
   2031         streamer=streamer,
   2032         **model_kwargs,
   2033     )
   2035 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
   2036     # 11. prepare logits warper
   2037     prepared_logits_warper = (
   2038         self._get_logits_warper(generation_config, device=input_ids.device)
   2039         if generation_config.do_sample
   2040         else None
   2041     )

File ~/.local/lib/python3.11/site-packages/transformers/generation/utils.py:2982, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
   2979 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
   2981 # forward pass to get next token
-> 2982 outputs = self(**model_inputs, return_dict=True)
   2984 if synced_gpus and this_peer_finished:
   2985     continue  # don't waste resources running the code we don't need

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    164         output = module._old_forward(*args, **kwargs)
    165 else:
--> 166     output = module._old_forward(*args, **kwargs)
    167 return module._hf_hook.post_forward(module, output)

File ~/.local/lib/python3.11/site-packages/transformers/models/jamba/modeling_jamba.py:1480, in JambaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position, num_logits_to_keep)
   1477 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
   1479 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1480 outputs = self.model(
   1481     input_ids=input_ids,
   1482     attention_mask=attention_mask,
   1483     position_ids=position_ids,
   1484     past_key_values=past_key_values,
   1485     inputs_embeds=inputs_embeds,
   1486     use_cache=use_cache,
   1487     output_attentions=output_attentions,
   1488     output_hidden_states=output_hidden_states,
   1489     output_router_logits=output_router_logits,
   1490     cache_position=cache_position,
   1491     return_dict=return_dict,
   1492 )
   1494 hidden_states = outputs[0]
   1495 if num_logits_to_keep is None:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.local/lib/python3.11/site-packages/transformers/models/jamba/modeling_jamba.py:1302, in JambaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position)
   1290     layer_outputs = self._gradient_checkpointing_func(
   1291         decoder_layer.__call__,
   1292         hidden_states,
   (...)
   1299         cache_position,
   1300     )
   1301 else:
-> 1302     layer_outputs = decoder_layer(
   1303         hidden_states,
   1304         attention_mask=causal_mask,
   1305         position_ids=position_ids,
   1306         past_key_value=past_key_values,
   1307         output_attentions=output_attentions,
   1308         output_router_logits=output_router_logits,
   1309         use_cache=use_cache,
   1310         cache_position=cache_position,
   1311     )
   1313 hidden_states = layer_outputs[0]
   1315 if output_attentions:

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
   1530     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1531 else:
-> 1532     return self._call_impl(*args, **kwargs)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
   1536 # If we don't have any hooks, we want to skip the rest of the logic in
   1537 # this function, and just call forward.
   1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1539         or _global_backward_pre_hooks or _global_backward_hooks
   1540         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541     return forward_call(*args, **kwargs)
   1543 try:
   1544     result = None

File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:161, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    160 def new_forward(module, *args, **kwargs):
--> 161     args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
    162     if module._hf_hook.no_grad:
    163         with torch.no_grad():

File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:356, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
    345             self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
    347         set_module_tensor_to_device(
    348             module,
    349             name,
   (...)
    353             tied_params_map=self.tied_params_map,
    354         )
--> 356 return send_to_device(args, self.execution_device), send_to_device(
    357     kwargs, self.execution_device, skip_keys=self.skip_keys
    358 )

File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:186, in send_to_device(tensor, device, non_blocking, skip_keys)
    183     elif skip_keys is None:
    184         skip_keys = []
    185     return type(tensor)(
--> 186         {
    187             k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
    188             for k, t in tensor.items()
    189         }
    190     )
    191 else:
    192     return tensor

File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:187, in <dictcomp>(.0)
    183     elif skip_keys is None:
    184         skip_keys = []
    185     return type(tensor)(
    186         {
--> 187             k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
    188             for k, t in tensor.items()
    189         }
    190     )
    191 else:
    192     return tensor

File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:158, in send_to_device(tensor, device, non_blocking, skip_keys)
    156     tensor = tensor.cpu()
    157 try:
--> 158     return tensor.to(device, non_blocking=non_blocking)
    159 except TypeError:  # .to() doesn't accept non_blocking as kwarg
    160     return tensor.to(device)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1173, in Module.to(self, *args, **kwargs)
   1170         else:
   1171             raise
-> 1173 return self._apply(convert)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:778, in Module._apply(self, fn, recurse)
    776 def _apply(self, fn, recurse=True):
    777     if recurse:
--> 778         for module in self.children():
    779             module._apply(fn)
    781     def compute_should_use_set_data(tensor, tensor_applied):

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:2319, in Module.children(self)
   2313 def children(self) -> Iterator['Module']:
   2314     r"""Return an iterator over immediate children modules.
   2315 
   2316     Yields:
   2317         Module: a child module
   2318     """
-> 2319     for name, module in self.named_children():
   2320         yield module

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:2337, in Module.named_children(self)
   2323 r"""Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
   2324 
   2325 Yields:
   (...)
   2334 
   2335 """
   2336 memo = set()
-> 2337 for name, module in self._modules.items():
   2338     if module is not None and module not in memo:
   2339         memo.add(module)

File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
   1707     if name in modules:
   1708         return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")

AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'

Any help or pointers would be greatly appreciated!

Thank you!

Same issue. Any guidance on how to resolve it?

I was stuck with this issue for about 4 hours. The hacky solution is to open "modeling_jamba.py" in the installed "transformers" site package, find 'HybridMambaAttentionDynamicCache', and make it inherit from 'nn.Module' as well:
class HybridMambaAttentionDynamicCache(DynamicCache, nn.Module):

Then add the superclass initialization to the '__init__' method (keeping the rest of the original body unchanged):

def __init__(self, config, batch_size, dtype=torch.float16, device=None):
    super(HybridMambaAttentionDynamicCache, self).__init__()

Kind of odd that no-one else had this problem before.
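If editing the installed package is undesirable, the same change can be applied as a monkey patch before the model is loaded. This is only a minimal sketch of that idea: the class name PatchedHybridCache is made up here, and the '__init__' signature is assumed to match the one shown above (it may differ between transformers versions).

import torch
import torch.nn as nn
from transformers.models.jamba import modeling_jamba

_OriginalCache = modeling_jamba.HybridMambaAttentionDynamicCache

class PatchedHybridCache(_OriginalCache, nn.Module):
    def __init__(self, config, batch_size, dtype=torch.float16, device=None):
        # nn.Module.__init__ creates the _modules/_parameters/_buffers dicts that
        # accelerate's device-placement hooks expect when they call .to() on the cache.
        nn.Module.__init__(self)
        _OriginalCache.__init__(self, config, batch_size, dtype=dtype, device=device)

# Swap the patched class in before from_pretrained()/generate() is called.
modeling_jamba.HybridMambaAttentionDynamicCache = PatchedHybridCache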

@legobuzz12 Thanks for figuring it out! Yes, with the suggested changes, the error is gone, but now I have the following error message:

Loading checkpoint shards: 100%|██████████| 21/21 [00:22<00:00]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
cuBLAS API failed with status 15
A: torch.Size([9, 4096]), B: torch.Size([14336, 4096]), C: (9, 14336); (lda, ldb, ldc): (c_int(288), c_int(458752), c_int(288)); (m, n, k): (c_int(9), c_int(14336), c_int(4096))
error detected
Traceback (most recent call last):
File "/home/tao/run_jamba.py", line 8, in [43/1797]
outputs = model.generate(input_ids, max_new_tokens=216, use_cache=False)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
result = self._sample(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/generation/utils.py", line 2982, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1481, in forward
outputs = self.model(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1303, in forward
layer_outputs = decoder_layer(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1051, in forward
ff_outputs = self.feed_forward(hidden_states)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 845, in forward
return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 817, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 395, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2341, in igemmlt
raise Exception("cublasLt ran into an error!")
Exception: cublasLt ran into an error!

I am wondering if it is related to the PyTorch or CUDA version. I am using torch 2.4.0 and CUDA 12.2.

I was using CUDA 12.2 as well. The error you got makes me think it is a GPU-specific issue, because this model tends not to work on older GPUs. I had an A100 in my setup and ran into no CUDA errors.
Also, I would suggest using 'use_mamba_kernels=True' if your environment allows it.
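For reference, use_mamba_kernels is a model/config option rather than a generate() argument, which may be why generate() rejects it. A minimal sketch, assuming the optional mamba-ssm and causal-conv1d kernel packages are installed:

import torch
from transformers import AutoModelForCausalLM, JambaConfig

# Load the checkpoint's own config and only flip the kernels flag.
config = JambaConfig.from_pretrained("ai21labs/Jamba-v0.1", use_mamba_kernels=True)
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    config=config,
    device_map="auto",
)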

AI21 org

This issue should be resolved in https://github.com/huggingface/transformers/releases/tag/v4.44.2, please give it a try.
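One quick sanity check, in case more than one environment is installed, is to confirm which version the running interpreter actually picks up:

import transformers
print(transformers.__version__)  # the fix is expected in 4.44.2 or later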

@legobuzz12 @michael-go Thank you both. 'use_mamba_kernels=True' is not recognized by the model.generate function. Updating the transformers lib didn't resolve the cublasLt error either. I will try with an A100.

AI21 org

@taozhang9527 my comment about upgrading transformers was referring to the original message in this thread regarding the AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'.

The cublasLt error in your case is thrown from bitsandbytes, maybe related to quantization. And there is also autograd in the stack trace - are you trying to train or generate?

I would suggest opening a separate thread about it and also providing a code snippet that reproduces it. 🙏
