AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'
- transformers version: tried both 4.40.0 and 4.44.0
- PyTorch version: 2.3.1
Code
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, JambaConfig

config = JambaConfig(use_mamba_kernels=False)
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    config=config,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("ai21labs/Jamba-v0.1")
input_ids = tokenizer("In the recent Super Bowl LVIII,", return_tensors='pt').to(model.device)["input_ids"]
outputs = model.generate(input_ids, max_new_tokens=216)
print(tokenizer.batch_decode(outputs))
Error Details
AttributeError Traceback (most recent call last)
Cell In[13], line 1
----> 1 outputs = model.generate(input_ids, max_new_tokens=216)
2 print(tokenizer.batch_decode(outputs))
3 # ["<|startoftext|>In the recent Super Bowl LVIII, the Kansas City Chiefs emerged victorious, defeating the San Francisco 49ers in a thrilling overtime showdown. The game was a nail-biter, with both teams showcasing their skills and determination.\n\nThe Chiefs, led by their star quarterback Patrick Mahomes, displayed their offensive prowess, while the 49ers, led by their strong defense, put up a tough fight. The game went into overtime, with the Chiefs ultimately securing the win with a touchdown.\n\nThe victory marked the Chiefs' second Super Bowl win in four years, solidifying their status as one of the top teams in the NFL. The game was a testament to the skill and talent of both teams, and a thrilling end to the NFL season.\n\nThe Super Bowl is not just about the game itself, but also about the halftime show and the commercials. This year's halftime show featured a star-studded lineup, including Usher, Alicia Keys, and Lil Jon. The show was a spectacle of music and dance, with the performers delivering an energetic and entertaining performance.\n"]
File ~/.local/lib/python3.11/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/.local/lib/python3.11/site-packages/transformers/generation/utils.py:2024, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
2016 input_ids, model_kwargs = self._expand_inputs_for_generation(
2017 input_ids=input_ids,
2018 expand_size=generation_config.num_return_sequences,
2019 is_encoder_decoder=self.config.is_encoder_decoder,
2020 **model_kwargs,
2021 )
2023 # 13. run sample (it degenerates to greedy search when `generation_config.do_sample=False`)
-> 2024 result = self._sample(
2025 input_ids,
2026 logits_processor=prepared_logits_processor,
2027 logits_warper=prepared_logits_warper,
2028 stopping_criteria=prepared_stopping_criteria,
2029 generation_config=generation_config,
2030 synced_gpus=synced_gpus,
2031 streamer=streamer,
2032 **model_kwargs,
2033 )
2035 elif generation_mode in (GenerationMode.BEAM_SAMPLE, GenerationMode.BEAM_SEARCH):
2036 # 11. prepare logits warper
2037 prepared_logits_warper = (
2038 self._get_logits_warper(generation_config, device=input_ids.device)
2039 if generation_config.do_sample
2040 else None
2041 )
File ~/.local/lib/python3.11/site-packages/transformers/generation/utils.py:2982, in GenerationMixin._sample(self, input_ids, logits_processor, stopping_criteria, generation_config, synced_gpus, streamer, logits_warper, **model_kwargs)
2979 model_inputs.update({"output_hidden_states": output_hidden_states} if output_hidden_states else {})
2981 # forward pass to get next token
-> 2982 outputs = self(**model_inputs, return_dict=True)
2984 if synced_gpus and this_peer_finished:
2985 continue # don't waste resources running the code we don't need
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:166, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
164 output = module._old_forward(*args, **kwargs)
165 else:
--> 166 output = module._old_forward(*args, **kwargs)
167 return module._hf_hook.post_forward(module, output)
File ~/.local/lib/python3.11/site-packages/transformers/models/jamba/modeling_jamba.py:1480, in JambaForCausalLM.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, labels, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position, num_logits_to_keep)
1477 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
1479 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
-> 1480 outputs = self.model(
1481 input_ids=input_ids,
1482 attention_mask=attention_mask,
1483 position_ids=position_ids,
1484 past_key_values=past_key_values,
1485 inputs_embeds=inputs_embeds,
1486 use_cache=use_cache,
1487 output_attentions=output_attentions,
1488 output_hidden_states=output_hidden_states,
1489 output_router_logits=output_router_logits,
1490 cache_position=cache_position,
1491 return_dict=return_dict,
1492 )
1494 hidden_states = outputs[0]
1495 if num_logits_to_keep is None:
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File ~/.local/lib/python3.11/site-packages/transformers/models/jamba/modeling_jamba.py:1302, in JambaModel.forward(self, input_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, output_router_logits, return_dict, cache_position)
1290 layer_outputs = self._gradient_checkpointing_func(
1291 decoder_layer.__call__,
1292 hidden_states,
(...)
1299 cache_position,
1300 )
1301 else:
-> 1302 layer_outputs = decoder_layer(
1303 hidden_states,
1304 attention_mask=causal_mask,
1305 position_ids=position_ids,
1306 past_key_value=past_key_values,
1307 output_attentions=output_attentions,
1308 output_router_logits=output_router_logits,
1309 use_cache=use_cache,
1310 cache_position=cache_position,
1311 )
1313 hidden_states = layer_outputs[0]
1315 if output_attentions:
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1532, in Module._wrapped_call_impl(self, *args, **kwargs)
1530 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1531 else:
-> 1532 return self._call_impl(*args, **kwargs)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1541, in Module._call_impl(self, *args, **kwargs)
1536 # If we don't have any hooks, we want to skip the rest of the logic in
1537 # this function, and just call forward.
1538 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1539 or _global_backward_pre_hooks or _global_backward_hooks
1540 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1541 return forward_call(*args, **kwargs)
1543 try:
1544 result = None
File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:161, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
160 def new_forward(module, *args, **kwargs):
--> 161 args, kwargs = module._hf_hook.pre_forward(module, *args, **kwargs)
162 if module._hf_hook.no_grad:
163 with torch.no_grad():
File ~/.local/lib/python3.11/site-packages/accelerate/hooks.py:356, in AlignDevicesHook.pre_forward(self, module, *args, **kwargs)
345 self.tied_pointers_to_remove.add((value.data_ptr(), self.execution_device))
347 set_module_tensor_to_device(
348 module,
349 name,
(...)
353 tied_params_map=self.tied_params_map,
354 )
--> 356 return send_to_device(args, self.execution_device), send_to_device(
357 kwargs, self.execution_device, skip_keys=self.skip_keys
358 )
File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:186, in send_to_device(tensor, device, non_blocking, skip_keys)
183 elif skip_keys is None:
184 skip_keys = []
185 return type(tensor)(
--> 186 {
187 k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
188 for k, t in tensor.items()
189 }
190 )
191 else:
192 return tensor
File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:187, in <dictcomp>(.0)
183 elif skip_keys is None:
184 skip_keys = []
185 return type(tensor)(
186 {
--> 187 k: t if k in skip_keys else send_to_device(t, device, non_blocking=non_blocking, skip_keys=skip_keys)
188 for k, t in tensor.items()
189 }
190 )
191 else:
192 return tensor
File ~/.local/lib/python3.11/site-packages/accelerate/utils/operations.py:158, in send_to_device(tensor, device, non_blocking, skip_keys)
156 tensor = tensor.cpu()
157 try:
--> 158 return tensor.to(device, non_blocking=non_blocking)
159 except TypeError: # .to() doesn't accept non_blocking as kwarg
160 return tensor.to(device)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1173, in Module.to(self, *args, **kwargs)
1170 else:
1171 raise
-> 1173 return self._apply(convert)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:778, in Module._apply(self, fn, recurse)
776 def _apply(self, fn, recurse=True):
777 if recurse:
--> 778 for module in self.children():
779 module._apply(fn)
781 def compute_should_use_set_data(tensor, tensor_applied):
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:2319, in Module.children(self)
2313 def children(self) -> Iterator['Module']:
2314 r"""Return an iterator over immediate children modules.
2315
2316 Yields:
2317 Module: a child module
2318 """
-> 2319 for name, module in self.named_children():
2320 yield module
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:2337, in Module.named_children(self)
2323 r"""Return an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
2324
2325 Yields:
(...)
2334
2335 """
2336 memo = set()
-> 2337 for name, module in self._modules.items():
2338 if module is not None and module not in memo:
2339 memo.add(module)
File ~/.local/lib/python3.11/site-packages/torch/nn/modules/module.py:1709, in Module.__getattr__(self, name)
1707 if name in modules:
1708 return modules[name]
-> 1709 raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'
Any help or pointers would be greatly appreciated!
Thank you!
Same issue. Any guidance on how to resolve it?
I was stuck on this issue for about 4 hours. The hacky solution is to open "modeling_jamba.py", which can be found in the "transformers" site package, find 'HybridMambaAttentionDynamicCache', and make it inherit from 'nn.Module' like so:
class HybridMambaAttentionDynamicCache(DynamicCache, nn.Module):
Then add the superclass initialization to the '__init__' method:
def __init__(self, config, batch_size, dtype=torch.float16, device=None):
    super(HybridMambaAttentionDynamicCache, self).__init__()
Kind of odd that no-one else had this problem before.
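If you would rather not edit the file in site-packages, roughly the same change can be applied as a monkey-patch at the top of your script. Treat this as an untested sketch: it reuses the '__init__' signature shown above and assumes the Jamba modeling code looks the cache class up through its module at call time.

import torch
import torch.nn as nn
from transformers.models.jamba import modeling_jamba

_OriginalCache = modeling_jamba.HybridMambaAttentionDynamicCache

class _PatchedHybridCache(_OriginalCache, nn.Module):
    # Same cache, but nn.Module.__init__ is guaranteed to run, so attributes
    # like _modules exist before accelerate tries to call .to() on the cache.
    def __init__(self, config, batch_size, dtype=torch.float16, device=None):
        nn.Module.__init__(self)
        _OriginalCache.__init__(self, config, batch_size, dtype=dtype, device=device)

# Rebind the module-level name so the code in modeling_jamba that constructs
# the cache picks up the patched class.
modeling_jamba.HybridMambaAttentionDynamicCache = _PatchedHybridCache

It does the same thing as the in-file edit, just without touching the installed package.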
@legobuzz12 Thanks for figuring it out! Yes, with the suggested changes, the error is gone, but now I have the following error message:
Loading checkpoint shards: 100%|██████████| 21/21 [00:22<00:00]
You are using the default legacy behaviour of the <class 'transformers.models.llama.tokenization_llama_fast.LlamaTokenizerFast'>. This is expected, and simply means that the legacy (previous) behavior will be used so nothing changes for you. If you want to use the new behaviour, set legacy=False. This should only be set if you understand what it means, and thoroughly read the reason why this was added as explained in https://github.com/huggingface/transformers/pull/24565 - if you loaded a llama tokenizer from a GGUF file you can ignore this message.
/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py:316: UserWarning: MatMul8bitLt: inputs will be cast from torch.bfloat16 to float16 during quantization
warnings.warn(f"MatMul8bitLt: inputs will be cast from {A.dtype} to float16 during quantization")
cuBLAS API failed with status 15
A: torch.Size([9, 4096]), B: torch.Size([14336, 4096]), C: (9, 14336); (lda, ldb, ldc): (c_int(288), c_int(458752), c_int(288)); (m, n, k): (c_int(9), c_int(14336), c_int(4096))
error detected
Traceback (most recent call last):
File "/home/tao/run_jamba.py", line 8, in <module>
outputs = model.generate(input_ids, max_new_tokens=216, use_cache=False)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/generation/utils.py", line 2024, in generate
result = self._sample(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/generation/utils.py", line 2982, in _sample
outputs = self(**model_inputs, return_dict=True)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1481, in forward
outputs = self.model(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1303, in forward
layer_outputs = decoder_layer(
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 1051, in forward
ff_outputs = self.feed_forward(hidden_states)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/transformers/models/jamba/modeling_jamba.py", line 845, in forward
return self.down_proj(self.act_fn(self.gate_proj(hidden_state)) * self.up_proj(hidden_state))
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/accelerate/hooks.py", line 169, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/nn/modules.py", line 817, in forward
out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 556, in matmul
return MatMul8bitLt.apply(A, B, out, bias, state)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/torch/autograd/function.py", line 574, in apply
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/autograd/_functions.py", line 395, in forward
out32, Sout32 = F.igemmlt(C32A, state.CxB, SA, state.SB)
File "/home/tao/miniforge3/envs/jamba_test/lib/python3.10/site-packages/bitsandbytes/functional.py", line 2341, in igemmlt
raise Exception("cublasLt ran into an error!")
Exception: cublasLt ran into an error!
I am wondering if it is related to the PyTorch or CUDA version. I am using torch 2.4.0 and CUDA 12.2.
I was using CUDA 12.2 as well. The error you got makes me think it is a GPU-specific issue, because this model tends not to work on older GPUs. I had an A100 in my setup and ran into no CUDA errors.
Also, I would suggest using 'use_mamba_kernels=True' if it lets you.
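For example, since use_mamba_kernels defaults to True in JambaConfig, dropping the config override from the snippet at the top of the thread should be enough, provided the mamba-ssm and causal-conv1d packages are installed and the model is on a CUDA device. This is only an illustrative sketch of the loading call, not a tested fix for the cublasLt error:

import torch
from transformers import AutoModelForCausalLM

# use_mamba_kernels defaults to True; the optimized kernels require the
# mamba-ssm and causal-conv1d packages and a CUDA device.
model = AutoModelForCausalLM.from_pretrained(
    "ai21labs/Jamba-v0.1",
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)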
This issue should be resolved in https://github.com/huggingface/transformers/releases/tag/v4.44.2, please try
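(A quick way to check which version the script actually picks up after upgrading:)

import transformers
print(transformers.__version__)  # the fix needs 4.44.2 or later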
@legobuzz12 @michael-go Thank you both. 'use_mamba_kernels=True' is not recognized by the model.generate function. Updating the transformers library didn't resolve the cublasLt error either. I will try with an A100.
@taozhang9527 My comment about upgrading transformers was referring to the original message in this thread, the AttributeError: 'HybridMambaAttentionDynamicCache' object has no attribute '_modules'.
The cublasLt error in your case is thrown from bitsandbytes, so it may be related to quantization. There is also autograd in the stack trace - are you trying to train or generate?
I would suggest opening a separate thread about it and providing a code snippet that reproduces it.