Error when dispatching model across multiple GPUs
When calling
outputs = model.generate(**inputs, **gen_kwargs)
I encounter the following error:
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:63, in _run_priority_list(name, priority_list, inp)
61 for op, not_supported in zip(priority_list, not_supported_reasons):
62 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
query : shape=(1, 1226, 16, 112) (torch.bfloat16)
key : shape=(1, 1226, 16, 112) (torch.bfloat16)
value : shape=(1, 1226, 16, 112) (torch.bfloat16)
attn_bias : <class 'NoneType'>
p : 0.0
`decoderF` is not supported because:
attn_bias type is <class 'NoneType'>
bf16 is only supported on A100+ GPUs
`flshattF@v2.3.2` is not supported because:
requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
bf16 is only supported on A100+ GPUs
`tritonflashattF` is not supported because:
requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
bf16 is only supported on A100+ GPUs
operator wasn't built - see `python -m xformers.info` for more info
triton is not available
requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
bf16 is only supported on A100+ GPUs
`smallkF` is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
dtype=torch.bfloat16 (supported: {torch.float32})
has custom scale
bf16 is only supported on A100+ GPUs
unsupported embed per head: 112
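For reference, the compute capability and bf16 support mentioned in that message can be confirmed directly from PyTorch (a quick sanity check, independent of CogVLM or xformers):

import torch

# Confirm what the xformers error reports about this GPU.
print(torch.cuda.get_device_name(0))        # e.g. Tesla V100-SXM2-16GB
print(torch.cuda.get_device_capability(0))  # (7, 0) on V100
print(torch.cuda.is_bf16_supported())       # whether bf16 kernels are usable here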
Here is the code I am executing:
import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained(
        'THUDM/cogvlm-chat-hf',
        torch_dtype=torch.bfloat16,
        low_cpu_mem_usage=True,
        trust_remote_code=True,
    )

model = load_checkpoint_and_dispatch(
    model,
    '/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730',
    device_map="auto",
    no_split_module_classes=['CogVLMDecoderLayer'],
)
model = model.eval()

query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))
And here is the output when I run
$ python -m xformers.info
xFormers 0.0.22.post7
memory_efficient_attention.cutlassF: available
memory_efficient_attention.cutlassB: available
memory_efficient_attention.decoderF: available
memory_efficient_attention.flshattF@v2.3.2: available
memory_efficient_attention.flshattB@v2.3.2: available
memory_efficient_attention.smallkF: available
memory_efficient_attention.smallkB: available
memory_efficient_attention.tritonflashattF: unavailable
memory_efficient_attention.tritonflashattB: unavailable
memory_efficient_attention.triton_splitKF: available
indexing.scaled_index_addF: available
indexing.scaled_index_addB: available
indexing.index_select: available
swiglu.dual_gemm_silu: available
swiglu.gemm_fused_operand_sum: available
swiglu.fused.p.cpp: available
is_triton_available: True
pytorch.version: 2.1.0+cu121
pytorch.cuda: available
gpu.compute_capability: 7.0
gpu.name: Tesla V100-SXM2-16GB
build.info: available
build.cuda_version: 1201
build.python_version: 3.10.13
build.torch_version: 2.1.0+cu121
build.env.TORCH_CUDA_ARCH_LIST: 5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE: Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS: None
build.env.NVCC_FLAGS: None
build.env.XFORMERS_PACKAGE_FROM: wheel-v0.0.22.post7
build.nvcc_version: 12.1.66
source.privacy: open source
I also confirmed there were no dependency issues:
$ pip check
No broken requirements found.
I just discovered and read https://huggingface.co/THUDM/cogvlm-chat-hf/discussions/2#655ca4de93ac084b7d802358. @chenkq, where can I find the model checkpoint (ckpt)? I don't see it in the local path where I thought it would be:
$ pwd
/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730
$ ls -l
total 0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 config.json -> ../../blobs/8b5a3c26659a087d6f41701ef63c8b094d36cad0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 configuration_cogvlm.py -> ../../blobs/60d487ac530bb1aa6e13828e2a17ae2f521b02d1
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:31 generation_config.json -> ../../blobs/a4fee64c3bea2cc2488999c0611a0602930b7e93
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:21 model-00001-of-00008.safetensors -> ../../blobs/e29f6ec471ca55789ab14947b527729b9c30313ceb1e7726590b85f9f6406cca
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00002-of-00008.safetensors -> ../../blobs/e82356882701b1a778408f31e676d17c2aff799c543e8596ed74bc805b4a1213
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00003-of-00008.safetensors -> ../../blobs/04096f84f42798d0c89319ff8254995a2a3512c16ec88dfd078ce421867d92ec
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00004-of-00008.safetensors -> ../../blobs/2b42af0bb16647959b3e55def4b3c66ab8c3a25fd948a5245c81d070f2b4313d
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:23 model-00005-of-00008.safetensors -> ../../blobs/38c07825790e055dd169376479994a58a4f59775ba7cf31d5ca25d8a465e7b0c
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:23 model-00006-of-00008.safetensors -> ../../blobs/d01880ca5677e69a5f8632f9dda62814f0c549b5a40d4f7e136065e5d64c1a7d
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:31 model-00007-of-00008.safetensors -> ../../blobs/e70b0e10d2ac8800e69e514b6a9b04ac28cd7db43985ce62daa4e0e639b4e5ba
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:31 model-00008-of-00008.safetensors -> ../../blobs/a756381ef65b92af7f1fb97da3d59cb04586080982de86d76805299898223294
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 modeling_cogvlm.py -> ../../blobs/4a7a748e2dbf349e260cd50771197cefc99c68b0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:21 model.safetensors.index.json -> ../../blobs/197b878b0e9f9b1ffb97a7c4da1fb834fffcaf97
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 25 15:30 README.md -> ../../blobs/77cb0e1cf17f1a8f70e26cb86af0ed7f44ec1952
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:12 util.py -> ../../blobs/1dccacad2ded4c357ab9cb23d04027500256e281
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:12 visual.py -> ../../blobs/367e2fb1d19ff3b5c43fb898b7d15e436bbbf7cc
Based on the call stack you provided, I suspect there might be an issue with how the visual model is dispatched. You could try changing no_split_module_classes=['CogVLMDecoderLayer'] to no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer']. This adjustment should keep the vision transformer layers from being fragmented excessively, thereby reducing the need to wrap things like xformers.ops (I guess); see the sketch below.
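A minimal sketch of the adjusted call, reusing the checkpoint path from your script (only the no_split_module_classes argument changes):

from accelerate import load_checkpoint_and_dispatch

model = load_checkpoint_and_dispatch(
    model,
    '/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730',
    device_map="auto",
    # keep both the language decoder layers and the vision transformer layers
    # whole, so accelerate does not split them across devices
    no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer'],
)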
Regarding your question about where to find the model checkpoint (ckpt): in your case it is the local snapshot path models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730. The files model-0000x-of-00008.safetensors together with model.safetensors.index.json constitute the checkpoint; in your listing they are symbolic links pointing to the actual files under blobs.
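If you prefer not to hard-code the snapshot hash, here is a small sketch (assuming huggingface_hub is available in your environment) that resolves the local snapshot directory programmatically:

from huggingface_hub import snapshot_download

# Returns the local snapshot directory that holds the *.safetensors shards
# and model.safetensors.index.json, downloading them first if necessary.
ckpt_path = snapshot_download('THUDM/cogvlm-chat-hf')
print(ckpt_path)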
@chenkq I made the change you suggested, but I am still encountering the error reported above. Any other ideas?
Here is the output of model.hf_device_map after modifying no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer']:
{'model.embed_tokens': 0,
'model.layers.0': 0,
'model.layers.1': 0,
'model.layers.2': 0,
'model.layers.3': 0,
'model.layers.4': 0,
'model.layers.5': 0,
'model.layers.6': 0,
'model.layers.7': 0,
'model.layers.8': 0,
'model.layers.9': 0,
'model.layers.10': 1,
'model.layers.11': 1,
'model.layers.12': 1,
'model.layers.13': 1,
'model.layers.14': 1,
'model.layers.15': 1,
'model.layers.16': 1,
'model.layers.17': 1,
'model.layers.18': 1,
'model.layers.19': 1,
'model.layers.20': 1,
'model.layers.21': 1,
'model.layers.22': 2,
'model.layers.23': 2,
'model.layers.24': 2,
'model.layers.25': 2,
'model.layers.26': 2,
'model.layers.27': 2,
'model.layers.28': 2,
'model.layers.29': 2,
'model.layers.30': 2,
'model.layers.31': 2,
'model.norm': 2,
'model.vision.boi': 2,
'model.vision.eoi': 2,
'model.vision.patch_embedding': 2,
'model.vision.transformer.layers.0': 2,
'model.vision.transformer.layers.1': 2,
'model.vision.transformer.layers.2': 2,
'model.vision.transformer.layers.3': 2,
'model.vision.transformer.layers.4': 2,
'model.vision.transformer.layers.5': 2,
'model.vision.transformer.layers.6': 2,
'model.vision.transformer.layers.7': 2,
'model.vision.transformer.layers.8': 2,
'model.vision.transformer.layers.9': 2,
'model.vision.transformer.layers.10': 2,
'model.vision.transformer.layers.11': 2,
'model.vision.transformer.layers.12': 3,
'model.vision.transformer.layers.13': 3,
'model.vision.transformer.layers.14': 3,
'model.vision.transformer.layers.15': 3,
'model.vision.transformer.layers.16': 3,
'model.vision.transformer.layers.17': 3,
'model.vision.transformer.layers.18': 3,
'model.vision.transformer.layers.19': 3,
'model.vision.transformer.layers.20': 3,
'model.vision.transformer.layers.21': 3,
'model.vision.transformer.layers.22': 3,
'model.vision.transformer.layers.23': 3,
'model.vision.transformer.layers.24': 3,
'model.vision.transformer.layers.25': 3,
'model.vision.transformer.layers.26': 3,
'model.vision.transformer.layers.27': 3,
'model.vision.transformer.layers.28': 3,
'model.vision.transformer.layers.29': 3,
'model.vision.transformer.layers.30': 3,
'model.vision.transformer.layers.31': 3,
'model.vision.transformer.layers.32': 3,
'model.vision.transformer.layers.33': 3,
'model.vision.transformer.layers.34': 3,
'model.vision.transformer.layers.35': 3,
'model.vision.transformer.layers.36': 3,
'model.vision.transformer.layers.37': 3,
'model.vision.transformer.layers.38': 3,
'model.vision.transformer.layers.39': 3,
'model.vision.transformer.layers.40': 3,
'model.vision.transformer.layers.41': 3,
'model.vision.transformer.layers.42': 3,
'model.vision.transformer.layers.43': 3,
'model.vision.transformer.layers.44': 3,
'model.vision.transformer.layers.45': 3,
'model.vision.transformer.layers.46': 3,
'model.vision.transformer.layers.47': 3,
'model.vision.transformer.layers.48': 3,
'model.vision.transformer.layers.49': 3,
'model.vision.transformer.layers.50': 3,
'model.vision.transformer.layers.51': 3,
'model.vision.transformer.layers.52': 3,
'model.vision.transformer.layers.53': 3,
'model.vision.transformer.layers.54': 3,
'model.vision.transformer.layers.55': 3,
'model.vision.transformer.layers.56': 3,
'model.vision.transformer.layers.57': 3,
'model.vision.transformer.layers.58': 3,
'model.vision.transformer.layers.59': 3,
'model.vision.transformer.layers.60': 3,
'model.vision.transformer.layers.61': 3,
'model.vision.transformer.layers.62': 3,
'model.vision.linear_proj': 3,
'lm_head': 3}
@chenkq, here is the full stack trace that occurs when running outputs = model.generate(**inputs, **gen_kwargs). Thank you 🙏
---------------------------------------------------------------------------
NotImplementedError Traceback (most recent call last)
Cell In[5], line 2
1 with torch.no_grad():
----> 2 outputs = model.generate(**inputs, **gen_kwargs)
3 outputs = outputs[:, inputs['input_ids'].shape[1]:]
4 print(tokenizer.decode(outputs[0]))
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
112 @functools.wraps(func)
113 def decorate_context(*args, **kwargs):
114 with ctx_factory():
--> 115 return func(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/transformers/generation/utils.py:1673, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
1656 return self.assisted_decoding(
1657 input_ids,
1658 assistant_model=assistant_model,
(...)
1669 **model_kwargs,
1670 )
1671 if generation_mode == GenerationMode.GREEDY_SEARCH:
1672 # 11. run greedy search
-> 1673 return self.greedy_search(
1674 input_ids,
1675 logits_processor=logits_processor,
1676 stopping_criteria=stopping_criteria,
1677 pad_token_id=generation_config.pad_token_id,
1678 eos_token_id=generation_config.eos_token_id,
1679 output_scores=generation_config.output_scores,
1680 return_dict_in_generate=generation_config.return_dict_in_generate,
1681 synced_gpus=synced_gpus,
1682 streamer=streamer,
1683 **model_kwargs,
1684 )
1686 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
1687 if not model_kwargs["use_cache"]:
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/transformers/generation/utils.py:2521, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
2518 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
2520 # forward pass to get next token
-> 2521 outputs = self(
2522 **model_inputs,
2523 return_dict=True,
2524 output_attentions=output_attentions,
2525 output_hidden_states=output_hidden_states,
2526 )
2528 if synced_gpus and this_peer_finished:
2529 continue # don't waste resources running the code we don't need
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/accelerate/hooks.py:164, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
162 output = module._old_forward(*args, **kwargs)
163 else:
--> 164 output = module._old_forward(*args, **kwargs)
165 return module._hf_hook.post_forward(module, output)
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:610, in CogVLMForCausalLM.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, labels)
607 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
609 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 610 outputs = self.model(
611 input_ids=input_ids,
612 images=images,
613 token_type_ids=token_type_ids,
614 attention_mask=attention_mask,
615 position_ids=position_ids,
616 past_key_values=past_key_values,
617 inputs_embeds=inputs_embeds,
618 use_cache=use_cache,
619 output_attentions=output_attentions,
620 output_hidden_states=output_hidden_states,
621 return_dict=return_dict,
622 )
624 hidden_states = outputs[0]
625 logits = self.lm_head(hidden_states)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:378, in CogVLMModel.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
376 assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
377 inputs_embeds = self.embed_tokens(input_ids)
--> 378 images_features = self.encode_images(images)
379 images_features = rearrange(images_features, 'b n d -> (b n) d')
380 images_features = images_features.to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:350, in CogVLMModel.encode_images(self, images)
347 images.append(image)
349 images = torch.stack(images)
--> 350 images_features = self.vision(images)
351 return images_features
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:129, in EVA2CLIPModel.forward(self, images)
127 def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
128 x = self.patch_embedding(images)
--> 129 x = self.transformer(x)
130 x = x[:, 1:]
131 x = self.linear_proj(x)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:94, in Transformer.forward(self, hidden_states)
92 def forward(self, hidden_states):
93 for layer_module in self.layers:
---> 94 hidden_states = layer_module(hidden_states)
95 return hidden_states
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:79, in TransformerLayer.forward(self, hidden_states)
77 def forward(self, hidden_states):
78 attention_input = hidden_states
---> 79 attention_output = self.input_layernorm(self.attention(attention_input))
80 hidden_states = attention_input + attention_output
81 mlp_input = hidden_states
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
1516 return self._compiled_call_impl(*args, **kwargs) # type: ignore[misc]
1517 else:
-> 1518 return self._call_impl(*args, **kwargs)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
1522 # If we don't have any hooks, we want to skip the rest of the logic in
1523 # this function, and just call forward.
1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
1525 or _global_backward_pre_hooks or _global_backward_hooks
1526 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527 return forward_call(*args, **kwargs)
1529 try:
1530 result = None
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/accelerate/hooks.py:164, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
162 output = module._old_forward(*args, **kwargs)
163 else:
--> 164 output = module._old_forward(*args, **kwargs)
165 return module._hf_hook.post_forward(module, output)
File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:40, in Attention.forward(self, x)
37 qkv = qkv.reshape(B, L, 3, self.num_heads, -1).permute(2, 0, 1, 3, 4) # 3, B, L, H, D
38 q, k, v = qkv[0], qkv[1], qkv[2]
---> 40 out = xops.memory_efficient_attention(
41 q, k, v, scale=self.scale,
42 )
43 output = self.dense(out.view(B, L, -1))
44 output = self.output_dropout(output)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:223, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
116 def memory_efficient_attention(
117 query: torch.Tensor,
118 key: torch.Tensor,
(...)
124 op: Optional[AttentionOp] = None,
125 ) -> torch.Tensor:
126 """Implements the memory-efficient attention mechanism following
127 `"Self-Attention Does Not Need O(n^2) Memory" <http://arxiv.org/abs/2112.05682>`_.
128
(...)
221 :return: multi-head attention Tensor with shape ``[B, Mq, H, Kv]``
222 """
--> 223 return _memory_efficient_attention(
224 Inputs(
225 query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
226 ),
227 op=op,
228 )
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:321, in _memory_efficient_attention(inp, op)
316 def _memory_efficient_attention(
317 inp: Inputs, op: Optional[AttentionOp] = None
318 ) -> torch.Tensor:
319 # fast-path that doesn't require computing the logsumexp for backward computation
320 if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 321 return _memory_efficient_attention_forward(
322 inp, op=op[0] if op is not None else None
323 )
325 output_shape = inp.normalize_bmhk()
326 return _fMHA.apply(
327 op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
328 ).reshape(output_shape)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:337, in _memory_efficient_attention_forward(inp, op)
335 output_shape = inp.normalize_bmhk()
336 if op is None:
--> 337 op = _dispatch_fw(inp, False)
338 else:
339 _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:120, in _dispatch_fw(inp, needs_gradient)
111 def _dispatch_fw(inp: Inputs, needs_gradient: bool) -> Type[AttentionFwOpBase]:
112 """Computes the best operator for forward
113
114 Raises:
(...)
118 AttentionOp: The best operator for the configuration
119 """
--> 120 return _run_priority_list(
121 "memory_efficient_attention_forward",
122 _dispatch_fw_priority_list(inp, needs_gradient),
123 inp,
124 )
File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:63, in _run_priority_list(name, priority_list, inp)
61 for op, not_supported in zip(priority_list, not_supported_reasons):
62 msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
query : shape=(1, 1226, 16, 112) (torch.bfloat16)
key : shape=(1, 1226, 16, 112) (torch.bfloat16)
value : shape=(1, 1226, 16, 112) (torch.bfloat16)
attn_bias : <class 'NoneType'>
p : 0.0
`decoderF` is not supported because:
attn_bias type is <class 'NoneType'>
bf16 is only supported on A100+ GPUs
`flshattF@v2.3.2` is not supported because:
requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
bf16 is only supported on A100+ GPUs
`tritonflashattF` is not supported because:
requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
bf16 is only supported on A100+ GPUs
operator wasn't built - see `python -m xformers.info` for more info
triton is not available
requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
bf16 is only supported on A100+ GPUs
`smallkF` is not supported because:
max(query.shape[-1] != value.shape[-1]) > 32
dtype=torch.bfloat16 (supported: {torch.float32})
has custom scale
bf16 is only supported on A100+ GPUs
unsupported embed per head: 112
Oops, V100 does not support bf16; try fp16 instead.
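A minimal sketch of the change, assuming the rest of your script stays as posted (only the two bfloat16 dtypes switch to float16):

import torch
from transformers import AutoModelForCausalLM

# Load the weights in fp16 instead of bf16; V100 (compute capability 7.0)
# has no bf16 support, but fp16 works.
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True,
)

# ...and cast the image tensor to fp16 as well when building the inputs:
# 'images': [[inputs['images'][0].to('cuda').to(torch.float16)]],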
Thanks @chenkq, that seems to have been my issue! I switched to NVIDIA A10G GPUs, and I am now able to use bf16.