Error when dispatching model into multiple GPUs

#4 · opened by justinwickett

When calling

outputs = model.generate(**inputs, **gen_kwargs)

I encounter the following error:

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:63, in _run_priority_list(name, priority_list, inp)
     61 for op, not_supported in zip(priority_list, not_supported_reasons):
     62     msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)

NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 1226, 16, 112) (torch.bfloat16)
     key         : shape=(1, 1226, 16, 112) (torch.bfloat16)
     value       : shape=(1, 1226, 16, 112) (torch.bfloat16)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`decoderF` is not supported because:
    attn_bias type is <class 'NoneType'>
    bf16 is only supported on A100+ GPUs
`flshattF@v2.3.2` is not supported because:
    requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
    bf16 is only supported on A100+ GPUs
`tritonflashattF` is not supported because:
    requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
    bf16 is only supported on A100+ GPUs
    operator wasn't built - see `python -m xformers.info` for more info
    triton is not available
    requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
    Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
    bf16 is only supported on A100+ GPUs
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    dtype=torch.bfloat16 (supported: {torch.float32})
    has custom scale
    bf16 is only supported on A100+ GPUs
    unsupported embed per head: 112

Here is the code I am executing:

import torch
import requests
from PIL import Image
from transformers import AutoModelForCausalLM, LlamaTokenizer
from accelerate import init_empty_weights, infer_auto_device_map, load_checkpoint_and_dispatch

tokenizer = LlamaTokenizer.from_pretrained('lmsys/vicuna-7b-v1.5')

with init_empty_weights():
    model = AutoModelForCausalLM.from_pretrained('THUDM/cogvlm-chat-hf',
                                            torch_dtype=torch.bfloat16,
                                            low_cpu_mem_usage=True,
                                            trust_remote_code=True
                                           )
model = load_checkpoint_and_dispatch(
    model,
    '/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730',
    device_map="auto",
    no_split_module_classes=['CogVLMDecoderLayer']
)
model = model.eval()

query = 'Describe this image'
image = Image.open(requests.get('https://github.com/THUDM/CogVLM/blob/main/examples/1.png?raw=true', stream=True).raw).convert('RGB')
inputs = model.build_conversation_input_ids(tokenizer, query=query, history=[], images=[image])  # chat mode
inputs = {
    'input_ids': inputs['input_ids'].unsqueeze(0).to('cuda'),
    'token_type_ids': inputs['token_type_ids'].unsqueeze(0).to('cuda'),
    'attention_mask': inputs['attention_mask'].unsqueeze(0).to('cuda'),
    'images': [[inputs['images'][0].to('cuda').to(torch.bfloat16)]],
}
gen_kwargs = {"max_length": 2048, "do_sample": False}

with torch.no_grad():
    outputs = model.generate(**inputs, **gen_kwargs)
    outputs = outputs[:, inputs['input_ids'].shape[1]:]
    print(tokenizer.decode(outputs[0]))

And here is the output when I run

$ python -m xformers.info
xFormers 0.0.22.post7
memory_efficient_attention.cutlassF:               available
memory_efficient_attention.cutlassB:               available
memory_efficient_attention.decoderF:               available
memory_efficient_attention.flshattF@v2.3.2:        available
memory_efficient_attention.flshattB@v2.3.2:        available
memory_efficient_attention.smallkF:                available
memory_efficient_attention.smallkB:                available
memory_efficient_attention.tritonflashattF:        unavailable
memory_efficient_attention.tritonflashattB:        unavailable
memory_efficient_attention.triton_splitKF:         available
indexing.scaled_index_addF:                        available
indexing.scaled_index_addB:                        available
indexing.index_select:                             available
swiglu.dual_gemm_silu:                             available
swiglu.gemm_fused_operand_sum:                     available
swiglu.fused.p.cpp:                                available
is_triton_available:                               True
pytorch.version:                                   2.1.0+cu121
pytorch.cuda:                                      available
gpu.compute_capability:                            7.0
gpu.name:                                          Tesla V100-SXM2-16GB
build.info:                                        available
build.cuda_version:                                1201
build.python_version:                              3.10.13
build.torch_version:                               2.1.0+cu121
build.env.TORCH_CUDA_ARCH_LIST:                    5.0+PTX 6.0 6.1 7.0 7.5 8.0+PTX 9.0
build.env.XFORMERS_BUILD_TYPE:                     Release
build.env.XFORMERS_ENABLE_DEBUG_ASSERTIONS:        None
build.env.NVCC_FLAGS:                              None
build.env.XFORMERS_PACKAGE_FROM:                   wheel-v0.0.22.post7
build.nvcc_version:                                12.1.66
source.privacy:                                    open source

I also confirmed there were no dependency issues:

$ pip check
No broken requirements found.

I just discovered and read https://huggingface.co/THUDM/cogvlm-chat-hf/discussions/2#655ca4de93ac084b7d802358. @chenkq, where can I find the model ckpt? I don't see it in the local path where I thought it would be:

$ pwd
/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730

$ ls -l
total 0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 config.json -> ../../blobs/8b5a3c26659a087d6f41701ef63c8b094d36cad0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 configuration_cogvlm.py -> ../../blobs/60d487ac530bb1aa6e13828e2a17ae2f521b02d1
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:31 generation_config.json -> ../../blobs/a4fee64c3bea2cc2488999c0611a0602930b7e93
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:21 model-00001-of-00008.safetensors -> ../../blobs/e29f6ec471ca55789ab14947b527729b9c30313ceb1e7726590b85f9f6406cca
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00002-of-00008.safetensors -> ../../blobs/e82356882701b1a778408f31e676d17c2aff799c543e8596ed74bc805b4a1213
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00003-of-00008.safetensors -> ../../blobs/04096f84f42798d0c89319ff8254995a2a3512c16ec88dfd078ce421867d92ec
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:22 model-00004-of-00008.safetensors -> ../../blobs/2b42af0bb16647959b3e55def4b3c66ab8c3a25fd948a5245c81d070f2b4313d
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:23 model-00005-of-00008.safetensors -> ../../blobs/38c07825790e055dd169376479994a58a4f59775ba7cf31d5ca25d8a465e7b0c
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:23 model-00006-of-00008.safetensors -> ../../blobs/d01880ca5677e69a5f8632f9dda62814f0c549b5a40d4f7e136065e5d64c1a7d
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:31 model-00007-of-00008.safetensors -> ../../blobs/e70b0e10d2ac8800e69e514b6a9b04ac28cd7db43985ce62daa4e0e639b4e5ba
lrwxrwxrwx 1 ec2-user ec2-user 76 Nov 24 23:31 model-00008-of-00008.safetensors -> ../../blobs/a756381ef65b92af7f1fb97da3d59cb04586080982de86d76805299898223294
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:11 modeling_cogvlm.py -> ../../blobs/4a7a748e2dbf349e260cd50771197cefc99c68b0
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:21 model.safetensors.index.json -> ../../blobs/197b878b0e9f9b1ffb97a7c4da1fb834fffcaf97
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 25 15:30 README.md -> ../../blobs/77cb0e1cf17f1a8f70e26cb86af0ed7f44ec1952
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:12 util.py -> ../../blobs/1dccacad2ded4c357ab9cb23d04027500256e281
lrwxrwxrwx 1 ec2-user ec2-user 52 Nov 24 23:12 visual.py -> ../../blobs/367e2fb1d19ff3b5c43fb898b7d15e436bbbf7cc
Reply from chenkq (Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org):

Based on the call stack you provided, I suspect there might be an issue with how the visual model is being dispatched. You could try changing no_split_module_classes=['CogVLMDecoderLayer'] to no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer']. This should keep the model from being fragmented too finely across devices, reducing the chance that accelerate's hooks end up wrapping something like the xformers.ops call (I'm guessing here).
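For concreteness, here is a minimal sketch of the adjusted dispatch call; the rest of your loading code stays the same:

# Also keep the vision TransformerLayer blocks intact so accelerate does not
# split them (and their xformers attention) across devices.
model = load_checkpoint_and_dispatch(
    model,
    '/home/ec2-user/.cache/huggingface/hub/models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730',
    device_map="auto",
    no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer']
)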

Regarding your question, "Where can I find the model checkpoint (ckpt)?": in your case the local path is models--THUDM--cogvlm-chat-hf/snapshots/8abca878c4257412c4c38eeafaed3fe27a036730. The files model-0000x-of-00008.safetensors together with model.safetensors.index.json constitute the checkpoint; here they are symbolic links pointing to files in blobs.
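If you prefer not to hard-code the snapshot hash, huggingface_hub can resolve the local snapshot directory for you. A small sketch (assuming the model is already in your cache; otherwise it will be downloaded):

from huggingface_hub import snapshot_download

# Returns the local snapshot directory that holds the *.safetensors shards
# and model.safetensors.index.json (symlinks into ../../blobs).
ckpt_dir = snapshot_download('THUDM/cogvlm-chat-hf')
print(ckpt_dir)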

@chenkq I made the change you suggested, but I am still encountering the error reported above. Any other ideas?

Here is the output when I run model.hf_device_map after modifying no_split_module_classes=['CogVLMDecoderLayer', 'TransformerLayer']:

{'model.embed_tokens': 0,
 'model.layers.0': 0,
 'model.layers.1': 0,
 'model.layers.2': 0,
 'model.layers.3': 0,
 'model.layers.4': 0,
 'model.layers.5': 0,
 'model.layers.6': 0,
 'model.layers.7': 0,
 'model.layers.8': 0,
 'model.layers.9': 0,
 'model.layers.10': 1,
 'model.layers.11': 1,
 'model.layers.12': 1,
 'model.layers.13': 1,
 'model.layers.14': 1,
 'model.layers.15': 1,
 'model.layers.16': 1,
 'model.layers.17': 1,
 'model.layers.18': 1,
 'model.layers.19': 1,
 'model.layers.20': 1,
 'model.layers.21': 1,
 'model.layers.22': 2,
 'model.layers.23': 2,
 'model.layers.24': 2,
 'model.layers.25': 2,
 'model.layers.26': 2,
 'model.layers.27': 2,
 'model.layers.28': 2,
 'model.layers.29': 2,
 'model.layers.30': 2,
 'model.layers.31': 2,
 'model.norm': 2,
 'model.vision.boi': 2,
 'model.vision.eoi': 2,
 'model.vision.patch_embedding': 2,
 'model.vision.transformer.layers.0': 2,
 'model.vision.transformer.layers.1': 2,
 'model.vision.transformer.layers.2': 2,
 'model.vision.transformer.layers.3': 2,
 'model.vision.transformer.layers.4': 2,
 'model.vision.transformer.layers.5': 2,
 'model.vision.transformer.layers.6': 2,
 'model.vision.transformer.layers.7': 2,
 'model.vision.transformer.layers.8': 2,
 'model.vision.transformer.layers.9': 2,
 'model.vision.transformer.layers.10': 2,
 'model.vision.transformer.layers.11': 2,
 'model.vision.transformer.layers.12': 3,
 'model.vision.transformer.layers.13': 3,
 'model.vision.transformer.layers.14': 3,
 'model.vision.transformer.layers.15': 3,
 'model.vision.transformer.layers.16': 3,
 'model.vision.transformer.layers.17': 3,
 'model.vision.transformer.layers.18': 3,
 'model.vision.transformer.layers.19': 3,
 'model.vision.transformer.layers.20': 3,
 'model.vision.transformer.layers.21': 3,
 'model.vision.transformer.layers.22': 3,
 'model.vision.transformer.layers.23': 3,
 'model.vision.transformer.layers.24': 3,
 'model.vision.transformer.layers.25': 3,
 'model.vision.transformer.layers.26': 3,
 'model.vision.transformer.layers.27': 3,
 'model.vision.transformer.layers.28': 3,
 'model.vision.transformer.layers.29': 3,
 'model.vision.transformer.layers.30': 3,
 'model.vision.transformer.layers.31': 3,
 'model.vision.transformer.layers.32': 3,
 'model.vision.transformer.layers.33': 3,
 'model.vision.transformer.layers.34': 3,
 'model.vision.transformer.layers.35': 3,
 'model.vision.transformer.layers.36': 3,
 'model.vision.transformer.layers.37': 3,
 'model.vision.transformer.layers.38': 3,
 'model.vision.transformer.layers.39': 3,
 'model.vision.transformer.layers.40': 3,
 'model.vision.transformer.layers.41': 3,
 'model.vision.transformer.layers.42': 3,
 'model.vision.transformer.layers.43': 3,
 'model.vision.transformer.layers.44': 3,
 'model.vision.transformer.layers.45': 3,
 'model.vision.transformer.layers.46': 3,
 'model.vision.transformer.layers.47': 3,
 'model.vision.transformer.layers.48': 3,
 'model.vision.transformer.layers.49': 3,
 'model.vision.transformer.layers.50': 3,
 'model.vision.transformer.layers.51': 3,
 'model.vision.transformer.layers.52': 3,
 'model.vision.transformer.layers.53': 3,
 'model.vision.transformer.layers.54': 3,
 'model.vision.transformer.layers.55': 3,
 'model.vision.transformer.layers.56': 3,
 'model.vision.transformer.layers.57': 3,
 'model.vision.transformer.layers.58': 3,
 'model.vision.transformer.layers.59': 3,
 'model.vision.transformer.layers.60': 3,
 'model.vision.transformer.layers.61': 3,
 'model.vision.transformer.layers.62': 3,
 'model.vision.linear_proj': 3,
 'lm_head': 3}

@chenkq here is the full stack trace that occurs when running outputs = model.generate(**inputs, **gen_kwargs). Thank you 🙏

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
Cell In[5], line 2
      1 with torch.no_grad():
----> 2     outputs = model.generate(**inputs, **gen_kwargs)
      3     outputs = outputs[:, inputs['input_ids'].shape[1]:]
      4     print(tokenizer.decode(outputs[0]))

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/utils/_contextlib.py:115, in context_decorator.<locals>.decorate_context(*args, **kwargs)
    112 @functools.wraps(func)
    113 def decorate_context(*args, **kwargs):
    114     with ctx_factory():
--> 115         return func(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/transformers/generation/utils.py:1673, in GenerationMixin.generate(self, inputs, generation_config, logits_processor, stopping_criteria, prefix_allowed_tokens_fn, synced_gpus, assistant_model, streamer, negative_prompt_ids, negative_prompt_attention_mask, **kwargs)
   1656     return self.assisted_decoding(
   1657         input_ids,
   1658         assistant_model=assistant_model,
   (...)
   1669         **model_kwargs,
   1670     )
   1671 if generation_mode == GenerationMode.GREEDY_SEARCH:
   1672     # 11. run greedy search
-> 1673     return self.greedy_search(
   1674         input_ids,
   1675         logits_processor=logits_processor,
   1676         stopping_criteria=stopping_criteria,
   1677         pad_token_id=generation_config.pad_token_id,
   1678         eos_token_id=generation_config.eos_token_id,
   1679         output_scores=generation_config.output_scores,
   1680         return_dict_in_generate=generation_config.return_dict_in_generate,
   1681         synced_gpus=synced_gpus,
   1682         streamer=streamer,
   1683         **model_kwargs,
   1684     )
   1686 elif generation_mode == GenerationMode.CONTRASTIVE_SEARCH:
   1687     if not model_kwargs["use_cache"]:

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/transformers/generation/utils.py:2521, in GenerationMixin.greedy_search(self, input_ids, logits_processor, stopping_criteria, max_length, pad_token_id, eos_token_id, output_attentions, output_hidden_states, output_scores, return_dict_in_generate, synced_gpus, streamer, **model_kwargs)
   2518 model_inputs = self.prepare_inputs_for_generation(input_ids, **model_kwargs)
   2520 # forward pass to get next token
-> 2521 outputs = self(
   2522     **model_inputs,
   2523     return_dict=True,
   2524     output_attentions=output_attentions,
   2525     output_hidden_states=output_hidden_states,
   2526 )
   2528 if synced_gpus and this_peer_finished:
   2529     continue  # don't waste resources running the code we don't need

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/accelerate/hooks.py:164, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    162         output = module._old_forward(*args, **kwargs)
    163 else:
--> 164     output = module._old_forward(*args, **kwargs)
    165 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:610, in CogVLMForCausalLM.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict, labels)
    607 return_dict = return_dict if return_dict is not None else self.config.use_return_dict
    609 # decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
--> 610 outputs = self.model(
    611     input_ids=input_ids,
    612     images=images,
    613     token_type_ids=token_type_ids,
    614     attention_mask=attention_mask,
    615     position_ids=position_ids,
    616     past_key_values=past_key_values,
    617     inputs_embeds=inputs_embeds,
    618     use_cache=use_cache,
    619     output_attentions=output_attentions,
    620     output_hidden_states=output_hidden_states,
    621     return_dict=return_dict,
    622 )
    624 hidden_states = outputs[0]
    625 logits = self.lm_head(hidden_states)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:378, in CogVLMModel.forward(self, input_ids, images, token_type_ids, attention_mask, position_ids, past_key_values, inputs_embeds, use_cache, output_attentions, output_hidden_states, return_dict)
    376 assert len(input_ids) == len(images), f"{len(input_ids)} {len(images)}"
    377 inputs_embeds = self.embed_tokens(input_ids)
--> 378 images_features = self.encode_images(images)
    379 images_features = rearrange(images_features, 'b n d -> (b n) d')
    380 images_features = images_features.to(dtype=inputs_embeds.dtype, device=inputs_embeds.device)

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/modeling_cogvlm.py:350, in CogVLMModel.encode_images(self, images)
    347         images.append(image)
    349 images = torch.stack(images)
--> 350 images_features = self.vision(images)
    351 return images_features

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:129, in EVA2CLIPModel.forward(self, images)
    127 def forward(self, images: "tensor(B, C, H, W)") -> "tensor(B, L, D)":
    128     x = self.patch_embedding(images)
--> 129     x = self.transformer(x)
    130     x = x[:, 1:]
    131     x = self.linear_proj(x)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:94, in Transformer.forward(self, hidden_states)
     92 def forward(self, hidden_states):
     93     for layer_module in self.layers:
---> 94         hidden_states = layer_module(hidden_states)
     95     return hidden_states

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:79, in TransformerLayer.forward(self, hidden_states)
     77 def forward(self, hidden_states):
     78     attention_input = hidden_states
---> 79     attention_output = self.input_layernorm(self.attention(attention_input))
     80     hidden_states = attention_input + attention_output
     81     mlp_input = hidden_states

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1518, in Module._wrapped_call_impl(self, *args, **kwargs)
   1516     return self._compiled_call_impl(*args, **kwargs)  # type: ignore[misc]
   1517 else:
-> 1518     return self._call_impl(*args, **kwargs)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/torch/nn/modules/module.py:1527, in Module._call_impl(self, *args, **kwargs)
   1522 # If we don't have any hooks, we want to skip the rest of the logic in
   1523 # this function, and just call forward.
   1524 if not (self._backward_hooks or self._backward_pre_hooks or self._forward_hooks or self._forward_pre_hooks
   1525         or _global_backward_pre_hooks or _global_backward_hooks
   1526         or _global_forward_hooks or _global_forward_pre_hooks):
-> 1527     return forward_call(*args, **kwargs)
   1529 try:
   1530     result = None

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/accelerate/hooks.py:164, in add_hook_to_module.<locals>.new_forward(module, *args, **kwargs)
    162         output = module._old_forward(*args, **kwargs)
    163 else:
--> 164     output = module._old_forward(*args, **kwargs)
    165 return module._hf_hook.post_forward(module, output)

File ~/.cache/huggingface/modules/transformers_modules/THUDM/cogvlm-chat-hf/8abca878c4257412c4c38eeafaed3fe27a036730/visual.py:40, in Attention.forward(self, x)
     37 qkv = qkv.reshape(B, L, 3, self.num_heads, -1).permute(2, 0, 1, 3, 4)  # 3, B, L, H, D
     38 q, k, v = qkv[0], qkv[1], qkv[2]
---> 40 out = xops.memory_efficient_attention(
     41     q, k, v, scale=self.scale,
     42 )
     43 output = self.dense(out.view(B, L, -1))
     44 output = self.output_dropout(output)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:223, in memory_efficient_attention(query, key, value, attn_bias, p, scale, op)
    116 def memory_efficient_attention(
    117     query: torch.Tensor,
    118     key: torch.Tensor,
   (...)
    124     op: Optional[AttentionOp] = None,
    125 ) -> torch.Tensor:
    126     """Implements the memory-efficient attention mechanism following
    127     `"Self-Attention Does Not Need O(n^2) Memory" <http://arxiv.org/abs/2112.05682>`_.
    128 
   (...)
    221     :return: multi-head attention Tensor with shape ``[B, Mq, H, Kv]``
    222     """
--> 223     return _memory_efficient_attention(
    224         Inputs(
    225             query=query, key=key, value=value, p=p, attn_bias=attn_bias, scale=scale
    226         ),
    227         op=op,
    228     )

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:321, in _memory_efficient_attention(inp, op)
    316 def _memory_efficient_attention(
    317     inp: Inputs, op: Optional[AttentionOp] = None
    318 ) -> torch.Tensor:
    319     # fast-path that doesn't require computing the logsumexp for backward computation
    320     if all(x.requires_grad is False for x in [inp.query, inp.key, inp.value]):
--> 321         return _memory_efficient_attention_forward(
    322             inp, op=op[0] if op is not None else None
    323         )
    325     output_shape = inp.normalize_bmhk()
    326     return _fMHA.apply(
    327         op, inp.query, inp.key, inp.value, inp.attn_bias, inp.p, inp.scale
    328     ).reshape(output_shape)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py:337, in _memory_efficient_attention_forward(inp, op)
    335 output_shape = inp.normalize_bmhk()
    336 if op is None:
--> 337     op = _dispatch_fw(inp, False)
    338 else:
    339     _ensure_op_supports_or_raise(ValueError, "memory_efficient_attention", op, inp)

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:120, in _dispatch_fw(inp, needs_gradient)
    111 def _dispatch_fw(inp: Inputs, needs_gradient: bool) -> Type[AttentionFwOpBase]:
    112     """Computes the best operator for forward
    113 
    114     Raises:
   (...)
    118         AttentionOp: The best operator for the configuration
    119     """
--> 120     return _run_priority_list(
    121         "memory_efficient_attention_forward",
    122         _dispatch_fw_priority_list(inp, needs_gradient),
    123         inp,
    124     )

File ~/jw-cogvlm-grounding-generalist-hf/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py:63, in _run_priority_list(name, priority_list, inp)
     61 for op, not_supported in zip(priority_list, not_supported_reasons):
     62     msg += "\n" + _format_not_supported_reasons(op, not_supported)
---> 63 raise NotImplementedError(msg)

NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(1, 1226, 16, 112) (torch.bfloat16)
     key         : shape=(1, 1226, 16, 112) (torch.bfloat16)
     value       : shape=(1, 1226, 16, 112) (torch.bfloat16)
     attn_bias   : <class 'NoneType'>
     p           : 0.0
`decoderF` is not supported because:
    attn_bias type is <class 'NoneType'>
    bf16 is only supported on A100+ GPUs
`flshattF@v2.3.2` is not supported because:
    requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
    bf16 is only supported on A100+ GPUs
`tritonflashattF` is not supported because:
    requires device with capability > (8, 0) but your GPU has capability (7, 0) (too old)
    bf16 is only supported on A100+ GPUs
    operator wasn't built - see `python -m xformers.info` for more info
    triton is not available
    requires GPU with sm80 minimum compute capacity, e.g., A100/H100/L4
    Only work on pre-MLIR triton for now
`cutlassF` is not supported because:
    bf16 is only supported on A100+ GPUs
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    dtype=torch.bfloat16 (supported: {torch.float32})
    has custom scale
    bf16 is only supported on A100+ GPUs
    unsupported embed per head: 112
Reply from chenkq (Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org), edited Nov 27, 2023:

Oops, the V100 does not support bf16; try fp16 instead.
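For reference, a rough sketch of the fp16 variant of your loading code (untested here; only the dtype changes, the dispatch call stays as you have it):

import torch
from transformers import AutoModelForCausalLM

# Load in fp16 instead of bf16: the V100 (compute capability 7.0) has no bf16 support.
model = AutoModelForCausalLM.from_pretrained(
    'THUDM/cogvlm-chat-hf',
    torch_dtype=torch.float16,
    low_cpu_mem_usage=True,
    trust_remote_code=True
)

# ...and cast the image tensor to fp16 as well when building the inputs, e.g.
# 'images': [[inputs['images'][0].to('cuda').to(torch.float16)]]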

Thanks @chenkq, that seems to have been my issue! I switched to NVIDIA A10G GPUs and am now able to use bf16.
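In case it helps anyone else who lands here, a small sketch of how you could pick the dtype at runtime instead of hard-coding bf16 (assumes a CUDA device is available):

import torch

# bf16 needs Ampere or newer (compute capability >= 8.0); fall back to fp16 otherwise.
dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
print(f"compute capability: {torch.cuda.get_device_capability()}, using dtype: {dtype}")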

justinwickett changed discussion status to closed
