No operator found for `memory_efficient_attention_forward` error when setting use_memory_efficient_attention=True

#8
by xiewk - opened

Hi,

When I load the model with use_memory_efficient_attention=True, inference fails with a "No operator found for `memory_efficient_attention_forward`" error. Do you know why?

My code is below; it is basically a copy of your sample code with only very slight changes.

    import torch.nn.functional as F
    import torch
    from transformers import AutoModel, AutoTokenizer, AutoConfig

    input_texts = [
        "what is the capital of China?",
        "how to implement quick sort in python?",
        "Beijing",
        "sorting algorithms"
    ]

    model_name_or_path = 'Alibaba-NLP/gte-large-en-v1.5'
    revision = 'main'
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True, code_revision=revision)
    model = AutoModel.from_pretrained(
        model_name_or_path,
        revision=revision,
        trust_remote_code=True,
        use_memory_efficient_attention=True,
        device_map='cuda',
    )

    batch_dict = tokenizer(input_texts, max_length=8192, padding=True, truncation=True, return_tensors='pt')

    with torch.autocast(device_type='cuda', dtype=torch.float16):
        outputs = model(**{
            k: v.to(model.device) if isinstance(v, torch.Tensor) else v
            for k, v in batch_dict.items()
        })

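        # CLS pooling: take the first token's hidden state as the sentence embedding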
        embeddings = outputs.last_hidden_state[:, 0]

        # (Optionally) normalize embeddings
        embeddings = F.normalize(embeddings, p=2, dim=1)
        scores = (embeddings[:1] @ embeddings[1:].T) * 100
        print(scores.tolist())

My environment is Ubuntu 22.04 with an Nvidia RTX 3050 Ti GPU and CUDA 12.2.
Versions of key packages:

python=3.10
torch==2.3.0
xformers==0.0.26.post1
transformers==4.40.1

The error message:

  File "/home/victor/.cache/huggingface/modules/transformers_modules/Alibaba-NLP/new-impl/b7ea01ba91f26ef946f8c25261151b13aa502268/modeling.py", line 499, in forward
    context_layer = self.memory_efficient_attention(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 268, in memory_efficient_attention
    return _memory_efficient_attention(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 392, in _memory_efficient_attention
    return _fMHA.apply(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/torch/autograd/function.py", line 598, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 67, in forward
    out, op_ctx = _memory_efficient_attention_forward_requires_grad(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/__init__.py", line 417, in _memory_efficient_attention_forward_requires_grad
    op = _dispatch_fw(inp, True)
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 125, in _dispatch_fw
    return _run_priority_list(
  File "/home/victor/workspace/venv/lib/python3.10/site-packages/xformers/ops/fmha/dispatch.py", line 65, in _run_priority_list
    raise NotImplementedError(msg)
NotImplementedError: No operator found for `memory_efficient_attention_forward` with inputs:
     query       : shape=(4, 10, 16, 64) (torch.float16)
     key         : shape=(4, 10, 16, 64) (torch.float16)
     value       : shape=(4, 10, 16, 64) (torch.float16)
     attn_bias   : <class 'torch.Tensor'>
     p           : 0.0
`flshattF@v2.5.6` is not supported because:
    attn_bias type is <class 'torch.Tensor'>
`cutlassF` is not supported because:
    attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (1600, 100, 10, 1))
    attn_bias.stride(-2) % 8 != 0 (attn_bias.stride() = (1600, 100, 10, 1))
    HINT: To use an `attn_bias` with a sequence length that is not a multiple of 8, you need to ensure memory is aligned by slicing a bigger tensor. Example: use `attn_bias = torch.zeros([1, 1, 5, 8])[:,:,:,:5]` instead of `torch.zeros([1, 1, 5, 5])`
`smallkF` is not supported because:
    max(query.shape[-1] != value.shape[-1]) > 32
    dtype=torch.float16 (supported: {torch.float32})
    bias with non-zero stride not supported
    unsupported embed per head: 64
Alibaba-NLP org

Hi, you could set unpad_inputs=True together with use_memory_efficient_attention=True, or set pad_to_multiple_of=8 when tokenizing.
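
For example, applied to your snippet (a minimal sketch of the two options; `unpad_inputs` is a flag of this model's custom code, `pad_to_multiple_of` is a standard tokenizer argument):

    # Option 1: also pass unpad_inputs=True when loading the model
    model = AutoModel.from_pretrained(
        model_name_or_path,
        revision=revision,
        trust_remote_code=True,
        use_memory_efficient_attention=True,
        unpad_inputs=True,
        device_map='cuda',
    )

    # Option 2: pad the tokenized batch to a multiple of 8 so the
    # attn_bias has the memory alignment the xformers kernels require
    batch_dict = tokenizer(input_texts, max_length=8192, padding=True,
                           pad_to_multiple_of=8, truncation=True, return_tensors='pt')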

izhx changed discussion status to closed
Alibaba-NLP org

And here is the output of a successful run:

All model checkpoint weights were used when initializing NewModel.

All the weights of NewModel were initialized from the model checkpoint at Alibaba-NLP/gte-large-en-v1.5.
If your task is similar to the task the model of the checkpoint was trained on, you can already use NewModel for predictions without further training.
[[41.875, 77.125, 37.03125]]