When torch.nn.functional.scaled_dot_product_attention calls _scaled_dot_product_attention_math, the model reports an error

#3
by Quasimodo0808 - opened

If the SDPA call in visual.py::attention_fn_default() falls back to the math kernel, its output is contiguous in (B, H, L, D) layout. That output is then transpose()d and view() is executed: https://huggingface.co/THUDM/cogvlm2-video-llama3-chat/blob/main/visual.py#L78. Because the transposed tensor is no longer contiguous, view() reports an error.
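A minimal repro outside of visual.py, assuming PyTorch >= 2.3 where torch.nn.attention.sdpa_kernel is available (on older versions torch.backends.cuda.sdp_kernel plays a similar role):

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

B, H, L, D = 1, 8, 16, 64
q = k = v = torch.randn(B, H, L, D)

with sdpa_kernel(SDPBackend.MATH):                 # force the math kernel
    out = F.scaled_dot_product_attention(q, k, v)  # (B, H, L, D), contiguous

out = out.transpose(1, 2)        # (B, L, H, D): same data, non-contiguous strides
print(out.is_contiguous())       # False

try:
    out.view(B, L, -1)           # view() needs compatible strides -> error
except RuntimeError as e:
    print("view() failed:", e)

print(out.reshape(B, L, -1).shape)   # reshape() copies when needed -> works
```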

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org


Try using reshape() instead.

Knowledge Engineering Group (KEG) & Data Mining at Tsinghua University org

You can try changing output = self.dense(out.view(B, L, -1)) to output = self.dense(out.reshape(B, L, -1)).
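For context, a hedged sketch of where that one-line change sits. This is a hypothetical minimal attention module, not the actual visual.py code; only self.dense and the view/reshape line come from the file, names like query_key_value are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class Attention(nn.Module):
    """Hypothetical stand-in for the visual attention block, just to show
    where the one-line change goes; not the real visual.py implementation."""
    def __init__(self, hidden_size: int = 1024, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        self.query_key_value = nn.Linear(hidden_size, 3 * hidden_size)
        self.dense = nn.Linear(hidden_size, hidden_size)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, L, _ = x.shape
        qkv = (self.query_key_value(x)
               .reshape(B, L, 3, self.num_heads, self.head_dim)
               .permute(2, 0, 3, 1, 4))                # 3 x (B, H, L, D)
        q, k, v = qkv[0], qkv[1], qkv[2]
        out = F.scaled_dot_product_attention(q, k, v)  # (B, H, L, D)
        out = out.transpose(1, 2)                      # (B, L, H, D)
        # was: output = self.dense(out.view(B, L, -1))  # fails on the math kernel
        return self.dense(out.reshape(B, L, -1))       # works with every backend

x = torch.randn(2, 16, 1024)
print(Attention()(x).shape)  # torch.Size([2, 16, 1024])
```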

Is view() used there because the code assumes SDPA's flash-attention kernel?
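If you want to check that yourself, here is a small sketch that forces each backend and tests whether the transposed output is still view-compatible. It assumes PyTorch >= 2.3 and a CUDA GPU with fp16 inputs, since flash-attention needs half precision; as far as I can tell the flash kernel's output layout typically stays contiguous after the transpose, while the math kernel's does not:

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

q = k = v = torch.randn(1, 8, 16, 64, device="cuda", dtype=torch.float16)

for backend in (SDPBackend.FLASH_ATTENTION, SDPBackend.MATH):
    with sdpa_kernel(backend):
        out = F.scaled_dot_product_attention(q, k, v)
    # Whether view() can follow transpose() depends on the kernel's output layout.
    print(backend, out.transpose(1, 2).is_contiguous())
```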
