Is it possible to add logic for handling output_attentions?

#30
by abhinavkulkarni - opened

Hi,

I am trying to add GPTQ support for MPT models in the AutoGPTQ repository. Adding support for a new model is relatively simple; for example, looking at the opt.py script for Facebook's OPT models, all one needs to do is specify the names of the nn.Linear layers that need to be quantized.
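
For context, a model-support file in AutoGPTQ is essentially a small class that names the nn.Linear submodules inside each transformer block to quantize. A rough sketch of what an MPT entry could look like, written by analogy to opt.py (the base-class import path and the MPT module names are my assumptions, so double-check them against the repo):

```python
# Sketch only: follows the pattern of AutoGPTQ's opt.py; verify names against the repo.
from auto_gptq.modeling._base import BaseGPTQForCausalLM


class MPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # Class name of one transformer block and the attribute path to the block list.
    layer_type = "MPTBlock"
    layers_block_name = "transformer.blocks"
    # Modules outside the repeated blocks (embeddings, final norm) are left unquantized.
    outside_layer_modules = ["transformer.wte", "transformer.norm_f"]
    # nn.Linear submodules inside each block, grouped roughly by execution order.
    inside_layer_modules = [
        ["attn.Wqkv"],
        ["attn.out_proj"],
        ["ffn.up_proj"],
        ["ffn.down_proj"],
    ]
```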

I did something similar for MPT models; however, I seem to be running into a problem at this line number. It seems the attentions are not being passed in the kwargs. How can that be remedied?

Thanks!

Mosaic ML, Inc. org

Hi @abhinavkulkarni , is the ask here if attention_mask is being passed as a kwarg to the forward of MPTForCausalLM?

Hey @sam-mosaic ,

Thanks for the reply. You can see here that the output_attentions option is not handled yet in modeling_mpt.py: https://huggingface.co/mosaicml/mpt-7b/blob/main/modeling_mpt.py#L140

It would be nice if this if block were filled in instead of raising NotImplementedError. I think it should be trivial given that MPT uses a traditional transformer architecture: collect the attention outputs from every hidden layer in the forward function and return them in a tuple.

You can see these lines from modeling_opt.py for reference (a simplified sketch of the pattern follows the links):

https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L245
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L368
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L725
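
For illustration, the pattern those lines implement boils down to something like the sketch below. This is a simplified paraphrase of the usual transformers loop, not the actual OPT or MPT code, and the block contract is assumed:

```python
def forward_with_attentions(blocks, hidden_states, attention_mask=None, output_attentions=False):
    """Run each transformer block and optionally collect its post-softmax
    attention matrix (batch, num_heads, seq_len, seq_len) into a tuple."""
    all_self_attns = () if output_attentions else None
    for block in blocks:
        # Assumed block contract: returns (hidden_states, attn_weights_or_None).
        hidden_states, attn_weights = block(
            hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )
        if output_attentions:
            all_self_attns = all_self_attns + (attn_weights,)
    # all_self_attns then becomes the `attentions` field of the returned ModelOutput.
    return hidden_states, all_self_attns
```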

Thanks for the great work!

Mosaic ML, Inc. org

Thanks @abhinavkulkarni , I get it now. IIUC, output_attentions returns the attention matrix from the attention module?

We do not use the torch code path much; we usually train with Triton Flash or CUDA Flash. However, neither of those attention implementations can support outputting the attention matrix, so if we supported this flag it would only be for torch. Does AutoGPTQ mainly focus on lower-resource inference and fine-tuning?
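
For anyone following along, a minimal sketch of why only the torch path can hand back attentions: a plain implementation materializes the full softmax matrix, while flash-style kernels compute the output in tiles and never store it (names below are illustrative, not the MPT code):

```python
import math
import torch

def naive_attention(q, k, v, output_attentions=False):
    # Plain torch path: the full (batch, heads, seq_q, seq_k) softmax matrix is
    # built explicitly, so it can simply be returned alongside the output.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    attn_weights = torch.softmax(scores, dim=-1)
    out = attn_weights @ v
    # Flash-style kernels produce `out` without ever storing attn_weights,
    # which is why they cannot honor output_attentions=True.
    return out, attn_weights if output_attentions else None
```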

Hey @sam-mosaic ,

So, it seems the recent changes have solved most of the issues, except that line 110 of modeling_mpt.py needs to be changed from:

`return (attn_bias, None)`

to

`return (attn_bias, attention_mask)`

I made changes in my local copy of modeling_mpt.py in site-packages and was able to GPTQ-quantize this model using the AutoGPTQ repo.
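
For anyone who wants to reproduce this, a minimal AutoGPTQ quantization sketch would look roughly like the following. The checkpoint name, calibration sample, and config values are placeholders, and the calls follow AutoGPTQ's usual pattern, so verify them against the repo's own examples:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "mosaicml/mpt-7b"  # placeholder checkpoint
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# A single tokenized calibration sample for brevity; real runs use a few hundred.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

model.quantize(examples)
model.save_quantized("mpt-7b-4bit-gptq")
```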

Mosaic ML, Inc. org
edited Jun 6, 2023

To improve efficiency, in line 109 of modeling_mpt.py, we integrate attention_mask into attn_bias if it exists.
If the requested attn_impl does not support an attn bias, then we use attention_mask (e.g., attn_impl: flash does not support an attn bias, and therefore the output of the _attn_bias fn is (None, attention_mask); see line 88).

This does not control whether output_attentions is available.
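
In code, the contract being described is roughly the following. This is a paraphrase of the logic, not the literal modeling_mpt.py source, and the helper name is made up:

```python
import torch

def combine_bias_and_mask(attn_impl, attn_bias, attention_mask):
    """Return the (attn_bias, attention_mask) pair the blocks expect."""
    if attn_impl == "flash":
        # Flash cannot consume an additive bias, so the padding mask is handed
        # through unchanged and applied inside the flash attention call.
        return None, attention_mask
    # torch / triton: fold the padding mask into the additive bias so only one
    # object needs to be threaded through the transformer blocks.
    if attention_mask is not None:
        pad = ~attention_mask.bool()[:, None, None, :]  # (batch, 1, 1, seq_k)
        attn_bias = attn_bias.masked_fill(pad, torch.finfo(attn_bias.dtype).min)
    return attn_bias, None
```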

abhinavkulkarni changed discussion status to closed
