Is it possible to add logic for handling output_attentions?

#30
by abhinavkulkarni - opened

Hi,

I am trying to add GPTQ support for MPT models in the AutoGPTQ repository. Adding support for a new model is relatively simple; for example, looking at the opt.py script for Facebook's OPT models, all one needs to do is specify the names of the nn.Linear layers that need to be quantized.
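
For context, a model-support file in AutoGPTQ is essentially a small class that names the nn.Linear submodules inside each transformer block to quantize. A rough sketch of what an MPT entry could look like, written by analogy to opt.py (the base-class import path and the MPT module names are my assumptions, so double-check them against the repo):

```python
# Sketch only: follows the pattern of AutoGPTQ's opt.py; verify names against the repo.
from auto_gptq.modeling._base import BaseGPTQForCausalLM


class MPTGPTQForCausalLM(BaseGPTQForCausalLM):
    # Class name of one transformer block and the attribute path to the block list.
    layer_type = "MPTBlock"
    layers_block_name = "transformer.blocks"
    # Modules outside the repeated blocks (embeddings, final norm) are left unquantized.
    outside_layer_modules = ["transformer.wte", "transformer.norm_f"]
    # nn.Linear submodules inside each block, grouped roughly by execution order.
    inside_layer_modules = [
        ["attn.Wqkv"],
        ["attn.out_proj"],
        ["ffn.up_proj"],
        ["ffn.down_proj"],
    ]
```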

I did something similar for MPT models; however, I seem to be running into a problem at this line number. It seems the attentions are not being passed in the kwargs. How can that be remedied?

Thanks!

Mosaic ML, Inc. org

Hi @abhinavkulkarni , is the ask here if attention_mask is being passed as a kwarg to the forward of MPTForCausalLM?

Hey @sam-mosaic ,

Thanks for the reply. You can see here that the output_attentions option is not handled yet in modeling_mpt.py: https://huggingface.co/mosaicml/mpt-7b/blob/main/modeling_mpt.py#L140

It would be nice if this if block were filled in instead of raising NotImplementedError. I think it should be trivial given that MPT uses a traditional transformer architecture: collect the attention outputs from every hidden layer in the forward function and return them in a tuple.

You can see these lines from modeling_opt.py for reference (a simplified sketch of the pattern follows the links):

https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L245
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L368
https://github.com/huggingface/transformers/blob/main/src/transformers/models/opt/modeling_opt.py#L725
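
For illustration, the pattern those lines implement boils down to something like the sketch below. This is a simplified paraphrase of the usual transformers loop, not the actual OPT or MPT code, and the block contract is assumed:

```python
def forward_with_attentions(blocks, hidden_states, attention_mask=None, output_attentions=False):
    """Run each transformer block and optionally collect its post-softmax
    attention matrix (batch, num_heads, seq_len, seq_len) into a tuple."""
    all_self_attns = () if output_attentions else None
    for block in blocks:
        # Assumed block contract: returns (hidden_states, attn_weights_or_None).
        hidden_states, attn_weights = block(
            hidden_states,
            attention_mask=attention_mask,
            output_attentions=output_attentions,
        )
        if output_attentions:
            all_self_attns = all_self_attns + (attn_weights,)
    # all_self_attns then becomes the `attentions` field of the returned ModelOutput.
    return hidden_states, all_self_attns
```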

Thanks for the great work!

Mosaic ML, Inc. org

Thanks @abhinavkulkarni , I get it now. IIUC, output_attentions returns the attention matrix from the attention module?

We do not use the torch code path much; we usually train with Triton Flash or CUDA Flash. However, neither of those attention implementations can support outputting the attention matrix, so if we supported this flag it would only be for torch. Does AutoGPTQ mainly focus on lower-resource inference and fine-tuning?
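
For anyone following along, a minimal sketch of why only the torch path can hand back attentions: a plain implementation materializes the full softmax matrix, while flash-style kernels compute the output in tiles and never store it (names below are illustrative, not the MPT code):

```python
import math
import torch

def naive_attention(q, k, v, output_attentions=False):
    # Plain torch path: the full (batch, heads, seq_q, seq_k) softmax matrix is
    # built explicitly, so it can simply be returned alongside the output.
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.size(-1))
    attn_weights = torch.softmax(scores, dim=-1)
    out = attn_weights @ v
    # Flash-style kernels produce `out` without ever storing attn_weights,
    # which is why they cannot honor output_attentions=True.
    return out, attn_weights if output_attentions else None
```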

Hey @sam-mosaic ,

So, it seems the recent changes have solved most of the issues, except that line 110 of modeling_mpt.py needs to be changed from:

`return (attn_bias, None)`

to

`return (attn_bias, attention_mask)`

I made changes in my local copy of modeling_mpt.py in site-packages and was able to GPTQ-quantize this model using the AutoGPTQ repo.
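
For anyone who wants to reproduce this, a minimal AutoGPTQ quantization sketch would look roughly like the following. The checkpoint name, calibration sample, and config values are placeholders, and the calls follow AutoGPTQ's usual pattern, so verify them against the repo's own examples:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "mosaicml/mpt-7b"  # placeholder checkpoint
quantize_config = BaseQuantizeConfig(bits=4, group_size=128, desc_act=False)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoGPTQForCausalLM.from_pretrained(
    model_id, quantize_config, trust_remote_code=True
)

# A single tokenized calibration sample for brevity; real runs use a few hundred.
examples = [tokenizer("The quick brown fox jumps over the lazy dog.")]

model.quantize(examples)
model.save_quantized("mpt-7b-4bit-gptq")
```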

Mosaic ML, Inc. org
edited Jun 6, 2023

To improve efficiency, in line 109 of modeling_mpt.py, we integrate attention_mask into attn_bias if it exists.
If the requested attn_impl does not support an attn bias, then we use attention_mask (e.g., attn_impl: flash does not support an attn bias, and therefore the output of the _attn_bias fn is (None, attention_mask); see line 88).

This does not control whether output_attentions is available.
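
In code, the contract being described is roughly the following. This is a paraphrase of the logic, not the literal modeling_mpt.py source, and the helper name is made up:

```python
import torch

def combine_bias_and_mask(attn_impl, attn_bias, attention_mask):
    """Return the (attn_bias, attention_mask) pair the blocks expect."""
    if attn_impl == "flash":
        # Flash cannot consume an additive bias, so the padding mask is handed
        # through unchanged and applied inside the flash attention call.
        return None, attention_mask
    # torch / triton: fold the padding mask into the additive bias so only one
    # object needs to be threaded through the transformer blocks.
    if attention_mask is not None:
        pad = ~attention_mask.bool()[:, None, None, :]  # (batch, 1, 1, seq_k)
        attn_bias = attn_bias.masked_fill(pad, torch.finfo(attn_bias.dtype).min)
    return attn_bias, None
```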

abhinavkulkarni changed discussion status to closed
