Extracting attention maps

#49
by roeehendel - opened

It seems that because the model uses scaled_dot_product_attention, passing output_attentions=True to the forward pass is not supported.
Attention masking via attention_mask is also not supported (and it fails silently; there is no assertion to warn the user).
Is there a workaround to enable these features?
Perhaps there should be an option to fall back to a regular (eager) implementation of attention instead of scaled_dot_product_attention.
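As a sketch of what such a fallback might look like: fused scaled_dot_product_attention kernels compute the softmax internally and discard the attention weights, while an eager implementation can simply return them. The function below is a hypothetical, dependency-free illustration (plain Python, not the model's actual code) of an eager attention that exposes the attention maps:

```python
import math

def eager_attention(q, k, v):
    """Plain (eager) scaled dot-product attention that also returns the
    attention weights, which fused sdpa kernels do not expose.

    q, k, v: lists of row vectors (lists of floats) of equal inner dimension.
    Returns (output, weights), where weights[i][j] is the attention that
    query i pays to key j.
    """
    d = len(q[0])
    # scores[i][j] = (q_i . k_j) / sqrt(d)
    scores = [[sum(qi * kj for qi, kj in zip(qrow, krow)) / math.sqrt(d)
               for krow in k] for qrow in q]
    # Row-wise softmax (numerically stabilized by subtracting the row max).
    weights = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        z = sum(exps)
        weights.append([e / z for e in exps])
    # output[i] = sum_j weights[i][j] * v_j
    out = [[sum(w * vj[c] for w, vj in zip(wrow, v)) for c in range(len(v[0]))]
           for wrow in weights]
    return out, weights
```

A masking variant would add a large negative bias to masked positions in `scores` before the softmax, which is exactly the behavior that currently fails silently. Note also that recent versions of Hugging Face transformers accept `attn_implementation="eager"` in `from_pretrained`, which may be the simplest workaround if this model is loaded through that API.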