MLP vs Self Attention

#1 opened by winglian

Can you provide any insight into the rationale for training the MLP rather than the self-attention modules?

Following He et al. (2022) [1], we hypothesize that this is because the FFN learns task-specific textual patterns (Geva et al., 2021) [2], whereas attention learns pairwise positional interactions, which do not require large capacity to adapt to new tasks.
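
For concreteness, here is a minimal sketch (not our actual training code) of restricting adaptation to the FFN versus the attention projections using the Hugging Face `peft` library. The module names assume a LLaMA-style architecture (`gate_proj`/`up_proj`/`down_proj` for the MLP, `q_proj`/`k_proj`/`v_proj`/`o_proj` for self-attention), and the checkpoint name is a placeholder:

```python
# Sketch: place LoRA adapters on the FFN (MLP) projections only, leaving the
# self-attention projections frozen, and vice versa for comparison.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Placeholder checkpoint; substitute the model you are fine-tuning.
model = AutoModelForCausalLM.from_pretrained("path/to/llama-style-model")

# FFN-only adaptation: target the MLP projections.
ffn_only = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["gate_proj", "up_proj", "down_proj"],  # FFN / MLP
    task_type="CAUSAL_LM",
)

# Attention-only adaptation, for comparison.
attn_only = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # self-attention
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, ffn_only)
model.print_trainable_parameters()  # only the FFN adapter weights are trainable
```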

[1] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, Graham Neubig: Towards a Unified View of Parameter-Efficient Transfer Learning. ICLR 2022
[2] Mor Geva, Roi Schuster, Jonathan Berant, Omer Levy: Transformer Feed-Forward Layers Are Key-Value Memories. EMNLP 2021: 5484-5495.