arxiv:2302.06461

A Study on ReLU and Softmax in Transformer

Published on Feb 13, 2023
Authors: Xu Tan, et al.

Abstract

The Transformer architecture consists of self-attention and feed-forward networks (FFNs) which can be viewed as key-value memories according to previous works. However, FFN and traditional memory utilize different activation functions (i.e., ReLU and Softmax respectively), which makes them not equivalent. In this paper, we first rebuild the connections between FFN and key-value memory by conducting extensive studies on ReLU and Softmax, and find they are equivalent when adding an additional layer normalization module on Softmax. In addition, ReLU outperforms Softmax on both FFN and key-value memory when the number of value slots is large. We analyze the reasons and then explore this good property of ReLU on the self-attention network where the original Softmax activation performs poorly on long input sequences. We then propose a full ReLU architecture named ReLUFormer which performs better than the baseline Transformer on long sequence tasks such as document translation. This paper sheds light on the following points: 1) Softmax and ReLU use different normalization methods over elements which lead to different variances of results, and ReLU is good at dealing with a large number of key-value slots; 2) FFN and key-value memory are equivalent, and thus the Transformer can be viewed as a memory network where FFNs and self-attention networks are both key-value memories.
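To make the abstract's FFN-as-key-value-memory view concrete, here is a minimal sketch contrasting a two-layer ReLU FFN with a Softmax "memory" built from the same key and value matrices. The dimensions and variable names (d_model, n_slots, etc.) are illustrative assumptions, not values from the paper.

# Minimal sketch (PyTorch): a two-layer FFN vs. a key-value memory over the same weights.
# Shapes and names are illustrative assumptions, not taken from the paper.
import torch
import torch.nn.functional as F

d_model, n_slots = 64, 1024              # hidden size, number of key-value slots
x = torch.randn(8, d_model)              # a batch of token representations
keys = torch.randn(n_slots, d_model)     # first FFN layer ~ memory keys
values = torch.randn(n_slots, d_model)   # second FFN layer ~ memory values

scores = x @ keys.t()                    # (8, n_slots) inner products with the keys

# Two-layer FFN: ReLU over the scores, then project through the values.
ffn_out = F.relu(scores) @ values

# Traditional key-value memory: Softmax over the scores instead.
mem_out = F.softmax(scores, dim=-1) @ values

# The paper's observation: Softmax normalizes across all slots, so each weight
# (and the output variance) shrinks as n_slots grows, while ReLU does not.
print(F.relu(scores).var().item(), F.softmax(scores, dim=-1).var().item())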

Community

Proposes ReLUFormer, a Transformer built entirely on ReLU activations; studies and compares the activations of FFNs and traditional memory networks in Transformers. Softmax followed by LayerNorm is equivalent to ReLU; the paper introduces a self-attention module that uses ReLU instead of Softmax, and shows that ReLU handles larger numbers of key-value slots and longer sequences better. The Transformer is viewed as a memory network in which both FFNs and self-attention are KV memories. Two-layer FFNs and KV memory networks (which learn key-value pairs and output the softmax of the input times the key transpose, multiplied by the values) are similar, as are self-attention networks (SANs). On the IWSLT14 De-En translation task, FFNs are replaced with KV memories; the variance of Softmax outputs is smaller, and applying LayerNorm over it increases the variance (giving performance comparable to ReLU). ReLU has stronger capacity at larger memory sizes (Softmax values and gradients become too small, leading to sub-optimal memory utilization and an over-centralized distribution). Directly replacing Softmax with ReLU destabilizes training: the variance of the output grows with input sequence length. The paper proposes a variance-reduction factor (analogous to the scaling factor in attention) for ReLU in the SAN, plus an entropy-margin regularization loss to keep the entropy in check. For the Transformer decoder, causal self-attention attends over a different number of tokens at each position (tokens after the current one are masked). ReLUFormer achieves better translation accuracy and faster speed than the vanilla Transformer, Sparsemax, 1.5-Entmax, and ReLA. It also gives better results on document translation (Europarl7 En-De), especially on long-sequence datasets, and ReLU captures more distant correlations (qualitative attention visualizations). The appendix covers the datasets, implementation details, the document translation dataset, and baseline details. From Zhejiang University and Microsoft.
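As a rough illustration of the ReLU self-attention described above, the sketch below applies ReLU to the attention scores and rescales by sequence length to keep the output variance from growing. The exact normalization used in the paper (the gamma asked about below) is not reproduced here; the 1/seq_len factor is a placeholder assumption, and the entropy-margin regularizer is omitted.

# Rough sketch (PyTorch) of ReLU-based self-attention with a length-dependent rescaling.
# The 1/seq_len factor stands in for the paper's variance-reduction factor (an assumption).
import math
import torch
import torch.nn.functional as F

def relu_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_head)
    d_head = q.size(-1)
    seq_len = k.size(-2)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_head)   # (batch, seq_len, seq_len)
    # ReLU instead of Softmax: the weights are not normalized across keys,
    # so the output variance grows with seq_len unless it is rescaled.
    weights = F.relu(scores) / seq_len                      # placeholder variance control
    return weights @ v

q = k = v = torch.randn(2, 128, 64)
print(relu_attention(q, k, v).shape)  # torch.Size([2, 128, 64])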

Links: PapersWithCode

I could not find the concrete value of gamma used in the proposed ReLU-based attention. How to tune this and what's the search space?

