arxiv:2309.01826

One Wide Feedforward is All You Need

Published on Sep 4, 2023
· Featured in Daily Papers on Sep 6, 2023

Abstract

The Transformer architecture has two main non-embedding components: Attention and the Feed Forward Network (FFN). Attention captures interdependencies between words regardless of their position, while the FFN non-linearly transforms each input token independently. In this work we explore the role of the FFN, and find that despite taking up a significant fraction of the model's parameters, it is highly redundant. Concretely, we are able to substantially reduce the number of parameters with only a modest drop in accuracy by removing the FFN on the decoder layers and sharing a single FFN across the encoder. Finally we scale this architecture back to its original size by increasing the hidden dimension of the shared FFN, achieving substantial gains in both accuracy and latency with respect to the original Transformer Big.

Community

• The Feedforward Networks take up a lot of the model's parameters, but the authors found they are actually redundant. The model works nearly as well when they remove Feedforward Networks from the decoder side and share them on the encoder side.
• By removing redundancy, they can substantially reduce the total number of parameters without hurting accuracy much.
• They also show they can scale the model back up by increasing the size of the shared Feedforward Network. This improves accuracy and speeds up computation compared to the original Transformer.
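The architecture change described above can be sketched in a few lines. This is a minimal toy illustration, not the paper's implementation: attention sublayers, layer norm, and real dimensions are omitted, and all sizes here are made-up small values. The key point is that a single FFN's parameters are created once and reused by every encoder layer, while decoder layers have no FFN sublayer at all.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff_wide, n_enc_layers = 16, 64, 6  # toy sizes, not the paper's

# One shared (widened) FFN: parameters created once, reused everywhere.
W1 = rng.normal(scale=0.02, size=(d_model, d_ff_wide))
W2 = rng.normal(scale=0.02, size=(d_ff_wide, d_model))

def shared_ffn(x):
    """Position-wise FFN with ReLU, shared across all encoder layers."""
    return np.maximum(x @ W1, 0.0) @ W2

x = rng.normal(size=(5, d_model))  # a toy sequence of 5 tokens
for _ in range(n_enc_layers):
    # (attention sublayer omitted for brevity)
    x = x + shared_ffn(x)  # residual connection around the shared FFN
# Decoder layers would run attention only, with no FFN sublayer.
```

Because `W1` and `W2` live outside the layer loop, the FFN contributes its parameters once rather than once per layer, which is where the size reduction comes from.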

"They are able to reduce the size substantially (20-40%) by sharing or removing FFNs, with a tradeoff in BLEU accuracy. The most aggressive reduction that maintains accuracy is sharing across the encoder and removing from the decoder, reducing size by 22% while recovering the accuracy loss by making the shared encoder FFN wider."
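Back-of-the-envelope parameter arithmetic makes the savings concrete. The dimensions below are assumed Transformer Big-style values (d_model=1024, d_ff=4096, 6 encoder and 6 decoder layers), biases are ignored, and embeddings are omitted, so this toy count shows the reduction over non-embedding parameters only; the paper's 22% figure is over the full model.

```python
# Assumed Transformer Big-style dimensions; biases and embeddings ignored.
d_model, d_ff, n_layers = 1024, 4096, 6

ffn = 2 * d_model * d_ff       # one FFN: up- and down-projection
attn = 4 * d_model ** 2        # one attention block: Q, K, V, O projections

baseline = n_layers * (attn + ffn)        # encoder layers
baseline += n_layers * (2 * attn + ffn)   # decoder: self- + cross-attention

shared = n_layers * attn + ffn            # encoder: one shared FFN in total
shared += n_layers * 2 * attn             # decoder: FFNs removed entirely

print(f"baseline: {baseline/1e6:.1f}M, shared: {shared/1e6:.1f}M, "
      f"non-embedding params saved: {1 - shared/baseline:.0%}")
```

Under these assumptions the shared/removed configuration cuts non-embedding parameters by roughly half, which dilutes to a smaller fraction (the paper reports 22%) once embedding tables are counted.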
