|
\pagebreak |
|
\section*{Two Feed-Forward Layers = Attention over Parameters}\label{sec:parameter_attention} |
|
|
|
In addition to attention layers, our model contains position-wise feed-forward networks (Section \ref{sec:ffn}), which consist of two linear transformations with a ReLU activation in between. In fact, these networks too can be seen as a form of attention. Compare the formula for such a network with the formula for a simple dot-product attention layer (biases and scaling factors omitted): |
|
|
|
\begin{align*}
\mathrm{FFN}(x, W_1, W_2) &= \mathrm{ReLU}(xW_1)W_2 \\
A(q, K, V) &= \mathrm{Softmax}(qK^T)V
\end{align*}
|
|
|
Based on the similarity of these formulae, the two-layer feed-forward network can be seen as a kind of attention, where the keys are the columns of the trainable parameter matrix $W_1$, the values are the rows of the trainable parameter matrix $W_2$, and ReLU replaces Softmax as the compatibility function.
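As a concrete check (a minimal numpy sketch with toy dimensions; random arrays stand in for trained weights), the two views compute exactly the same quantity:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes standing in for 512 and 2048

x = rng.standard_normal(d_model)           # activation at one position
W1 = rng.standard_normal((d_model, d_ff))
W2 = rng.standard_normal((d_ff, d_model))

# Feed-forward view: ReLU(x W1) W2
ffn_out = np.maximum(x @ W1, 0.0) @ W2

# Attention view: query x, keys = columns of W1, values = rows of W2,
# with ReLU in place of Softmax as the compatibility function.
K = W1.T                 # (d_ff, d_model): one key per hidden unit
V = W2                   # (d_ff, d_model): one value per hidden unit
attn_out = np.maximum(x @ K.T, 0.0) @ V

assert np.allclose(ffn_out, attn_out)
```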
|
|
|
|
|
|
|
Given this similarity, we experimented with replacing the position-wise feed-forward networks with attention layers similar to the ones we use everywhere else in our model. The multi-head-attention-over-parameters sublayer is identical to the multi-head attention described in Section~\ref{sec:multihead}, except that the ``keys'' and ``values'' inputs to each attention head are trainable model parameters, as opposed to being linear projections of a previous layer. These parameters are scaled up by a factor of $\sqrt{d_{model}}$ in order to be more similar to activations.
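The construction can be sketched as follows (a numpy illustration with toy dimensions; random arrays stand in for trained parameters, and layer details such as biases are omitted):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
d_model, h_p, d_pk, d_pv, n_p = 64, 4, 16, 16, 32  # toy sizes

# Trainable parameters (random here): per-head query projections,
# per-head key/value tables, and the output projection.
Wq = rng.standard_normal((h_p, d_model, d_pk)) / np.sqrt(d_model)
K  = rng.standard_normal((h_p, n_p, d_pk)) * np.sqrt(d_model)  # scaled up
V  = rng.standard_normal((h_p, n_p, d_pv)) * np.sqrt(d_model)  # scaled up
Wo = rng.standard_normal((h_p * d_pv, d_model)) / np.sqrt(h_p * d_pv)

def attention_over_parameters(x):
    """x: (seq_len, d_model) -> (seq_len, d_model)."""
    heads = []
    for i in range(h_p):
        q = x @ Wq[i]                          # (seq, d_pk)
        scores = q @ K[i].T / np.sqrt(d_pk)    # (seq, n_p)
        heads.append(softmax(scores) @ V[i])   # (seq, d_pv)
    return np.concatenate(heads, axis=-1) @ Wo

y = attention_over_parameters(rng.standard_normal((5, d_model)))
assert y.shape == (5, d_model)
```

Note that, unlike ordinary multi-head attention, the keys and values here do not depend on the input sequence at all; only the queries do.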
|
|
|
In our first experiment, we replaced each position-wise feed-forward network with a multi-head-attention-over-parameters sublayer with $h_p=8$ heads, key-dimensionality $d_{pk}=64$, and value-dimensionality $d_{pv}=64$, using $n_p=1536$ key-value pairs for each attention head. The sublayer has a total of $2097152$ parameters, including the parameters in the query projection and the output projection. This matches the number of parameters in the position-wise feed-forward network that we replaced. While the theoretical amount of computation is also the same, in practice, the attention version caused the step times to be about 30\% longer. |
|
|
|
In our second experiment, we used $h_p=16$ heads and $n_p=512$ key-value pairs for each attention head, again matching the total number of parameters in the base model.
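The same bookkeeping, applied to the AOP$_2$ configuration of Table~\ref{tab:parameter_attention} ($h_p=16$, $n_p=512$), again recovers the feed-forward parameter total:

```python
d_model, d_ff = 512, 2048

def aop_params(h_p, d_pk, d_pv, n_p):
    # query projection + keys + values + output projection (biases omitted)
    return (d_model * h_p * d_pk
            + h_p * n_p * (d_pk + d_pv)
            + h_p * d_pv * d_model)

assert aop_params(16, 64, 64, 512) == 2 * d_model * d_ff == 2097152
```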
|
|
|
Results for the first experiment were slightly worse than for the base model, and results for the second experiment were slightly better; see Table~\ref{tab:parameter_attention}.
|
|
|
\begin{table}[h] |
|
\caption{Replacing the position-wise feed-forward networks with multi-head attention over parameters produces similar results to the base model. All metrics are on the English-to-German translation development set, newstest2013.}
|
\label{tab:parameter_attention} |
|
\begin{center} |
|
\vspace{-2mm} |
|
|
|
\begin{tabular}{c|cccccc|cccc} |
|
\hline\rule{0pt}{2.0ex} |
|
& \multirow{2}{*}{$\dmodel$} & \multirow{2}{*}{$\dff$} & |
|
\multirow{2}{*}{$h_p$} & \multirow{2}{*}{$d_{pk}$} & \multirow{2}{*}{$d_{pv}$} & |
|
\multirow{2}{*}{$n_p$} & |
|
PPL & BLEU & params & training\\ |
|
& & & & & & & (dev) & (dev) & $\times10^6$ & time \\ |
|
\hline\rule{0pt}{2.0ex} |
|
base & 512 & 2048 & & & & & 4.92 & 25.8 & 65 & 12 hours\\ |
|
\hline\rule{0pt}{2.0ex} |
|
AOP$_1$ & 512 & & 8 & 64 & 64 & 1536 & 4.92& 25.5 & 65 & 16 hours\\ |
|
AOP$_2$ & 512 & & 16 & 64 & 64 & 512 & \textbf{4.86} & \textbf{25.9} & 65 & 16 hours \\ |
|
\hline |
|
\end{tabular} |
|
|
|
\end{center} |
|
\end{table} |
|
|