\pagebreak
\section*{Two Feed-Forward Layers = Attention over Parameters}\label{sec:parameter_attention}
In addition to attention layers, our model contains position-wise feed-forward networks (Section~\ref{sec:ffn}), which consist of two linear transformations with a ReLU activation in between. These networks, too, can be seen as a form of attention. Compare the formula for such a network with the formula for a simple dot-product attention layer (biases and scaling factors omitted):
\begin{align*}
    \mathrm{FFN}(x, W_1, W_2) &= \mathrm{ReLU}(xW_1)W_2 \\
    A(q, K, V) &= \mathrm{Softmax}(qK^T)V
\end{align*}
Based on the similarity of these formulae, the two-layer feed-forward network can be seen as a kind of attention in which the keys and values are trainable parameters. Matching terms gives $K^T = W_1$ and $V = W_2$, so the keys are the columns of $W_1$, the values are the rows of $W_2$, and ReLU replaces Softmax in the compatibility function.
%the compatibility function is $compat(q, k_i) = ReLU(q \cdot k_i)$ instead of $Softmax(qK^T)_i$.
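To make the correspondence explicit, both layers can be written as weighted sums over key-value pairs. The notation here is introduced only for illustration: $k_i$ denotes the $i$-th row of $W_1^T$ and $v_i$ the $i$-th row of $W_2$ for the feed-forward network, and the $i$-th rows of $K$ and $V$ for the attention layer. As above, biases and scaling factors are omitted.
\begin{align*}
    \mathrm{FFN}(x, W_1, W_2) &= \sum_{i} \mathrm{ReLU}(x \cdot k_i)\, v_i \\
    A(q, K, V) &= \sum_{i} \mathrm{Softmax}(qK^T)_i\, v_i
\end{align*}
In this form the two differ only in how the weight on each value is computed: ReLU acts on each dot product independently, whereas Softmax normalizes over all keys.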
Given this similarity, we experimented with replacing the position-wise feed-forward networks with attention layers similar to the ones we use everywhere else in our model. The multi-head-attention-over-parameters sublayer is identical to the multi-head attention described in Section~\ref{sec:multihead}, except that the ``keys'' and ``values'' inputs to each attention head are trainable model parameters, as opposed to being linear projections of a previous layer. These parameters are scaled up by a factor of $\sqrt{d_{model}}$ in order to be more similar to activations.
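As a rough sketch of this sublayer, under notation that is assumed here rather than defined elsewhere in the paper: let the sublayer have $h_p$ heads with $n_p$ key-value pairs each and key and value dimensionalities $d_{pk}$ and $d_{pv}$ (the hyperparameters used in the experiments below). Let $K_i$ (of shape $n_p \times d_{pk}$) and $V_i$ (of shape $n_p \times d_{pv}$) be the trainable keys and values of head $i$, $W_i^Q$ (of shape $d_{model} \times d_{pk}$) its query projection, and $W^O$ (of shape $h_p d_{pv} \times d_{model}$) the output projection. Omitting the dot-product scaling factor as in the formulas above, the sublayer would then compute approximately:
\begin{align*}
    \mathrm{head}_i &= \mathrm{Softmax}\big( (xW_i^Q)\,(\sqrt{d_{model}}\,K_i)^T \big)\,(\sqrt{d_{model}}\,V_i) \\
    \mathrm{AOP}(x) &= \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_{h_p})\,W^O
\end{align*}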
In our first experiment, we replaced each position-wise feed-forward network with a multi-head-attention-over-parameters sublayer with $h_p=8$ heads, key-dimensionality $d_{pk}=64$, and value-dimensionality $d_{pv}=64$, using $n_p=1536$ key-value pairs for each attention head. The sublayer has a total of $2097152$ parameters, including the parameters in the query projection and the output projection. This matches the number of parameters in the position-wise feed-forward network that we replaced. While the theoretical amount of computation is also the same, in practice the attention version increased step times by about 30\%.
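The stated total can be checked by hand. The breakdown below assumes bias-free query and output projections of shapes $d_{model} \times h_p d_{pk}$ and $h_p d_{pv} \times d_{model}$, with $d_{model}=512$ and $d_{ff}=2048$ as in the base model; this decomposition is an assumption that happens to reproduce the stated total, not a specification.
\begin{align*}
    \text{query projection:} \quad & d_{model} \cdot h_p d_{pk} = 512 \cdot 512 = 262144 \\
    \text{keys:} \quad & h_p \cdot n_p \cdot d_{pk} = 8 \cdot 1536 \cdot 64 = 786432 \\
    \text{values:} \quad & h_p \cdot n_p \cdot d_{pv} = 8 \cdot 1536 \cdot 64 = 786432 \\
    \text{output projection:} \quad & h_p d_{pv} \cdot d_{model} = 512 \cdot 512 = 262144 \\
    \text{total:} \quad & 2 \cdot 262144 + 2 \cdot 786432 = 2097152 = 2 \cdot d_{model} \cdot d_{ff}
\end{align*}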
In our second experiment, we used $h_p=16$ heads and $n_p=512$ key-value pairs for each attention head, keeping $d_{pk}=d_{pv}=64$ and again matching the total number of parameters in the base model.
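The same accounting, again assuming bias-free projections, can be applied to the configuration listed in the AOP$_2$ row of Table~\ref{tab:parameter_attention} ($h_p=16$, $d_{pk}=d_{pv}=64$, $n_p=512$, $d_{model}=512$):
\begin{align*}
    \text{query projection:} \quad & d_{model} \cdot h_p d_{pk} = 512 \cdot 1024 = 524288 \\
    \text{keys:} \quad & h_p \cdot n_p \cdot d_{pk} = 16 \cdot 512 \cdot 64 = 524288 \\
    \text{values:} \quad & h_p \cdot n_p \cdot d_{pv} = 16 \cdot 512 \cdot 64 = 524288 \\
    \text{output projection:} \quad & h_p d_{pv} \cdot d_{model} = 1024 \cdot 512 = 524288 \\
    \text{total:} \quad & 4 \cdot 524288 = 2097152
\end{align*}
Under these assumptions, the parameter budget is matched with 16 heads.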
The first experiment matched the base model's perplexity but scored 0.3 BLEU lower, while the second experiment was slightly better than the base model on both metrics (see Table~\ref{tab:parameter_attention}).
\begin{table}[h]
\caption{Replacing the position-wise feed-forward networks with multihead-attention-over-parameters produces similar results to the base model. All metrics are on the English-to-German translation development set, newstest2013.}
\label{tab:parameter_attention}
\begin{center}
\vspace{-2mm}
%\scalebox{1.0}{
\begin{tabular}{c|cccccc|cccc}
\hline\rule{0pt}{2.0ex}
 & \multirow{2}{*}{$\dmodel$} & \multirow{2}{*}{$\dff$} &
\multirow{2}{*}{$h_p$} & \multirow{2}{*}{$d_{pk}$} & \multirow{2}{*}{$d_{pv}$} &
 \multirow{2}{*}{$n_p$} &
 PPL & BLEU & params & training\\
 & & & & & & & (dev) & (dev) & $\times10^6$ & time \\
\hline\rule{0pt}{2.0ex}
base & 512 & 2048 & & & & & 4.92 & 25.8 & 65 & 12 hours\\
\hline\rule{0pt}{2.0ex}
AOP$_1$ & 512 & & 8 & 64 & 64 & 1536 & 4.92 & 25.5 & 65 & 16 hours\\
AOP$_2$ & 512 & & 16 & 64 & 64 & 512 & \textbf{4.86} & \textbf{25.9} & 65 & 16 hours \\
\hline
\end{tabular}
%}
\end{center}
\end{table}