Abstract

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 English-to-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.0 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature.
1 Introduction

Recurrent neural networks, long short-term memory [12] and gated recurrent [7] neural networks in particular, have been firmly established as state-of-the-art approaches in sequence modeling and transduction problems such as language modeling and machine translation [29, 2, 5]. Numerous efforts have since continued to push the boundaries of recurrent language models and encoder-decoder architectures [31, 21, 13].
∗Equal contribution. Listing order is random. Jakob proposed replacing RNNs with self-attention and started the effort to evaluate this idea. Ashish, with Illia, designed and implemented the first Transformer models and has been crucially involved in every aspect of this work. Noam proposed scaled dot-product attention, multi-head attention and the parameter-free position representation and became the other person involved in nearly every detail. Niki designed, implemented, tuned and evaluated countless model variants in our original codebase and tensor2tensor. Llion also experimented with novel model variants, was responsible for our initial codebase, and efficient inference and visualizations. Lukasz and Aidan spent countless long days designing various parts of and implementing tensor2tensor, replacing our earlier codebase, greatly improving results and massively accelerating our research.
†Work performed while at Google Brain.
‡Work performed while at Google Research.
Recurrent models typically factor computation along the symbol positions of the input and output sequences. Aligning the positions to steps in computation time, they generate a sequence of hidden states h_t, as a function of the previous hidden state h_{t-1} and the input for position t. This inherently sequential nature precludes parallelization within training examples, which becomes critical at longer sequence lengths, as memory constraints limit batching across examples. Recent work has achieved significant improvements in computational efficiency through factorization tricks [18] and conditional computation [26], while also improving model performance in the case of the latter. The fundamental constraint of sequential computation, however, remains.
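To make the sequential constraint concrete, the following is a minimal sketch (our illustration, not code from the paper; the shapes and the tanh cell are arbitrary choices) of how such a recurrent model produces its hidden states: step t cannot begin until step t-1 has produced h_{t-1}, so the loop cannot be parallelized across positions within one example.

```python
import numpy as np

def rnn_forward(x, W_xh, W_hh, b):
    """Vanilla RNN forward pass: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b).

    x: (seq_len, d_in); W_xh: (d_in, d_h); W_hh: (d_h, d_h); b: (d_h,).
    The time loop is inherently serial: each h_t depends on h_{t-1}.
    """
    h = np.zeros(W_hh.shape[0])
    states = []
    for t in range(x.shape[0]):                    # sequential in the position index t
        h = np.tanh(x[t] @ W_xh + h @ W_hh + b)
        states.append(h)
    return np.stack(states)                        # (seq_len, d_h)
```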
Attention mechanisms have become an integral part of compelling sequence modeling and transduction models in various tasks, allowing modeling of dependencies without regard to their distance in the input or output sequences [2, 16]. In all but a few cases [22], however, such attention mechanisms are used in conjunction with a recurrent network.
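For readers unfamiliar with the mechanism, here is a minimal sketch of dot-product attention (our illustration with arbitrary shapes; the precise formulation used in the Transformer is given in Section 3.2). The key property is that every query position is connected to every key position in a single step, independent of their distance in the sequence.

```python
import numpy as np

def attention(q, k, v):
    """Dot-product attention: a weighted average of values.

    q: (n_q, d_k) queries; k: (n_k, d_k) keys; v: (n_k, d_v) values.
    The (n_q, n_k) score matrix relates every query position to every key
    position directly, regardless of how far apart they are.
    """
    scores = q @ k.T / np.sqrt(k.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax over key positions
    return weights @ v                                        # (n_q, d_v)
```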
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality after being trained for as little as twelve hours on eight P100 GPUs.
2 Background

The goal of reducing sequential computation also forms the foundation of the Extended Neural GPU [20], ByteNet [15] and ConvS2S [8], all of which use convolutional neural networks as basic building blocks, computing hidden representations in parallel for all input and output positions. In these models, the number of operations required to relate signals from two arbitrary input or output positions grows with the distance between positions, linearly for ConvS2S and logarithmically for ByteNet. This makes it more difficult to learn dependencies between distant positions [11]. In the Transformer this is reduced to a constant number of operations, albeit at the cost of reduced effective resolution due to averaging attention-weighted positions, an effect we counteract with Multi-Head Attention as described in Section 3.2.
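As a rough sketch of how multi-head attention counteracts this averaging effect (again our own illustration with hypothetical shapes; the actual parameterization appears in Section 3.2), the sequence attends to itself in several lower-dimensional subspaces in parallel and the per-head results are concatenated, so any two positions are still related in a constant number of operations.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x, W_q, W_k, W_v, W_o, n_heads):
    """Illustrative multi-head self-attention over a single sequence.

    x: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); d_model % n_heads == 0.
    Each head attends over all positions in its own d_model/n_heads-dimensional
    subspace; concatenating the heads restores resolution that a single
    attention-weighted average would lose.
    """
    n, d_model = x.shape
    d_head = d_model // n_heads
    q, k, v = x @ W_q, x @ W_k, x @ W_v
    heads = []
    for h in range(n_heads):
        s = slice(h * d_head, (h + 1) * d_head)
        w = softmax(q[:, s] @ k[:, s].T / np.sqrt(d_head))  # (n, n) weights for this head
        heads.append(w @ v[:, s])                           # (n, d_head)
    return np.concatenate(heads, axis=-1) @ W_o             # (n, d_model)
```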
Self-attention, sometimes called intra-attention, is an attention mechanism relating different positions of a single sequence in order to compute a representation of the sequence. Self-attention has been used successfully in a variety of tasks including reading comprehension, abstractive summarization, textual entailment and learning task-independent sentence representations [4, 22, 23, 19].
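Concretely, self-attention amounts to computing, for one sequence, a matrix of pairwise weights that says how much each position draws on every other position when forming its representation. A toy example with arbitrary shapes (our illustration, not from the paper):

```python
import numpy as np

# A hypothetical 4-position, 6-dimensional sequence attending to itself:
# row i of `weights` says how strongly position i attends to each position.
x = np.random.randn(4, 6)
scores = x @ x.T / np.sqrt(x.shape[-1])
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = weights / weights.sum(axis=-1, keepdims=True)   # each row sums to 1
representation = weights @ x                              # (4, 6) contextualized sequence
```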
End-to-end memory networks are based on a recurrent attention mechanism instead of sequence-aligned recurrence and have been shown to perform well on simple-language question answering and language modeling tasks [28].
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution. In the following sections, we will describe the Transformer, motivate self-attention and discuss its advantages over models such as [14, 15] and [8].