2.1.3.3 Transformer Network

A Transformer [473] is a sequence-to-sequence model that uses an attention mechanism to capture long-range dependencies in the input sequence while benefiting from increased parallelization. Traditionally, it consists of an encoder and a decoder, each composed of multiple layers of self-attention and feed-forward layers.

Attention is a mechanism that allows for soft selection of relevant information from a set of candidates, e.g., tokens in a document, based on a query, e.g., a token in the document. The scaled dot-product attention is defined for a sequence of length $n$ as $\mathrm{Att}(Q, K, V) = \sum_{i=1}^{n} \alpha_i V_i$. It utilizes three learnable weight matrices, each multiplied with all token embeddings in the sequence, to build queries $Q \in \mathbb{R}^{n \times d_q}$, keys $K \in \mathbb{R}^{n \times d_q}$, and values $V \in \mathbb{R}^{n \times d_v}$. The output of the attention mechanism is a weighted sum of the values, where the attention weight of the $i$-th key is computed by normalizing the dot product between the query and key vectors, $\alpha_i = \frac{\exp(Q_i^\top K_i)}{\sum_{j=1}^{n} \exp(Q_j^\top K_j)}$. For training stability, the dot product is typically scaled by the square root of the dimensionality of the query and key vectors, $\sqrt{d_q}$. This is followed by a feed-forward layer to capture non-linear relationships between the tokens in the sequence.

There exist different forms of attention, depending on the type of relationship that is captured. Self-attention computes the attention of each token w.r.t. all other tokens in the same sequence, so the representation of each token is updated based on the other tokens in the sequence. Multi-head attention is a set of $h$ attention layers, which every Transformer uses to concurrently capture different types of relationships; their outputs are concatenated after the parallelized processing. Cross-attention computes the attention of each token in one sequence w.r.t. all tokens in another sequence and is used in encoder-decoder Transformer architectures, e.g., for summarization and machine translation. Specific to decoder layers, masked attention prevents the decoder from attending to future tokens in the sequence by masking the upper triangle of the attention matrix (a minimal sketch of this computation is given at the end of this section).

A major downside of Transformers is the quadratic complexity of the attention mechanism (Figure 2.3), which makes them computationally inefficient for long sequences. This has been addressed by a wealth of techniques [120], such as sparsifying attention, introducing recurrence, downsampling, and random or low-rank approximations.

Position Embeddings are indispensable for Transformers to be able to process sequences, as they do not have any notion of order or position of tokens in a sequence. The most common type of position embedding is a sinusoidal
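
To make the attention computation above concrete, the following is a minimal NumPy sketch (not taken from the thesis) of scaled dot-product attention, including the optional upper-triangular (causal) mask used in decoder layers. It assumes Q, K, and V are already the projected queries, keys, and values; function and variable names are illustrative. Note that it uses the standard per-query softmax over all keys, which is the row-wise generalization of the simplified weight formula given above.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=False):
    """Q, K: (n, d_q); V: (n, d_v). Returns an (n, d_v) array."""
    d_q = Q.shape[-1]
    # Dot products between every query and every key, scaled by sqrt(d_q)
    # for training stability.
    scores = Q @ K.T / np.sqrt(d_q)                    # (n, n)
    if causal:
        # Mask the upper triangle so position i cannot attend to positions j > i.
        n = scores.shape[0]
        mask = np.triu(np.ones((n, n), dtype=bool), k=1)
        scores = np.where(mask, -np.inf, scores)
    alpha = softmax(scores, axis=-1)                   # attention weights
    return alpha @ V                                   # weighted sum of the values

# Toy usage: a sequence of n = 4 tokens with d_q = d_v = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V, causal=True)
print(out.shape)  # (4, 8)
```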
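
Multi-head attention can be sketched on top of the same function: $h$ heads attend in parallel over separate slices of the projected queries, keys, and values, and their outputs are concatenated. The projection matrices W_q, W_k, W_v, W_o and the head-splitting scheme below are illustrative assumptions rather than the thesis' exact formulation.

```python
def multi_head_attention(X, W_q, W_k, W_v, W_o, h):
    """X: (n, d_model); W_q, W_k, W_v, W_o: (d_model, d_model); h divides d_model.
    Projects X, splits the projections into h heads, attends in each head
    independently, then concatenates and applies the output projection."""
    n, d_model = X.shape
    d_head = d_model // h
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    heads = []
    for i in range(h):
        s = slice(i * d_head, (i + 1) * d_head)
        heads.append(scaled_dot_product_attention(Q[:, s], K[:, s], V[:, s]))
    return np.concatenate(heads, axis=-1) @ W_o        # (n, d_model)
```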