# Transformers-based Encoder-Decoder Models

#
**Transformer-based Encoder-Decoder Models**

```
!pip install transformers==4.2.1
!pip install sentencepiece==0.1.95
```

The *transformer-based* encoder-decoder model was introduced by Vaswani
et al. in the famous Attention is all you need
paper and is today the *de-facto*
standard encoder-decoder architecture in natural language processing
(NLP).

Recently, there has been a lot of research on different *pre-training*
objectives for transformer-based encoder-decoder models, *e.g.* T5,
Bart, Pegasus, ProphetNet, Marge, *etc*..., but the model architecture
has stayed largely the same.

The goal of the blog post is to give an **in-detail** explanation of
**how** the transformer-based encoder-decoder architecture models
*sequence-to-sequence* problems. We will focus on the mathematical model
defined by the architecture and how the model can be used in inference.
Along the way, we will give some background on sequence-to-sequence
models in NLP and break down the *transformer-based* encoder-decoder
architecture into its **encoder** and **decoder** parts. We provide many
illustrations and establish the link between the theory of
*transformer-based* encoder-decoder models and their practical usage in
🤗Transformers for inference. Note that this blog post does *not* explain
how such models can be trained - this will be the topic of a future blog
post.

Transformer-based encoder-decoder models are the result of years of
research on *representation learning* and *model architectures*. This
notebook provides a short summary of the history of neural
encoder-decoder models. For more context, the reader is advised to read
this awesome blog
post by
Sebastion Ruder. Additionally, a basic understanding of the
*self-attention architecture* is recommended. The following blog post by
Jay Alammar serves as a good refresher on the original Transformer model
here.

At the time of writing this notebook, 🤗Transformers comprises the
encoder-decoder models *T5*, *Bart*, *MarianMT*, and *Pegasus*, which
are summarized in the docs under model
summaries.

The notebook is divided into four parts:

**Background**-*A short history of neural encoder-decoder models is given with a focus on RNN-based models.***Encoder-Decoder**-*The transformer-based encoder-decoder model is presented and it is explained how the model is used for inference.***Encoder**-*The encoder part of the model is explained in detail.***Decoder**-*The decoder part of the model is explained in detail.*

Each part builds upon the previous part, but can also be read on its own.

##
**Background**

Tasks in natural language generation (NLG), a subfield of NLP, are best
expressed as sequence-to-sequence problems. Such tasks can be defined as
finding a model that maps a sequence of input words to a sequence of
target words. Some classic examples are *summarization* and
*translation*. In the following, we assume that each word is encoded
into a vector representation. $n$ input words can then be represented as
a sequence of $n$ input vectors:

$\mathbf{X}_{1:n} = \{\mathbf{x}_1, \ldots, \mathbf{x}_n\}.$

Consequently, sequence-to-sequence problems can be solved by finding a mapping $f$ from an input sequence of $n$ vectors $\mathbf{X}_{1:n}$ to a sequence of $m$ target vectors $\mathbf{Y}_{1:m}$, whereas the number of target vectors $m$ is unknown apriori and depends on the input sequence:

$f: \mathbf{X}_{1:n} \to \mathbf{Y}_{1:m}.$

Sutskever et al. (2014) noted that deep neural networks (DNN)s, "*despite their flexibility and power can only define a mapping whose inputs and targets can be sensibly encoded with vectors of fixed dimensionality.*" ${}^1$

Using a DNN model ${}^2$ to solve sequence-to-sequence problems would
therefore mean that the number of target vectors $m$ has to be known
*apriori* and would have to be independent of the input $\mathbf{X}_{1:n}$. This is suboptimal because, for tasks in NLG, the
number of target words usually depends on the input $\mathbf{X}_{1:n}$
and not just on the input length $n$. *E.g.*, an article of 1000 words
can be summarized to both 200 words and 100 words depending on its
content.

In 2014, Cho et al. and
Sutskever et al. proposed to use an
encoder-decoder model purely based on recurrent neural networks (RNNs)
for *sequence-to-sequence* tasks. In contrast to DNNS, RNNs are capable
of modeling a mapping to a variable number of target vectors. Let's
dive a bit deeper into the functioning of RNN-based encoder-decoder
models.

During inference, the encoder RNN encodes an input sequence $\mathbf{X}_{1:n}$ by successively updating its *hidden state* ${}^3$.
After having processed the last input vector $\mathbf{x}_n$, the
encoder's hidden state defines the input encoding $\mathbf{c}$. Thus,
the encoder defines the mapping:

$f_{\theta_{enc}}: \mathbf{X}_{1:n} \to \mathbf{c}.$

Then, the decoder's hidden state is initialized with the input encoding and during inference, the decoder RNN is used to auto-regressively generate the target sequence. Let's explain.

Mathematically, the decoder defines the probability distribution of a target sequence $\mathbf{Y}_{1:m}$ given the hidden state $\mathbf{c}$:

$p_{\theta_{dec}}(\mathbf{Y}_{1:m} |\mathbf{c}).$

By Bayes' rule the distribution can be decomposed into conditional distributions of single target vectors as follows:

$p_{\theta_{dec}}(\mathbf{Y}_{1:m} |\mathbf{c}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}).$

Thus, if the architecture can model the conditional distribution of the next target vector, given all previous target vectors:

$p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}), \forall i \in \{1, \ldots, m\},$

then it can model the distribution of any target vector sequence given the hidden state $\mathbf{c}$ by simply multiplying all conditional probabilities.

So how does the RNN-based decoder architecture model $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c})$?

In computational terms, the model sequentially maps the previous inner
hidden state $\mathbf{c}_{i-1}$ and the previous target vector $\mathbf{y}_{i-1}$ to the current inner hidden state $\mathbf{c}_i$ and a
*logit vector* $\mathbf{l}_i$ (shown in dark red below):

$f_{\theta_{\text{dec}}}(\mathbf{y}_{i-1}, \mathbf{c}_{i-1}) \to \mathbf{l}_i, \mathbf{c}_i.$
$\mathbf{c}_0$ is thereby defined as $\mathbf{c}$ being the output
hidden state of the RNN-based encoder. Subsequently, the *softmax*
operation is used to transform the logit vector $\mathbf{l}_i$ to a
conditional probablity distribution of the next target vector:

$p(\mathbf{y}_i | \mathbf{l}_i) = \textbf{Softmax}(\mathbf{l}_i), \text{ with } \mathbf{l}_i = f_{\theta_{\text{dec}}}(\mathbf{y}_{i-1}, \mathbf{c}_{\text{prev}}).$

For more detail on the logit vector and the resulting probability
distribution, please see footnote ${}^4$. From the above equation, we
can see that the distribution of the current target vector $\mathbf{y}_i$ is directly conditioned on the previous target vector $\mathbf{y}_{i-1}$ and the previous hidden state $\mathbf{c}_{i-1}$.
Because the previous hidden state $\mathbf{c}_{i-1}$ depends on all
previous target vectors $\mathbf{y}_0, \ldots, \mathbf{y}_{i-2}$, it can
be stated that the RNN-based decoder *implicitly* (*e.g.* *indirectly*)
models the conditional distribution $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c})$.

The space of possible target vector sequences $\mathbf{Y}_{1:m}$ is prohibitively large so that at inference, one has to rely on decoding methods ${}^5$ that efficiently sample high probability target vector sequences from $p_{\theta_{dec}}(\mathbf{Y}_{1:m} |\mathbf{c})$.

Given such a decoding method, during inference, the next input vector $\mathbf{y}_i$ can then be sampled from $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c})$
and is consequently appended to the input sequence so that the decoder
RNN then models $p_{\theta_{\text{dec}}}(\mathbf{y}_{i+1} | \mathbf{Y}_{0: i}, \mathbf{c})$
to sample the next input vector $\mathbf{y}_{i+1}$ and so on in an
*auto-regressive* fashion.

An important feature of RNN-based encoder-decoder models is the
definition of *special* vectors, such as the $\text{EOS}$ and $\text{BOS}$ vector. The $\text{EOS}$ vector often represents the final
input vector $\mathbf{x}_n$ to "cue" the encoder that the input
sequence has ended and also defines the end of the target sequence. As
soon as the $\text{EOS}$ is sampled from a logit vector, the generation
is complete. The $\text{BOS}$ vector represents the input vector $\mathbf{y}_0$ fed to the decoder RNN at the very first decoding step.
To output the first logit $\mathbf{l}_1$, an input is required and since
no input has been generated at the first step a special $\text{BOS}$
input vector is fed to the decoder RNN. Ok - quite complicated! Let's
illustrate and walk through an example.

The unfolded RNN encoder is colored in green and the unfolded RNN decoder is colored in red.

The English sentence "I want to buy a car", represented by $\mathbf{x}_1 = \text{I}$, $\mathbf{x}_2 = \text{want}$, $\mathbf{x}_3 = \text{to}$, $\mathbf{x}_4 = \text{buy}$, $\mathbf{x}_5 = \text{a}$, $\mathbf{x}_6 = \text{car}$ and $\mathbf{x}_7 = \text{EOS}$ is translated into German: "Ich will ein
Auto kaufen" defined as $\mathbf{y}_0 = \text{BOS}$, $\mathbf{y}_1 = \text{Ich}$, $\mathbf{y}_2 = \text{will}$, $\mathbf{y}_3 = \text{ein}$, $\mathbf{y}_4 = \text{Auto}, \mathbf{y}_5 = \text{kaufen}$ and $\mathbf{y}_6=\text{EOS}$. To begin with, the input vector $\mathbf{x}_1 = \text{I}$ is processed by the encoder RNN and updates
its hidden state. Note that because we are only interested in the final
encoder's hidden state $\mathbf{c}$, we can disregard the RNN
encoder's target vector. The encoder RNN then processes the rest of the
input sentence $\text{want}$, $\text{to}$, $\text{buy}$, $\text{a}$, $\text{car}$, $\text{EOS}$ in the same fashion, updating its hidden
state at each step until the vector $\mathbf{x}_7={EOS}$ is reached ${}^6$. In the illustration above the horizontal arrow connecting the
unfolded encoder RNN represents the sequential updates of the hidden
state. The final hidden state of the encoder RNN, represented by $\mathbf{c}$ then completely defines the *encoding* of the input
sequence and is used as the initial hidden state of the decoder RNN.
This can be seen as *conditioning* the decoder RNN on the encoded input.

To generate the first target vector, the decoder is fed the $\text{BOS}$
vector, illustrated as $\mathbf{y}_0$ in the design above. The target
vector of the RNN is then further mapped to the logit vector $\mathbf{l}_1$ by means of the *LM Head* feed-forward layer to define
the conditional distribution of the first target vector as explained
above:

$p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \mathbf{c}).$

The word $\text{Ich}$ is sampled (shown by the grey arrow, connecting $\mathbf{l}_1$ and $\mathbf{y}_1$) and consequently the second target vector can be sampled:

$\text{will} \sim p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \text{Ich}, \mathbf{c}).$

And so on until at step $i=6$, the $\text{EOS}$ vector is sampled from $\mathbf{l}_6$ and the decoding is finished. The resulting target sequence amounts to $\mathbf{Y}_{1:6} = \{\mathbf{y}_1, \ldots, \mathbf{y}_6\}$, which is "Ich will ein Auto kaufen" in our example above.

To sum it up, an RNN-based encoder-decoder model, represented by $f_{\theta_{\text{enc}}}$ and $p_{\theta_{\text{dec}}}$ defines the distribution $p(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n})$ by factorization:

$p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{X}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{c}), \text{ with } \mathbf{c}=f_{\theta_{enc}}(X).$

During inference, efficient decoding methods can auto-regressively generate the target sequence $\mathbf{Y}_{1:m}$.

The RNN-based encoder-decoder model took the NLG community by storm. In 2016, Google announced to fully replace its heavily feature engineered translation service by a single RNN-based encoder-decoder model (see here).

Nevertheless, RNN-based encoder-decoder models have two pitfalls. First,
RNNs suffer from the vanishing gradient problem, making it very
difficult to capture long-range dependencies, *cf.* Hochreiter et al.
(2001). Second,
the inherent recurrent architecture of RNNs prevents efficient
parallelization when encoding, *cf.* Vaswani et al.
(2017).

${}^1$ The original quote from the paper is "*Despite their flexibility
and power, DNNs can only be applied to problems whose inputs and targets
can be sensibly encoded with vectors of fixed dimensionality*", which
is slightly adapted here.

${}^2$ The same holds essentially true for convolutional neural networks (CNNs). While an input sequence of variable length can be fed into a CNN, the dimensionality of the target will always be dependent on the input dimensionality or fixed to a specific value.

${}^3$ At the first step, the hidden state is initialized as a zero vector and fed to the RNN together with the first input vector $\mathbf{x}_1$.

${}^4$ A neural network can define a probability distribution over all
words, *i.e.* $p(\mathbf{y} | \mathbf{c}, \mathbf{Y}_{0: i-1})$ as
follows. First, the network defines a mapping from the inputs $\mathbf{c}, \mathbf{Y}_{0: i-1}$ to an embedded vector representation $\mathbf{y'}$, which corresponds to the RNN target vector. The embedded
vector representation $\mathbf{y'}$ is then passed to the "language
model head" layer, which means that it is multiplied by the *word
embedding matrix*, *i.e.* $\mathbf{Y}^{\text{vocab}}$, so that a score
between $\mathbf{y'}$ and each encoded vector $\mathbf{y} \in \mathbf{Y}^{\text{vocab}}$ is computed. The resulting
vector is called the logit vector $\mathbf{l} = \mathbf{Y}^{\text{vocab}} \mathbf{y'}$ and can be
mapped to a probability distribution over all words by applying a
softmax operation: $p(\mathbf{y} | \mathbf{c}) = \text{Softmax}(\mathbf{Y}^{\text{vocab}} \mathbf{y'}) = \text{Softmax}(\mathbf{l})$.

${}^5$ Beam-search decoding is an example of such a decoding method. Different decoding methods are out of scope for this notebook. The reader is advised to refer to this interactive notebook on decoding methods.

${}^6$ Sutskever et al. (2014) reverses the order of the input so that in the above example the input vectors would correspond to $\mathbf{x}_1 = \text{car}$, $\mathbf{x}_2 = \text{a}$, $\mathbf{x}_3 = \text{buy}$, $\mathbf{x}_4 = \text{to}$, $\mathbf{x}_5 = \text{want}$, $\mathbf{x}_6 = \text{I}$ and $\mathbf{x}_7 = \text{EOS}$. The motivation is to allow for a shorter connection between corresponding word pairs such as $\mathbf{x}_6 = \text{I}$ and $\mathbf{y}_1 = \text{Ich}$. The research group emphasizes that the reversal of the input sequence was a key reason for their model's improved performance on machine translation.

##
**Encoder-Decoder**

In 2017, Vaswani et al. introduced the **Transformer** and thereby gave
birth to *transformer-based* encoder-decoder models.

Analogous to RNN-based encoder-decoder models, transformer-based
encoder-decoder models consist of an encoder and a decoder which are
both stacks of *residual attention blocks*. The key innovation of
transformer-based encoder-decoder models is that such residual attention
blocks can process an input sequence $\mathbf{X}_{1:n}$ of variable
length $n$ without exhibiting a recurrent structure. Not relying on a
recurrent structure allows transformer-based encoder-decoders to be
highly parallelizable, which makes the model orders of magnitude more
computationally efficient than RNN-based encoder-decoder models on
modern hardware.

As a reminder, to solve a *sequence-to-sequence* problem, we need to
find a mapping of an input sequence $\mathbf{X}_{1:n}$ to an output
sequence $\mathbf{Y}_{1:m}$ of variable length $m$. Let's see how
transformer-based encoder-decoder models are used to find such a
mapping.

Similar to RNN-based encoder-decoder models, the transformer-based encoder-decoder models define a conditional distribution of target vectors $\mathbf{Y}_{1:n}$ given an input sequence $\mathbf{X}_{1:n}$:

$p_{\theta_{\text{enc}}, \theta_{\text{dec}}}(\mathbf{Y}_{1:m} | \mathbf{X}_{1:n}).$

The transformer-based encoder part encodes the input sequence $\mathbf{X}_{1:n}$ to a *sequence* of *hidden states* $\mathbf{\overline{X}}_{1:n}$, thus defining the mapping:

$f_{\theta_{\text{enc}}}: \mathbf{X}_{1:n} \to \mathbf{\overline{X}}_{1:n}.$

The transformer-based decoder part then models the conditional probability distribution of the target vector sequence $\mathbf{Y}_{1:n}$ given the sequence of encoded hidden states $\mathbf{\overline{X}}_{1:n}$:

$p_{\theta_{dec}}(\mathbf{Y}_{1:n} | \mathbf{\overline{X}}_{1:n}).$

By Bayes' rule, this distribution can be factorized to a product of conditional probability distribution of the target vector $\mathbf{y}_i$ given the encoded hidden states $\mathbf{\overline{X}}_{1:n}$ and all previous target vectors $\mathbf{Y}_{0:i-1}$:

$p_{\theta_{dec}}(\mathbf{Y}_{1:n} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{n} p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}).$

The transformer-based decoder hereby maps the sequence of encoded hidden
states $\mathbf{\overline{X}}_{1:n}$ and all previous target vectors $\mathbf{Y}_{0:i-1}$ to the *logit* vector $\mathbf{l}_i$. The logit
vector $\mathbf{l}_i$ is then processed by the *softmax* operation to
define the conditional distribution $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$,
just as it is done for RNN-based decoders. However, in contrast to
RNN-based decoders, the distribution of the target vector $\mathbf{y}_i$
is *explicitly* (or directly) conditioned on all previous target vectors $\mathbf{y}_0, \ldots, \mathbf{y}_{i-1}$ as we will see later in more
detail. The 0th target vector $\mathbf{y}_0$ is hereby represented by a
special "begin-of-sentence" $\text{BOS}$ vector.

Having defined the conditional distribution $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$,
we can now *auto-regressively* generate the output and thus define a
mapping of an input sequence $\mathbf{X}_{1:n}$ to an output sequence $\mathbf{Y}_{1:m}$ at inference.

Let's visualize the complete process of *auto-regressive* generation of
*transformer-based* encoder-decoder models.

The transformer-based encoder is colored in green and the transformer-based decoder is colored in red. As in the previous section, we show how the English sentence "I want to buy a car", represented by $\mathbf{x}_1 = \text{I}$, $\mathbf{x}_2 = \text{want}$, $\mathbf{x}_3 = \text{to}$, $\mathbf{x}_4 = \text{buy}$, $\mathbf{x}_5 = \text{a}$, $\mathbf{x}_6 = \text{car}$, and $\mathbf{x}_7 = \text{EOS}$ is translated into German: "Ich will ein Auto kaufen" defined as $\mathbf{y}_0 = \text{BOS}$, $\mathbf{y}_1 = \text{Ich}$, $\mathbf{y}_2 = \text{will}$, $\mathbf{y}_3 = \text{ein}$, $\mathbf{y}_4 = \text{Auto}, \mathbf{y}_5 = \text{kaufen}$, and $\mathbf{y}_6=\text{EOS}$.

To begin with, the encoder processes the complete input sequence $\mathbf{X}_{1:7}$ = "I want to buy a car" (represented by the light
green vectors) to a contextualized encoded sequence $\mathbf{\overline{X}}_{1:7}$. *E.g.* $\mathbf{\overline{x}}_4$ defines
an encoding that depends not only on the input $\mathbf{x}_4$ = "buy",
but also on all other words "I", "want", "to", "a", "car" and
"EOS", *i.e.* the context.

Next, the input encoding $\mathbf{\overline{X}}_{1:7}$ together with the
BOS vector, *i.e.* $\mathbf{y}_0$, is fed to the decoder. The decoder
processes the inputs $\mathbf{\overline{X}}_{1:7}$ and $\mathbf{y}_0$ to
the first logit $\mathbf{l}_1$ (shown in darker red) to define the
conditional distribution of the first target vector $\mathbf{y}_1$:

$p_{\theta_{enc, dec}}(\mathbf{y} | \mathbf{y}_0, \mathbf{X}_{1:7}) = p_{\theta_{enc, dec}}(\mathbf{y} | \text{BOS}, \text{I want to buy a car EOS}) = p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \mathbf{\overline{X}}_{1:7}).$

Next, the first target vector $\mathbf{y}_1$ = $\text{Ich}$ is sampled from the distribution (represented by the grey arrows) and can now be fed to the decoder again. The decoder now processes both $\mathbf{y}_0$ = "BOS" and $\mathbf{y}_1$ = "Ich" to define the conditional distribution of the second target vector $\mathbf{y}_2$:

$p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich}, \mathbf{\overline{X}}_{1:7}).$

We can sample again and produce the target vector $\mathbf{y}_2$ = "will". We continue in auto-regressive fashion until at step 6 the EOS vector is sampled from the conditional distribution:

$\text{EOS} \sim p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich will ein Auto kaufen}, \mathbf{\overline{X}}_{1:7}).$

And so on in auto-regressive fashion.

It is important to understand that the encoder is only used in the first forward pass to map $\mathbf{X}_{1:n}$ to $\mathbf{\overline{X}}_{1:n}$. As of the second forward pass, the decoder can directly make use of the previously calculated encoding $\mathbf{\overline{X}}_{1:n}$. For clarity, let's illustrate the first and the second forward pass for our example above.

As can be seen, only in step $i=1$ do we have to encode "I want to buy a car EOS" to $\mathbf{\overline{X}}_{1:7}$. At step $i=2$, the contextualized encodings of "I want to buy a car EOS" are simply reused by the decoder.

In 🤗Transformers, this auto-regressive generation is done under-the-hood
when calling the `.generate()`

method. Let's use one of our translation
models to see this in action.

```
from transformers import MarianMTModel, MarianTokenizer
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
# create ids of encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
# translate example
output_ids = model.generate(input_ids)[0]
# decode and print
print(tokenizer.decode(output_ids))
```

*Output:*

```
<pad> Ich will ein Auto kaufen
```

Calling `.generate()`

does many things under-the-hood. First, it passes
the `input_ids`

to the encoder. Second, it passes a pre-defined token, which is the $\text{<pad>}$ symbol in the case of
`MarianMTModel`

along with the encoded `input_ids`

to the decoder.
Third, it applies the beam search decoding mechanism to
auto-regressively sample the next output word of the *last* decoder
output ${}^1$. For more detail on how beam search decoding works, one is
advised to read this blog
post.

In the Appendix, we have included a code snippet that shows how a simple
generation method can be implemented "from scratch". To fully
understand how *auto-regressive* generation works under-the-hood, it is
highly recommended to read the Appendix.

To sum it up:

- The transformer-based encoder defines a mapping from the input sequence $\mathbf{X}_{1:n}$ to a contextualized encoding sequence $\mathbf{\overline{X}}_{1:n}$.
- The transformer-based decoder defines the conditional distribution $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$.
- Given an appropriate decoding mechanism, the output sequence $\mathbf{Y}_{1:m}$ can auto-regressively be sampled from $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}), \forall i \in \{1, \ldots, m\}$.

Great, now that we have gotten a general overview of how
*transformer-based* encoder-decoder models work, we can dive deeper into
both the encoder and decoder part of the model. More specifically, we
will see exactly how the encoder makes use of the self-attention layer
to yield a sequence of context-dependent vector encodings and how
self-attention layers allow for efficient parallelization. Then, we will
explain in detail how the self-attention layer works in the decoder
model and how the decoder is conditioned on the encoder's output with
*cross-attention* layers to define the conditional distribution $p_{\theta_{\text{dec}}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n})$.
Along, the way it will become obvious how transformer-based
encoder-decoder models solve the long-range dependencies problem of
RNN-based encoder-decoder models.

${}^1$ In the case of `"Helsinki-NLP/opus-mt-en-de"`

, the decoding
parameters can be accessed
here,
where we can see that model applies beam search with `num_beams=6`

.

##
**Encoder**

As mentioned in the previous section, the *transformer-based* encoder
maps the input sequence to a contextualized encoding sequence:

$f_{\theta_{\text{enc}}}: \mathbf{X}_{1:n} \to \mathbf{\overline{X}}_{1:n}.$

Taking a closer look at the architecture, the transformer-based encoder
is a stack of residual *encoder blocks*. Each encoder block consists of
a **bi-directional** self-attention layer, followed by two feed-forward
layers. For simplicity, we disregard the normalization layers in this
notebook. Also, we will not further discuss the role of the two
feed-forward layers, but simply see it as a final vector-to-vector
mapping required in each encoder block ${}^1$. The bi-directional
self-attention layer puts each input vector $\mathbf{x'}_j, \forall j \in \{1, \ldots, n\}$ into relation with all
input vectors $\mathbf{x'}_1, \ldots, \mathbf{x'}_n$ and by doing so
transforms the input vector $\mathbf{x'}_j$ to a more "refined"
contextual representation of itself, defined as $\mathbf{x''}_j$.
Thereby, the first encoder block transforms each input vector of the
input sequence $\mathbf{X}_{1:n}$ (shown in light green below) from a
*context-independent* vector representation to a *context-dependent*
vector representation, and the following encoder blocks further refine
this contextual representation until the last encoder block outputs the
final contextual encoding $\mathbf{\overline{X}}_{1:n}$ (shown in darker
green below).

Let's visualize how the encoder processes the input sequence "I want to buy a car EOS" to a contextualized encoding sequence. Similar to RNN-based encoders, transformer-based encoders also add a special "end-of-sequence" input vector to the input sequence to hint to the model that the input vector sequence is finished ${}^2$.

Our exemplary *transformer-based* encoder is composed of three encoder
blocks, whereas the second encoder block is shown in more detail in the
red box on the right for the first three input vectors $\mathbf{x}_1, \mathbf{x}_2 and \mathbf{x}_3$. The bi-directional
self-attention mechanism is illustrated by the fully-connected graph in
the lower part of the red box and the two feed-forward layers are shown
in the upper part of the red box. As stated before, we will focus only
on the bi-directional self-attention mechanism.

As can be seen each output vector of the self-attention layer $\mathbf{x''}_i, \forall i \in \{1, \ldots, 7\}$ depends *directly* on
*all* input vectors $\mathbf{x'}_1, \ldots, \mathbf{x'}_7$. This means,
*e.g.* that the input vector representation of the word "want", *i.e.* $\mathbf{x'}_2$, is put into direct relation with the word "buy",
*i.e.* $\mathbf{x'}_4$, but also with the word "I",*i.e.* $\mathbf{x'}_1$. The output vector representation of "want", *i.e.* $\mathbf{x''}_2$, thus represents a more refined contextual
representation for the word "want".

Let's take a deeper look at how bi-directional self-attention works. Each input vector $\mathbf{x'}_i$ of an input sequence $\mathbf{X'}_{1:n}$ of an encoder block is projected to a key vector $\mathbf{k}_i$, value vector $\mathbf{v}_i$ and query vector $\mathbf{q}_i$ (shown in orange, blue, and purple respectively below) through three trainable weight matrices $\mathbf{W}_q, \mathbf{W}_v, \mathbf{W}_k$:

$\mathbf{q}_i = \mathbf{W}_q \mathbf{x'}_i,$ $\mathbf{v}_i = \mathbf{W}_v \mathbf{x'}_i,$ $\mathbf{k}_i = \mathbf{W}_k \mathbf{x'}_i,$ $\forall i \in \{1, \ldots n \}.$

Note, that the **same** weight matrices are applied to each input vector $\mathbf{x}_i, \forall i \in \{i, \ldots, n\}$. After projecting each
input vector $\mathbf{x}_i$ to a query, key, and value vector, each
query vector $\mathbf{q}_j, \forall j \in \{1, \ldots, n\}$ is compared
to all key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$. The more
similar one of the key vectors $\mathbf{k}_1, \ldots \mathbf{k}_n$ is to
a query vector $\mathbf{q}_j$, the more important is the corresponding
value vector $\mathbf{v}_j$ for the output vector $\mathbf{x''}_j$. More
specifically, an output vector $\mathbf{x''}_j$ is defined as the
weighted sum of all value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$
plus the input vector $\mathbf{x'}_j$. Thereby, the weights are
proportional to the cosine similarity between $\mathbf{q}_j$ and the
respective key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$, which is
mathematically expressed by $\textbf{Softmax}(\mathbf{K}_{1:n}^\intercal \mathbf{q}_j)$ as
illustrated in the equation below. For a complete description of the
self-attention layer, the reader is advised to take a look at
this blog post or
the original paper.

Alright, this sounds quite complicated. Let's illustrate the
bi-directional self-attention layer for one of the query vectors of our
example above. For simplicity, it is assumed that our exemplary
*transformer-based* decoder uses only a single attention head
`config.num_heads = 1`

and that no normalization is applied.

On the left, the previously illustrated second encoder block is shown
again and on the right, an in detail visualization of the bi-directional
self-attention mechanism is given for the second input vector $\mathbf{x'}_2$ that corresponds to the input word "want". At first
all input vectors $\mathbf{x'}_1, \ldots, \mathbf{x'}_7$ are projected
to their respective query vectors $\mathbf{q}_1, \ldots, \mathbf{q}_7$
(only the first three query vectors are shown in purple above), value
vectors $\mathbf{v}_1, \ldots, \mathbf{v}_7$ (shown in blue), and key
vectors $\mathbf{k}_1, \ldots, \mathbf{k}_7$ (shown in orange). The
query vector $\mathbf{q}_2$ is then multiplied by the transpose of all
key vectors, *i.e.* $\mathbf{K}_{1:7}^{\intercal}$ followed by the
softmax operation to yield the *self-attention weights*. The
self-attention weights are finally multiplied by the respective value
vectors and the input vector $\mathbf{x'}_2$ is added to output the
"refined" representation of the word "want", *i.e.* $\mathbf{x''}_2$
(shown in dark green on the right). The whole equation is illustrated in
the upper part of the box on the right. The multiplication of $\mathbf{K}_{1:7}^{\intercal}$ and $\mathbf{q}_2$ thereby makes it
possible to compare the vector representation of "want" to all other
input vector representations "I", "to", "buy", "a", "car",
"EOS" so that the self-attention weights mirror the importance each of
the other input vector representations $\mathbf{x'}_j \text{, with } j \ne 2$ for the refined representation $\mathbf{x''}_2$ of the word "want".

To further understand the implications of the bi-directional
self-attention layer, let's assume the following sentence is processed:
"*The house is beautiful and well located in the middle of the city
where it is easily accessible by public transport*". The word "it"
refers to "house", which is 12 "positions away". In
transformer-based encoders, the bi-directional self-attention layer
performs a single mathematical operation to put the input vector of
"house" into relation with the input vector of "it" (compare to the
first illustration of this section). In contrast, in an RNN-based
encoder, a word that is 12 "positions away", would require at least 12
mathematical operations meaning that in an RNN-based encoder a linear
number of mathematical operations are required. This makes it much
harder for an RNN-based encoder to model long-range contextual
representations. Also, it becomes clear that a transformer-based encoder
is much less prone to lose important information than an RNN-based
encoder-decoder model because the sequence length of the encoding is
kept the same, *i.e.* $\textbf{len}(\mathbf{X}_{1:n}) = \textbf{len}(\mathbf{\overline{X}}_{1:n}) = n$,
while an RNN compresses the length from $*\textbf{len}((\mathbf{X}_{1:n}) = n$ to just $\textbf{len}(\mathbf{c}) = 1$, which makes it very difficult for RNNs
to effectively encode long-range dependencies between input words.

In addition to making long-range dependencies more easily learnable, we can see that the Transformer architecture is able to process text in parallel.Mathematically, this can easily be shown by writing the self-attention formula as a product of query, key, and value matrices:

$\mathbf{X''}_{1:n} = \mathbf{V}_{1:n} \text{Softmax}(\mathbf{Q}_{1:n}^\intercal \mathbf{K}_{1:n}) + \mathbf{X'}_{1:n}.$

The output $\mathbf{X''}_{1:n} = \mathbf{x''}_1, \ldots, \mathbf{x''}_n$ is computed via a series of matrix multiplications and a softmax operation, which can be parallelized effectively. Note, that in an RNN-based encoder model, the computation of the hidden state $\mathbf{c}$ has to be done sequentially: Compute hidden state of the first input vector $\mathbf{x}_1$, then compute the hidden state of the second input vector that depends on the hidden state of the first hidden vector, etc. The sequential nature of RNNs prevents effective parallelization and makes them much more inefficient compared to transformer-based encoder models on modern GPU hardware.

Great, now we should have a better understanding of a) how transformer-based encoder models effectively model long-range contextual representations and b) how they efficiently process long sequences of input vectors.

Now, let's code up a short example of the encoder part of our
`MarianMT`

encoder-decoder models to verify that the explained theory
holds in practice.

${}^1$ An in-detail explanation of the role the feed-forward layers play
in transformer-based models is out-of-scope for this notebook. It is
argued in Yun et. al, (2017)
that feed-forward layers are crucial to map each contextual vector $\mathbf{x'}_i$ individually to the desired output space, which the
*self-attention* layer does not manage to do on its own. It should be
noted here, that each output token $\mathbf{x'}$ is processed by the
same feed-forward layer. For more detail, the reader is advised to read
the paper.

${}^2$ However, the EOS input vector does not have to be appended to the
input sequence, but has been shown to improve performance in many cases.
In contrast to the *0th* $\text{BOS}$ target vector of the
transformer-based decoder is required as a starting input vector to
predict a first target vector.

```
from transformers import MarianMTModel, MarianTokenizer
import torch
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
embeddings = model.get_input_embeddings()
# create ids of encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
# pass input_ids to encoder
encoder_hidden_states = model.base_model.encoder(input_ids, return_dict=True).last_hidden_state
# change the input slightly and pass to encoder
input_ids_perturbed = tokenizer("I want to buy a house", return_tensors="pt").input_ids
encoder_hidden_states_perturbed = model.base_model.encoder(input_ids_perturbed, return_dict=True).last_hidden_state
# compare shape and encoding of first vector
print(f"Length of input embeddings {embeddings(input_ids).shape[1]}. Length of encoder_hidden_states {encoder_hidden_states.shape[1]}")
# compare values of word embedding of "I" for input_ids and perturbed input_ids
print("Is encoding for `I` equal to its perturbed version?: ", torch.allclose(encoder_hidden_states[0, 0], encoder_hidden_states_perturbed[0, 0], atol=1e-3))
```

*Outputs:*

```
Length of input embeddings 7. Length of encoder_hidden_states 7
Is encoding for `I` equal to its perturbed version?: False
```

We compare the length of the input word embeddings, *i.e.*
`embeddings(input_ids)`

corresponding to $\mathbf{X}_{1:n}$, with the
length of the `encoder_hidden_states`

, corresponding to $\mathbf{\overline{X}}_{1:n}$. Also, we have forwarded the word sequence
"I want to buy a car" and a slightly perturbated version "I want to
buy a house" through the encoder to check if the first output encoding,
corresponding to "I", differs when only the last word is changed in
the input sequence.

As expected the output length of the input word embeddings and encoder
output encodings, *i.e.* $\textbf{len}(\mathbf{X}_{1:n})$ and $\textbf{len}(\mathbf{\overline{X}}_{1:n})$, is equal. Second, it can be
noted that the values of the encoded output vector of $\mathbf{\overline{x}}_1 = \text{"I"}$ are different when the last word
is changed from "car" to "house". This however should not come as a
surprise if one has understood bi-directional self-attention.

On a side-note, *autoencoding* models, such as BERT, have the exact same
architecture as *transformer-based* encoder models. *Autoencoding*
models leverage this architecture for massive self-supervised
pre-training on open-domain text data so that they can map any word
sequence to a deep bi-directional representation. In Devlin et al.
(2018), the authors show that a
pre-trained BERT model with a single task-specific classification layer
on top can achieve SOTA results on eleven NLP tasks. All *autoencoding*
models of 🤗Transformers can be found
here.

##
**Decoder**

As mentioned in the *Encoder-Decoder* section, the *transformer-based*
decoder defines the conditional probability distribution of a target
sequence given the contextualized encoding sequence:

$p_{\theta_{dec}}(\mathbf{Y}_{1: m} | \mathbf{\overline{X}}_{1:n}),$

which by Bayes' rule can be decomposed into a product of conditional distributions of the next target vector given the contextualized encoding sequence and all previous target vectors:

$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}).$

Let's first understand how the transformer-based decoder defines a
probability distribution. The transformer-based decoder is a stack of
*decoder blocks* followed by a dense layer, the "LM head". The stack
of decoder blocks maps the contextualized encoding sequence $\mathbf{\overline{X}}_{1:n}$ and a target vector sequence prepended by
the $\text{BOS}$ vector and cut to the last target vector, *i.e.* $\mathbf{Y}_{0:i-1}$, to an encoded sequence of target vectors $\mathbf{\overline{Y}}_{0: i-1}$. Then, the "LM head" maps the encoded
sequence of target vectors $\mathbf{\overline{Y}}_{0: i-1}$ to a
sequence of logit vectors $\mathbf{L}_{1:n} = \mathbf{l}_1, \ldots, \mathbf{l}_n$, whereas the
dimensionality of each logit vector $\mathbf{l}_i$ corresponds to the
size of the vocabulary. This way, for each $i \in \{1, \ldots, n\}$ a
probability distribution over the whole vocabulary can be obtained by
applying a softmax operation on $\mathbf{l}_i$. These distributions
define the conditional distribution:

$p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}), \forall i \in \{1, \ldots, n\},$

respectively. The "LM head" is often tied to the transpose of the word
embedding matrix, *i.e.* $\mathbf{W}_{\text{emb}}^{\intercal} = \left[\mathbf{y}^1, \ldots, \mathbf{y}^{\text{vocab}}\right]^{\intercal}$ ${}^1$. Intuitively this means that for all $i \in \{0, \ldots, n - 1\}$
the "LM Head" layer compares the encoded output vector $\mathbf{\overline{y}}_i$ to all word embeddings in the vocabulary $\mathbf{y}^1, \ldots, \mathbf{y}^{\text{vocab}}$ so that the logit
vector $\mathbf{l}_{i+1}$ represents the similarity scores between the
encoded output vector and each word embedding. The softmax operation
simply transformers the similarity scores to a probability distribution.
For each $i \in \{1, \ldots, n\}$, the following equations hold:

$p_{\theta_{dec}}(\mathbf{y} | \mathbf{\overline{X}}_{1:n}, \mathbf{Y}_{0:i-1})$ $= \text{Softmax}(f_{\theta_{\text{dec}}}(\mathbf{\overline{X}}_{1:n}, \mathbf{Y}_{0:i-1}))$ $= \text{Softmax}(\mathbf{W}_{\text{emb}}^{\intercal} \mathbf{\overline{y}}_{i-1})$ $= \text{Softmax}(\mathbf{l}_i).$

Putting it all together, in order to model the conditional distribution
of a target vector sequence $\mathbf{Y}_{1: m}$, the target vectors $\mathbf{Y}_{1:m-1}$ prepended by the special $\text{BOS}$ vector,
*i.e.* $\mathbf{y}_0$, are first mapped together with the contextualized
encoding sequence $\mathbf{\overline{X}}_{1:n}$ to the logit vector
sequence $\mathbf{L}_{1:m}$. Consequently, each logit target vector $\mathbf{l}_i$ is transformed into a conditional probability
distribution of the target vector $\mathbf{y}_i$ using the softmax
operation. Finally, the conditional probabilities of all target vectors $\mathbf{y}_1, \ldots, \mathbf{y}_m$ multiplied together to yield the
conditional probability of the complete target vector sequence:

$p_{\theta_{dec}}(\mathbf{Y}_{1:m} | \mathbf{\overline{X}}_{1:n}) = \prod_{i=1}^{m} p_{\theta_{dec}}(\mathbf{y}_i | \mathbf{Y}_{0: i-1}, \mathbf{\overline{X}}_{1:n}).$

In contrast to transformer-based encoders, in transformer-based
decoders, the encoded output vector $\mathbf{\overline{y}}_i$ should be
a good representation of the *next* target vector $\mathbf{y}_{i+1}$ and
not of the input vector itself. Additionally, the encoded output vector $\mathbf{\overline{y}}_i$ should be conditioned on all contextualized
encoding sequence $\mathbf{\overline{X}}_{1:n}$. To meet these
requirements each decoder block consists of a **uni-directional**
self-attention layer, followed by a **cross-attention** layer and two
feed-forward layers ${}^2$. The uni-directional self-attention layer
puts each of its input vectors $\mathbf{y'}_j$ only into relation with
all previous input vectors $\mathbf{y'}_i, \text{ with } i \le j$ for
all $j \in \{1, \ldots, n\}$ to model the probability distribution of
the next target vectors. The cross-attention layer puts each of its
input vectors $\mathbf{y''}_j$ into relation with all contextualized
encoding vectors $\mathbf{\overline{X}}_{1:n}$ to condition the
probability distribution of the next target vectors on the input of the
encoder as well.

Alright, let's visualize the *transformer-based* decoder for our
English to German translation example.

We can see that the decoder maps the input $\mathbf{Y}_{0:5}$ "BOS",
"Ich", "will", "ein", "Auto", "kaufen" (shown in light red)
together with the contextualized sequence of "I", "want", "to",
"buy", "a", "car", "EOS", *i.e.* $\mathbf{\overline{X}}_{1:7}$
(shown in dark green) to the logit vectors $\mathbf{L}_{1:6}$ (shown in
dark red).

Applying a softmax operation on each $\mathbf{l}_1, \mathbf{l}_2, \ldots, \mathbf{l}_5$ can thus define the conditional probability distributions:

$p_{\theta_{dec}}(\mathbf{y} | \text{BOS}, \mathbf{\overline{X}}_{1:7}),$ $p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich}, \mathbf{\overline{X}}_{1:7}),$ $\ldots,$ $p_{\theta_{dec}}(\mathbf{y} | \text{BOS Ich will ein Auto kaufen}, \mathbf{\overline{X}}_{1:7}).$

The overall conditional probability of:

$p_{\theta_{dec}}(\text{Ich will ein Auto kaufen EOS} | \mathbf{\overline{X}}_{1:n})$

can therefore be computed as the following product:

$p_{\theta_{dec}}(\text{Ich} | \text{BOS}, \mathbf{\overline{X}}_{1:7}) \times \ldots \times p_{\theta_{dec}}(\text{EOS} | \text{BOS Ich will ein Auto kaufen}, \mathbf{\overline{X}}_{1:7}).$

The red box on the right shows a decoder block for the first three target vectors $\mathbf{y}_0, \mathbf{y}_1, \mathbf{y}_2$. In the lower part, the uni-directional self-attention mechanism is illustrated and in the middle, the cross-attention mechanism is illustrated. Let's first focus on uni-directional self-attention.

As in bi-directional self-attention, in uni-directional self-attention,
the query vectors $\mathbf{q}_0, \ldots, \mathbf{q}_{m-1}$ (shown in
purple below), key vectors $\mathbf{k}_0, \ldots, \mathbf{k}_{m-1}$
(shown in orange below), and value vectors $\mathbf{v}_0, \ldots, \mathbf{v}_{m-1}$ (shown in blue below) are
projected from their respective input vectors $\mathbf{y'}_0, \ldots, \mathbf{y'}_{m-1}$ (shown in light red below).
However, in uni-directional self-attention, each query vector $\mathbf{q}_i$ is compared *only* to its respective key vector and all
previous ones, namely $\mathbf{k}_0, \ldots, \mathbf{k}_i$ to yield the
respective *attention weights*. This prevents an output vector $\mathbf{y''}_j$ (shown in dark red below) to include any information
about the following input vector $\mathbf{y}_i, \text{ with } i > j$ for
all $j \in \{0, \ldots, m - 1 \}$. As is the case in bi-directional
self-attention, the attention weights are then multiplied by their
respective value vectors and summed together.

We can summarize uni-directional self-attention as follows:

$\mathbf{y''}_i = \mathbf{V}_{0: i} \textbf{Softmax}(\mathbf{K}_{0: i}^\intercal \mathbf{q}_i) + \mathbf{y'}_i.$

Note that the index range of the key and value vectors is $0:i$ instead of $0: m-1$ which would be the range of the key vectors in bi-directional self-attention.

Let's illustrate uni-directional self-attention for the input vector $\mathbf{y'}_1$ for our example above.

As can be seen $\mathbf{y''}_1$ only depends on $\mathbf{y'}_0$ and $\mathbf{y'}_1$. Therefore, we put the vector representation of the word
"Ich", *i.e.* $\mathbf{y'}_1$ only into relation with itself and the
"BOS" target vector, *i.e.* $\mathbf{y'}_0$, but **not** with the
vector representation of the word "will", *i.e.* $\mathbf{y'}_2$.

So why is it important that we use uni-directional self-attention in the
decoder instead of bi-directional self-attention? As stated above, a
transformer-based decoder defines a mapping from a sequence of input
vector $\mathbf{Y}_{0: m-1}$ to the logits corresponding to the **next**
decoder input vectors, namely $\mathbf{L}_{1:m}$. In our example, this
means, *e.g.* that the input vector $\mathbf{y}_1$ = "Ich" is mapped
to the logit vector $\mathbf{l}_2$, which is then used to predict the
input vector $\mathbf{y}_2$. Thus, if $\mathbf{y'}_1$ would have access
to the following input vectors $\mathbf{Y'}_{2:5}$, the decoder would
simply copy the vector representation of "will", *i.e.* $\mathbf{y'}_2$, to be its output $\mathbf{y''}_1$. This would be
forwarded to the last layer so that the encoded output vector $\mathbf{\overline{y}}_1$ would essentially just correspond to the
vector representation $\mathbf{y}_2$.

This is obviously disadvantageous as the transformer-based decoder would
never learn to predict the next word given all previous words, but just
copy the target vector $\mathbf{y}_i$ through the network to $\mathbf{\overline{y}}_{i-1}$ for all $i \in \{1, \ldots, m \}$. In
order to define a conditional distribution of the next target vector,
the distribution cannot be conditioned on the next target vector itself.
It does not make much sense to predict $\mathbf{y}_i$ from $p(\mathbf{y} | \mathbf{Y}_{0:i}, \mathbf{\overline{X}})$ because the
distribution is conditioned on the target vector it is supposed to
model. The uni-directional self-attention architecture, therefore,
allows us to define a *causal* probability distribution, which is
necessary to effectively model a conditional distribution of the next
target vector.

Great! Now we can move to the layer that connects the encoder and
decoder - the *cross-attention* mechanism!

The cross-attention layer takes two vector sequences as inputs: the
outputs of the uni-directional self-attention layer, *i.e.* $\mathbf{Y''}_{0: m-1}$ and the contextualized encoding vectors $\mathbf{\overline{X}}_{1:n}$. As in the self-attention layer, the query
vectors $\mathbf{q}_0, \ldots, \mathbf{q}_{m-1}$ are projections of the
output vectors of the previous layer, *i.e.* $\mathbf{Y''}_{0: m-1}$.
However, the key and value vectors $\mathbf{k}_0, \ldots, \mathbf{k}_{m-1}$ and $\mathbf{v}_0, \ldots, \mathbf{v}_{m-1}$ are projections of the
contextualized encoding vectors $\mathbf{\overline{X}}_{1:n}$. Having
defined key, value, and query vectors, a query vector $\mathbf{q}_i$ is
then compared to *all* key vectors and the corresponding score is used
to weight the respective value vectors, just as is the case for
*bi-directional* self-attention to give the output vector $\mathbf{y'''}_i$ for all $i \in {0, \ldots, m-1}$. Cross-attention
can be summarized as follows:

$\mathbf{y'''}_i = \mathbf{V}_{1:n} \textbf{Softmax}(\mathbf{K}_{1: n}^\intercal \mathbf{q}_i) + \mathbf{y''}_i.$

Note that the index range of the key and value vectors is $1:n$ corresponding to the number of contextualized encoding vectors.

Let's visualize the cross-attention mechanism Let's for the input vector $\mathbf{y''}_1$ for our example above.

We can see that the query vector $\mathbf{q}_1$ (shown in purple) is derived from $\mathbf{y''}_1$(shown in red) and therefore relies on a vector representation of the word "Ich". The query vector $\mathbf{q}_1$ is then compared to the key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_7$ (shown in yellow) corresponding to the contextual encoding representation of all encoder input vectors $\mathbf{X}_{1:n}$ = "I want to buy a car EOS". This puts the vector representation of "Ich" into direct relation with all encoder input vectors. Finally, the attention weights are multiplied by the value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_7$ (shown in turquoise) to yield in addition to the input vector $\mathbf{y''}_1$ the output vector $\mathbf{y'''}_1$ (shown in dark red).

So intuitively, what happens here exactly? Each output vector $\mathbf{y'''}_i$ is a weighted sum of all value projections of the
encoder inputs $\mathbf{v}_{1}, \ldots, \mathbf{v}_7$ plus the input
vector itself $\mathbf{y''}_i$ (*c.f.* illustrated formula above). The key
mechanism to understand is the following: Depending on how similar a
query projection of the *input decoder vector* $\mathbf{q}_i$ is to a
key projection of the *encoder input vector* $\mathbf{k}_j$, the more
important is the value projection of the encoder input vector $\mathbf{v}_j$. In loose terms this means, the more "related" a
decoder input representation is to an encoder input representation, the
more does the input representation influence the decoder output
representation.

Cool! Now we can see how this architecture nicely conditions each output
vector $\mathbf{y'''}_i$ on the interaction between the encoder input
vectors $\mathbf{\overline{X}}_{1:n}$ and the input vector $\mathbf{y''}_i$. Another important observation at this point is that
the architecture is completely independent of the number $n$ of
contextualized encoding vectors $\mathbf{\overline{X}}_{1:n}$ on which
the output vector $\mathbf{y'''}_i$ is conditioned on. All projection
matrices $\mathbf{W}^{\text{cross}}_{k}$ and $\mathbf{W}^{\text{cross}}_{v}$ to derive the key vectors $\mathbf{k}_1, \ldots, \mathbf{k}_n$ and the value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ respectively are shared across all
positions $1, \ldots, n$ and all value vectors $\mathbf{v}_1, \ldots, \mathbf{v}_n$ are summed together to a single
weighted averaged vector. Now it becomes obvious as well, why the
transformer-based decoder does not suffer from the long-range dependency
problem, the RNN-based decoder suffers from. Because each decoder logit
vector is *directly* dependent on every single encoded output vector,
the number of mathematical operations to compare the first encoded
output vector and the last decoder logit vector amounts essentially to
just one.

To conclude, the uni-directional self-attention layer is responsible for conditioning each output vector on all previous decoder input vectors and the current input vector and the cross-attention layer is responsible to further condition each output vector on all encoded input vectors.

To verify our theoretical understanding, let's continue our code example from the encoder section above.

${}^1$ The word embedding matrix $\mathbf{W}_{\text{emb}}$ gives each
input word a unique *context-independent* vector representation. This
matrix is often fixed as the "LM Head" layer. However, the "LM Head"
layer can very well consist of a completely independent "encoded
vector-to-logit" weight mapping.

${}^2$ Again, an in-detail explanation of the role the feed-forward
layers play in transformer-based models is out-of-scope for this
notebook. It is argued in Yun et. al,
(2017) that feed-forward layers
are crucial to map each contextual vector $\mathbf{x'}_i$ individually
to the desired output space, which the *self-attention* layer does not
manage to do on its own. It should be noted here, that each output token $\mathbf{x'}$ is processed by the same feed-forward layer. For more
detail, the reader is advised to read the paper.

```
from transformers import MarianMTModel, MarianTokenizer
import torch
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
embeddings = model.get_input_embeddings()
# get encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
# create ids of encoded input vectors
decoder_input_ids = tokenizer("<pad> Ich will ein", return_tensors="pt", add_special_tokens=False).input_ids
# pass decoder input_ids and encoded input vectors to decoder
decoder_output_vectors = model.base_model.decoder(decoder_input_ids).last_hidden_state
# derive embeddings by multiplying decoder outputs with embedding weights
lm_logits = torch.nn.functional.linear(decoder_output_vectors, embeddings.weight, bias=model.final_logits_bias)
# change the decoder input slightly
decoder_input_ids_perturbed = tokenizer("<pad> Ich will das", return_tensors="pt", add_special_tokens=False).input_ids
decoder_output_vectors_perturbed = model.base_model.decoder(decoder_input_ids_perturbed).last_hidden_state
lm_logits_perturbed = torch.nn.functional.linear(decoder_output_vectors_perturbed, embeddings.weight, bias=model.final_logits_bias)
# compare shape and encoding of first vector
print(f"Shape of decoder input vectors {embeddings(decoder_input_ids).shape}. Shape of decoder logits {lm_logits.shape}")
# compare values of word embedding of "I" for input_ids and perturbed input_ids
print("Is encoding for `Ich` equal to its perturbed version?: ", torch.allclose(lm_logits[0, 0], lm_logits_perturbed[0, 0], atol=1e-3))
```

*Output:*

```
Shape of decoder input vectors torch.Size([1, 5, 512]). Shape of decoder logits torch.Size([1, 5, 58101])
Is encoding for `Ich` equal to its perturbed version?: True
```

We compare the output shape of the decoder input word embeddings, *i.e.*
`embeddings(decoder_input_ids)`

(corresponds to $\mathbf{Y}_{0: 4}$,
here `<pad>`

corresponds to BOS and "Ich will das" is tokenized to 4
tokens) with the dimensionality of the `lm_logits`

(corresponds to $\mathbf{L}_{1:5}$). Also, we have passed the word sequence
"`<pad>`

Ich will das" and a slightly perturbated version
"`<pad>`

Ich will das" together with the
`encoder_output_vectors`

through the encoder to check if the second
`lm_logit`

, corresponding to "Ich", differs when only the last word is
changed in the input sequence ("ein" -> "das").

As expected the output shapes of the decoder input word embeddings and
lm_logits, *i.e.* the dimensionality of $\mathbf{Y}_{0: 4}$ and $\mathbf{L}_{1:5}$ are different in the last dimension. While the
sequence length is the same (=5), the dimensionality of a decoder input
word embedding corresponds to `model.config.hidden_size`

, whereas the
dimensionality of a `lm_logit`

corresponds to the vocabulary size
`model.config.vocab_size`

, as explained above. Second, it can be noted
that the values of the encoded output vector of $\mathbf{l}_1 = \text{"Ich"}$ are the same when the last word is changed
from "ein" to "das". This however should not come as a surprise if
one has understood uni-directional self-attention.

On a final side-note, *auto-regressive* models, such as GPT2, have the
same architecture as *transformer-based* decoder models **if** one
removes the cross-attention layer because stand-alone auto-regressive
models are not conditioned on any encoder outputs. So auto-regressive
models are essentially the same as *auto-encoding* models but replace
bi-directional attention with uni-directional attention. These models
can also be pre-trained on massive open-domain text data to show
impressive performances on natural language generation (NLG) tasks. In
Radford et al.
(2019),
the authors show that a pre-trained GPT2 model can achieve SOTA or close
to SOTA results on a variety of NLG tasks without much fine-tuning. All
*auto-regressive* models of 🤗Transformers can be found
here.

Alright, that's it! Now, you should have gotten a good understanding of
*transformer-based* encoder-decoder models and how to use them with the
🤗Transformers library.

Thanks a lot to Victor Sanh, Sasha Rush, Sam Shleifer, Oliver Åstrand, Ted Moskovitz and Kristian Kyvik for giving valuable feedback.

##
**Appendix**

As mentioned above, the following code snippet shows how one can program
a simple generation method for *transformer-based* encoder-decoder
models. Here, we implement a simple *greedy* decoding method using
`torch.argmax`

to sample the target vector.

```
from transformers import MarianMTModel, MarianTokenizer
import torch
tokenizer = MarianTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-de")
model = MarianMTModel.from_pretrained("Helsinki-NLP/opus-mt-en-de")
# create ids of encoded input vectors
input_ids = tokenizer("I want to buy a car", return_tensors="pt").input_ids
# create BOS token
decoder_input_ids = tokenizer("<pad>", add_special_tokens=False, return_tensors="pt").input_ids
assert decoder_input_ids[0, 0].item() == model.config.decoder_start_token_id, "`decoder_input_ids` should correspond to `model.config.decoder_start_token_id`"
# STEP 1
# pass input_ids to encoder and to decoder and pass BOS token to decoder to retrieve first logit
outputs = model(input_ids, decoder_input_ids=decoder_input_ids, return_dict=True)
# get encoded sequence
encoded_sequence = (outputs.encoder_last_hidden_state,)
# get logits
lm_logits = outputs.logits
# sample last token with highest prob
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
# concat
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)
# STEP 2
# reuse encoded_inputs and pass BOS + "Ich" to decoder to second logit
lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits
# sample last token with highest prob again
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
# concat again
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)
# STEP 3
lm_logits = model(None, encoder_outputs=encoded_sequence, decoder_input_ids=decoder_input_ids, return_dict=True).logits
next_decoder_input_ids = torch.argmax(lm_logits[:, -1:], axis=-1)
decoder_input_ids = torch.cat([decoder_input_ids, next_decoder_input_ids], axis=-1)
# let's see what we have generated so far!
print(f"Generated so far: {tokenizer.decode(decoder_input_ids[0], skip_special_tokens=True)}")
# This can be written in a loop as well.
```

*Outputs:*

```
Generated so far: Ich Ich
```

In this code example, we show exactly what was described earlier. We
pass an input "I want to buy a car" together with the $\text{BOS}$
token to the encoder-decoder model and sample from the first logit $\mathbf{l}_1$ (*i.e.* the first `lm_logits`

line). Hereby, our sampling
strategy is simple to greedily choose the next decoder input vector that
has the highest probability. In an auto-regressive fashion, we then pass
the sampled decoder input vector together with the previous inputs to
the encoder-decoder model and sample again. We repeat this a third time.
As a result, the model has generated the words "Ich Ich". The first
word is spot-on! The second word is not that great. We can see here,
that a good decoding method is key for a successful sequence generation
from a given model distribution.

In practice, more complicated decoding methods are used to sample the
`lm_logits`

. Most of which are covered in
this blog post.