Figure 2.3. Illustration of the main attention mechanisms in a Transformer (one panel highlights their quadratic complexity).
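As a point of reference for the mechanisms in Figure 2.3, the following is a minimal NumPy sketch of scaled dot-product self-attention; the function name and shapes are illustrative, not taken from any particular implementation. The (n, n) score matrix is what makes standard attention quadratic in sequence length.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V, with Q, K, V of
    # shape (n, d) for a sequence of n tokens. The (n, n) score matrix
    # below is the source of the quadratic complexity.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

# Toy usage: self-attention over 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)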
embedding with a fixed frequency and phase, f(x) = sin(ωx + φ), where ω is the frequency and φ is the phase; both are learned as part of the training process and are typically shared across all tokens in the sequence. Position information can be integrated into Transformers in several ways, of which [105, Table 1] gives an overview.
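To make f(x) = sin(ωx + φ) concrete, the following is a minimal NumPy sketch of such a position embedding; treating ω and φ as one learned value per embedding dimension is an assumption made here for illustration.

import numpy as np

def sinusoidal_position_embedding(num_positions, omega, phi):
    # f(x) = sin(omega * x + phi) evaluated at every position x.
    # omega, phi: per-dimension frequency and phase, shape (dim,);
    # in the learned variant described above they are trainable
    # parameters shared across all token positions.
    positions = np.arange(num_positions)[:, None]    # (num_positions, 1)
    return np.sin(positions * omega + phi)           # (num_positions, dim)

# Toy usage: 16 positions, 8 dimensions, with random stand-ins for the
# learned frequencies and phases.
rng = np.random.default_rng(0)
omega = rng.uniform(0.01, 1.0, size=8)
phi = rng.uniform(0.0, 2 * np.pi, size=8)
pe = sinusoidal_position_embedding(16, omega, phi)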
Transformers have gradually taken over as an end-to-end architecture for both NLP and CV tasks, although adoption in CV has been slower due to the original Transformer architecture's lack of spatial inductive biases. This has been addressed by recent works such as the Vision Transformer (ViT) [101], which uses a patch-based input representation with position embeddings.
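The following is a minimal NumPy sketch of that patchification step (ViT additionally applies a learned linear projection to each flattened patch, omitted here); the function name and shapes are illustrative.

import numpy as np

def image_to_patches(image, patch_size):
    # Split an (H, W, C) image into flattened non-overlapping patches,
    # returning an array of shape (num_patches, patch_size**2 * C).
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image must divide into patches"
    patches = image.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

# Toy usage: a 32x32 RGB image cut into 16x16 patches -> 4 patch tokens,
# to which (learned, here zero) position embeddings are added.
image = np.zeros((32, 32, 3))
tokens = image_to_patches(image, 16)            # (4, 768)
position_embeddings = np.zeros_like(tokens)     # learned in practice
tokens = tokens + position_embeddings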
A large language model (LLM) consists of a stack of Transformer layers that is pretrained on a large corpus of text, typically using a self-supervised learning objective such as predicting the next token in a sequence (sketched below).
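The following is a minimal NumPy sketch of that next-token objective as a cross-entropy loss; shapes and names are illustrative assumptions.

import numpy as np

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab) model outputs; token_ids: (seq_len,) inputs.
    # Position t is trained to predict token t + 1, so the targets are
    # the input tokens shifted left by one.
    targets = token_ids[1:]
    logits = logits[:-1]
    logits = logits - logits.max(axis=-1, keepdims=True)     # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: random stand-in logits over a vocabulary of 10 tokens.
rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(5, 10)), np.array([3, 1, 4, 1, 5]))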
The goal of LLMs is to learn a general-purpose language representation that can be fine-tuned to perform well on a wide range of downstream tasks. LLMs have disrupted NLP in recent years, achieving SOTA performance on a wide range of tasks thanks to pretraining on large amounts of data. Popular LLMs include BERT [95], RoBERTa [287], ELECTRA [73], T5 [383], GPT-3 [52], Llama-2 [452], and Mistral [199]. Beyond the challenges specific to modeling document inputs, explained in Section 2.3.4, open challenges for LLMs include: (i) structured output generation, (ii) domain-specific knowledge injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]), and (iii) multimodality.
Vision-language models (VLMs) are a recent development in multimodal learning; they combine the power of LLMs with vision encoders to perform tasks that require understanding both visual and textual information. Popular VLMs include CLIP [381], UNITER [70], FLAVA [423], and GPT-4 [344].
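To illustrate how a VLM such as CLIP [381] aligns the two modalities, the following is a minimal NumPy sketch of a symmetric contrastive objective over matched image/text pairs; the temperature value and all shapes are illustrative assumptions, not CLIP's exact configuration.

import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb matches row i of text_emb. Embeddings are
    # L2-normalized so the (batch, batch) logits are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Softmax cross-entropy with the matching pairs on the diagonal.
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(len(l)), np.arange(len(l))].mean()

    # Image-to-text and text-to-image directions are averaged.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Toy usage: a batch of 4 image/text pairs with 32-dimensional embeddings.
rng = np.random.default_rng(0)
loss = contrastive_alignment_loss(rng.normal(size=(4, 32)),
                                  rng.normal(size=(4, 32)))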
In every chapter of this dissertation we have used Transformers, either as part