Figure 2.3. Illustration of the main attention mechanisms in a Transformer (one panel highlights their quadratic complexity).
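As a point of reference for the mechanisms in Figure 2.3, the following is a minimal NumPy sketch of scaled dot-product self-attention; the function name and shapes are illustrative, not taken from any particular implementation. The (n, n) score matrix is what makes standard attention quadratic in sequence length.

import numpy as np

def scaled_dot_product_attention(Q, K, V):
    # Standard attention: softmax(Q K^T / sqrt(d)) V, with Q, K, V of
    # shape (n, d) for a sequence of n tokens. The (n, n) score matrix
    # below is the source of the quadratic complexity.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # (n, n)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d)

# Toy usage: self-attention over 4 tokens with 8-dimensional embeddings.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(x, x, x)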
embedding with a fixed frequency and phase, f(x) = sin(ωx + φ), where ω is the frequency and φ is the phase; both are learned as part of the training process and are typically shared across all tokens in the sequence. Position information can be integrated into Transformers in several ways, of which [105, Table 1] gives an overview.
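To make f(x) = sin(ωx + φ) concrete, the following is a minimal NumPy sketch of such a position embedding; treating ω and φ as one learned value per embedding dimension is an assumption made here for illustration.

import numpy as np

def sinusoidal_position_embedding(num_positions, omega, phi):
    # f(x) = sin(omega * x + phi) evaluated at every position x.
    # omega, phi: per-dimension frequency and phase, shape (dim,);
    # in the learned variant described above they are trainable
    # parameters shared across all token positions.
    positions = np.arange(num_positions)[:, None]    # (num_positions, 1)
    return np.sin(positions * omega + phi)           # (num_positions, dim)

# Toy usage: 16 positions, 8 dimensions, with random stand-ins for the
# learned frequencies and phases.
rng = np.random.default_rng(0)
omega = rng.uniform(0.01, 1.0, size=8)
phi = rng.uniform(0.0, 2 * np.pi, size=8)
pe = sinusoidal_position_embedding(16, omega, phi)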
Transformers have gradually taken over as an end-to-end architecture for both NLP and CV tasks, although adoption in CV has been slower due to the original Transformer architecture's lack of spatial inductive biases. This has been addressed by recent works such as the Vision Transformer (ViT) [101], which uses a patch-based input representation with position embeddings.
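The following is a minimal NumPy sketch of that patchification step (ViT additionally applies a learned linear projection to each flattened patch, omitted here); the function name and shapes are illustrative.

import numpy as np

def image_to_patches(image, patch_size):
    # Split an (H, W, C) image into flattened non-overlapping patches,
    # returning an array of shape (num_patches, patch_size**2 * C).
    H, W, C = image.shape
    p = patch_size
    assert H % p == 0 and W % p == 0, "image must divide into patches"
    patches = image.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

# Toy usage: a 32x32 RGB image cut into 16x16 patches -> 4 patch tokens,
# to which (learned, here zero) position embeddings are added.
image = np.zeros((32, 32, 3))
tokens = image_to_patches(image, 16)            # (4, 768)
position_embeddings = np.zeros_like(tokens)     # learned in practice
tokens = tokens + position_embeddings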
A large language model (LLM) consists of a stack of Transformer layers that is pretrained on a large corpus of text, typically using a self-supervised learning objective such as predicting the next token in a sequence (sketched below).
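The following is a minimal NumPy sketch of that next-token objective as a cross-entropy loss; shapes and names are illustrative assumptions.

import numpy as np

def next_token_loss(logits, token_ids):
    # logits: (seq_len, vocab) model outputs; token_ids: (seq_len,) inputs.
    # Position t is trained to predict token t + 1, so the targets are
    # the input tokens shifted left by one.
    targets = token_ids[1:]
    logits = logits[:-1]
    logits = logits - logits.max(axis=-1, keepdims=True)     # stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: random stand-in logits over a vocabulary of 10 tokens.
rng = np.random.default_rng(0)
loss = next_token_loss(rng.normal(size=(5, 10)), np.array([3, 1, 4, 1, 5]))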
The goal of LLMs is to learn a general-purpose language representation that can be fine-tuned to perform well on a wide range of downstream tasks. LLMs have disrupted NLP in recent years, achieving SOTA performance on a wide range of tasks thanks to pretraining on large amounts of data. Popular LLMs include BERT [95], RoBERTa [287], ELECTRA [73], T5 [383], GPT-3 [52], Llama-2 [452], and Mistral [199]. Beyond the challenges specific to modeling document inputs, explained in Section 2.3.4, open challenges for LLMs include: (i) structured output generation, (ii) domain-specific knowledge injection (e.g., does retrieval-augmented generation (RAG) suffice? [253, 347]), and (iii) multimodality.
Vision-language models (VLMs) are a recent development in multimodal learning; they combine the power of LLMs with vision encoders to perform tasks that require understanding both visual and textual information. Popular VLMs include CLIP [381], UNITER [70], FLAVA [423], and GPT-4 [344].
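To illustrate how a VLM such as CLIP [381] aligns the two modalities, the following is a minimal NumPy sketch of a symmetric contrastive objective over matched image/text pairs; the temperature value and all shapes are illustrative assumptions, not CLIP's exact configuration.

import numpy as np

def contrastive_alignment_loss(image_emb, text_emb, temperature=0.07):
    # Row i of image_emb matches row i of text_emb. Embeddings are
    # L2-normalized so the (batch, batch) logits are cosine similarities.
    image_emb = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = image_emb @ text_emb.T / temperature

    def cross_entropy(l):
        # Softmax cross-entropy with the matching pairs on the diagonal.
        l = l - l.max(axis=-1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=-1, keepdims=True))
        return -log_probs[np.arange(len(l)), np.arange(len(l))].mean()

    # Image-to-text and text-to-image directions are averaged.
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2

# Toy usage: a batch of 4 image/text pairs with 32-dimensional embeddings.
rng = np.random.default_rng(0)
loss = contrastive_alignment_loss(rng.normal(size=(4, 32)),
                                  rng.normal(size=(4, 32)))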
In every chapter of this dissertation we have used Transformers, either as part