Equation (2.4) defines regularized empirical risk minimization (RERM), where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the trade-off between the empirical risk (denoted R̂) and the regularization term. All of these concepts will be revisited in the context of neural networks in Section 2.1.1, where we also discuss the optimization of the model parameters θ, how inference differs for probabilistic models that estimate uncertainty (Section 2.2.5), and how regularization affects confidence estimation and calibration (Section 2.2.4).

2.1.1 Neural Networks

An artificial neural network (NN) is a mathematical model inspired by data processing in the human brain [396]. It can be represented as a network topology of interconnected neurons organized in layers, which successively refine intermediate feature representations of the input [448] into representations useful for the task at hand, e.g., classifying an animal by means of its size, shape, and fur, or detecting the sentiment of a review by focusing on adjectives.

A basic NN building block is the linear layer, an affine function of the input: f(x) = Wx + b, where the bias term b is a constant vector that shifts the decision boundary away from the origin, and the weight matrix W, which holds most of the parameters, rotates the decision boundary in input space. Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) introduce non-linearity into the model, which is required for learning complex functions (see the sketch below).

The first deep learning (DL) network (stacking multiple such layers) dates back to 1965 [191], yet the term 'Deep Learning' was only coined in 1986 [398]. The first successful DL application was a demonstration of digit recognition in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent success of DL is attributed to the availability of large datasets, the increase in computational power, the development of new algorithms and architectures, and the commercial interest of large companies.

Consider a conventional DL architecture as a composition of parameterized functions. Each function consists of a configuration of layers (e.g., convolution, pooling, activation, normalization, embedding) that determines the type of input transformation (e.g., convolutional, recurrent, attention-based), with (trainable) parameters that are linear or non-linear w.r.t. the input x. Given the type of input, e.g., language, which is naturally discrete-sequential, or vision, which presents a
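To make the linear-layer building block and the idea of composing parameterized functions concrete, the following is a minimal NumPy sketch; the layer sizes, the choice of ReLU, and the two-layer composition are illustrative assumptions, not the architecture studied here.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, W, b):
    # Affine transformation f(x) = Wx + b: W rotates/combines input features,
    # b shifts the decision boundary away from the origin.
    return W @ x + b

def relu(z):
    # Element-wise non-linearity; without it, stacked linear layers would
    # collapse into a single linear map.
    return np.maximum(z, 0.0)

# Illustrative shapes: a 4-dimensional input, a hidden layer of size 3,
# and 2 output classes.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

# Composition of parameterized functions: logits = f2(relu(f1(x))).
hidden = relu(linear(x, W1, b1))
logits = linear(hidden, W2, b2)

# Softmax activation turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)
```

In practice the parameters W1, b1, W2, b2 would be trained by minimizing an objective such as the (regularized) empirical risk of Equation (2.4), rather than drawn at random as in this sketch.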