Equation (2.4) defines regularized empirical risk minimization (RERM), where Ψ(θ) is a regularization term and λ is a hyperparameter that controls the trade-off between the empirical risk (denoted R̂) and the regularization term. All of these concepts will be revisited in the context of neural networks in Section 2.1.1, where we also discuss the optimization of the model parameters θ, how inference differs for probabilistic models that estimate uncertainty (Section 2.2.5), and how regularization affects confidence estimation and calibration (Section 2.2.4).

2.1.1 Neural Networks

An artificial neural network (NN) is a mathematical model inspired by data processing in the human brain [396]. It can be represented as a network topology of interconnected neurons organized in layers, which successively refine intermediate feature representations of the input [448] into representations useful for the task at hand, e.g., classifying an animal by means of its size, shape, and fur, or detecting the sentiment of a review by focusing on adjectives.

A basic NN building block is the linear layer, an affine function of the input: f(x) = Wx + b, where the bias term b is a constant vector that shifts the decision boundary away from the origin, and the weight matrix W, which holds most of the parameters, rotates the decision boundary in input space. Activation functions (e.g., tanh, ReLU, sigmoid, softmax, GELU) introduce non-linearity into the model, which is required for learning complex functions (see the sketch below).

The first deep learning (DL) network (stacking multiple such layers) dates back to 1965 [191], yet the term 'Deep Learning' was only coined in 1986 [398]. The first successful DL application was a demonstration of digit recognition in 1998 [244], followed by DL for CV [90, 223] and NLP [76]. The recent success of DL is attributed to the availability of large datasets, the increase in computational power, the development of new algorithms and architectures, and the commercial interest of large companies.

Consider a conventional DL architecture as a composition of parameterized functions. Each function consists of a configuration of layers (e.g., convolution, pooling, activation, normalization, embedding) that determines the type of input transformation (e.g., convolutional, recurrent, attention-based), with (trainable) parameters that are linear or non-linear w.r.t. the input x. Given the type of input, e.g., language, which is naturally discrete-sequential, or vision, which presents a
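To make the linear-layer building block and the idea of composing parameterized functions concrete, the following is a minimal NumPy sketch; the layer sizes, the choice of ReLU, and the two-layer composition are illustrative assumptions, not the architecture studied here.

```python
import numpy as np

rng = np.random.default_rng(0)

def linear(x, W, b):
    # Affine transformation f(x) = Wx + b: W rotates/combines input features,
    # b shifts the decision boundary away from the origin.
    return W @ x + b

def relu(z):
    # Element-wise non-linearity; without it, stacked linear layers would
    # collapse into a single linear map.
    return np.maximum(z, 0.0)

# Illustrative shapes: a 4-dimensional input, a hidden layer of size 3,
# and 2 output classes.
x = rng.normal(size=4)
W1, b1 = rng.normal(size=(3, 4)), np.zeros(3)
W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)

# Composition of parameterized functions: logits = f2(relu(f1(x))).
hidden = relu(linear(x, W1, b1))
logits = linear(hidden, W2, b2)

# Softmax activation turns the logits into class probabilities.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
print(probs)
```

In practice the parameters W1, b1, W2, b2 would be trained by minimizing an objective such as the (regularized) empirical risk of Equation (2.4), rather than drawn at random as in this sketch.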