Sigmoid Function: $\sigma(z) = \frac{1}{1 + \exp(-z)}$

Softmax Function: $\mathrm{softmax}(z)_k = \frac{\exp(z_k)}{\sum_{j=1}^{K} \exp(z_j)}$

Table 2.1. Sigmoid and softmax activation functions for binary and multi-class classification, respectively.

…ready continuous-spatial signal, different DL architectures have been established, which will be discussed in Section 2.1.3.

A $K$-class classification function realized by an $l$-layer NN with $d$-dimensional input $x \in \mathbb{R}^d$ is written in shorthand as $f_\theta : \mathbb{R}^d \to \mathbb{R}^K$, with $\theta = \{\theta_j\}_{j=1}^{l}$ assumed to be optimized, either partially or fully, using backpropagation and a loss function. More specifically, training presents a non-convex optimization problem, with multiple feasible regions, each containing multiple locally optimal points. With maximum-likelihood estimation (MLE), the goal is to find the optimal parameters or weights that minimize the loss function, effectively interpolating the training data. This process involves traversing the high-dimensional loss landscape. Upon convergence of model training, the optimized parameters form a solution in weight space, representing a unique mode (a specific function $f_{\hat{\theta}}$). However, when regularization techniques such as weight decay, dropout, or early stopping are applied, the objective shifts towards maximum a posteriori (MAP) estimation, which takes into account the prior probability of the parameters; for instance, an $\ell_2$ weight-decay penalty corresponds to a Gaussian prior over the weights. This difference in parameter estimation forms the basis for several uncertainty estimation methods, covered in Section 2.2.5.

A prediction is obtained by applying a standard decision rule to the model's output, e.g., taking the top-1/k prediction (Equation (2.5)), or decoding structured output with a function that maximizes total likelihood, optionally subject to additional diversity criteria.

$\hat{y} = \operatorname{argmax} f_{\hat{\theta}}(x)$ (2.5)

Considering standard NNs, the last layer outputs a vector of real-valued logits $z \in \mathbb{R}^K$, which in turn is normalized to a probability distribution over $K$ classes using a sigmoid or softmax function (Table 2.1).

2.1.2 Probabilistic Evaluation

The majority of our works involve supervised learning with NNs, formulated generically as a probabilistic predictor in Definition 1.
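To make the mapping from logits to a prediction concrete, the following is a minimal sketch, assuming NumPy; the function names (`softmax`, `predict_top1`) and the example logits are hypothetical illustrations, not part of the thesis's notation. It normalizes a logit vector with the softmax of Table 2.1 and applies the top-1 decision rule of Equation (2.5).

```python
import numpy as np

def softmax(z):
    # Softmax as in Table 2.1; subtracting the max is a standard
    # numerical-stability trick and does not change the result.
    e = np.exp(z - np.max(z, axis=-1, keepdims=True))
    return e / np.sum(e, axis=-1, keepdims=True)

def predict_top1(logits):
    # Decision rule of Equation (2.5): pick the class with the
    # highest normalized score.
    probs = softmax(logits)
    return int(np.argmax(probs, axis=-1)), probs

# Hypothetical logits z in R^K for K = 3 classes.
logits = np.array([2.0, 0.5, -1.0])
y_hat, p = predict_top1(logits)
print(y_hat, p)  # 0, approximately [0.786 0.175 0.039]
```

Since softmax is monotone, the argmax of the probabilities equals the argmax of the raw logits; the normalization matters only when the probabilities themselves are needed, as in the probabilistic evaluation discussed in this subsection.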