possible functions. The objective is to find a function f ∈ F that minimizes the risk, or even better, attains the Bayes risk

$$f^{*} = \inf_{f \in \mathcal{F}} R(f), \tag{2.2}$$

which is the minimum achievable risk over all functions in F. The latter is only realizable with infinite data or with access to the data-generating distribution P(X, Y). In practice, this distribution is unknown, so the risk in Equation (2.2) cannot be computed, and the goal is instead to find a function f̂ that minimizes the empirical risk

$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \hat{R}(f), \quad \text{where} \quad \hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} \ell\bigl(y_i, f(x_i)\bigr), \tag{2.3}$$

and (x_i, y_i) are N independently and identically distributed (i.i.d.) samples drawn from an unknown distribution P on X × Y. This is known as empirical risk minimization (ERM), a popular approach to supervised learning, under which three important processes are defined.

Training or model fitting is the process of estimating the parameters θ of a model, which is done by minimizing a suitable loss function ℓ over a training set D = {(x_i, y_i)}_{i=1}^{N} of N i.i.d. samples.

Inference or prediction is the process of estimating the output of a model for a given input, which is typically done by computing the posterior probability P(y|x) over the output space Y. In classification the output is a discrete label, while in regression it is a continuous value.

Evaluation involves measuring the quality of a model's predictions, which is typically done by computing a suitable evaluation metric over a test set D_test of i.i.d. samples that were not used for training.

However, ERM has its caveats concerning generalization to unseen data, requiring additional assumptions on the hypothesis class F, known as inductive biases, and/or regularization to penalize the complexity of the function class F [445]. In neural networks (discussed in detail in Section 2.1.1), the former is controlled by the architecture of the network, while the latter involves constraining the parameters or adding a regularization term Ψ(θ), weighted by λ, to the loss function:

$$\hat{f} = \operatorname*{arg\,min}_{f \in \mathcal{F}} \left[ \hat{R}(f) + \lambda \Psi(\theta) \right]. \tag{2.4}$$
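To make Equations (2.3) and (2.4) concrete, the following is a minimal sketch of regularized ERM under illustrative assumptions that are not taken from the text: a linear model f(x) = xᵀθ, a squared-error loss, an L2 penalty Ψ(θ) = ‖θ‖², synthetic data, and plain gradient descent in NumPy.

```python
import numpy as np

# Sketch of regularized empirical risk minimization (cf. Equations 2.3-2.4).
# Assumptions for illustration only: linear model f(x) = x @ theta,
# squared-error loss, L2 penalty Psi(theta) = ||theta||^2, gradient descent.

rng = np.random.default_rng(0)

# Synthetic i.i.d. training set D = {(x_i, y_i)}, i = 1..N
N, d = 200, 5
X = rng.normal(size=(N, d))
true_theta = rng.normal(size=d)
y = X @ true_theta + 0.1 * rng.normal(size=N)


def empirical_risk(theta, X, y):
    """R_hat(f) = (1/N) * sum_i loss(y_i, f(x_i)) with squared-error loss."""
    residuals = y - X @ theta
    return np.mean(residuals ** 2)


def objective(theta, X, y, lam):
    """Regularized objective R_hat(f) + lambda * Psi(theta)."""
    return empirical_risk(theta, X, y) + lam * np.sum(theta ** 2)


def fit(X, y, lam=1e-2, lr=0.1, steps=500):
    """Training: estimate theta by gradient descent on the regularized risk."""
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        # Gradient of the mean squared error plus the L2 penalty.
        grad = -2.0 / len(y) * X.T @ (y - X @ theta) + 2.0 * lam * theta
        theta -= lr * grad
    return theta


theta_hat = fit(X, y)              # training / model fitting
y_pred = X @ theta_hat             # inference / prediction
print("empirical risk:", empirical_risk(theta_hat, X, y))          # evaluation (here on training data)
print("regularized objective:", objective(theta_hat, X, y, 1e-2))
```

In a full experiment the evaluation step would use a held-out test set D_test rather than the training data, and λ would be selected by validation; the sketch above only illustrates how the loss, the penalty, and the minimization in Equation (2.4) fit together.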