Understanding Machine Learning: From Theory to Algorithms


152 Model Selection and Validation


Instead, we give a different error decomposition, one that can be estimated from
the train and validation sets.

$$L_{\mathcal{D}}(h_S) = (L_{\mathcal{D}}(h_S)-L_V(h_S)) + (L_V(h_S)-L_S(h_S)) + L_S(h_S).$$

The first term, $(L_{\mathcal{D}}(h_S)-L_V(h_S))$, can be bounded quite tightly using Theorem 11.1. Intuitively, when the second term, $(L_V(h_S)-L_S(h_S))$, is large we say that our algorithm suffers from "overfitting," while when the empirical risk term, $L_S(h_S)$, is large we say that our algorithm suffers from "underfitting." Note that these two terms are not necessarily good estimates of the estimation and approximation errors. To illustrate this, consider the case in which $\mathcal{H}$ is a class of VC-dimension $d$, and $\mathcal{D}$ is a distribution such that the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}$ is $1/4$. As long as the size of our training set is smaller than $d$ we will have $L_S(h_S) = 0$ for every ERM hypothesis. Therefore, the training risk, $L_S(h_S)$, and the approximation error, $L_{\mathcal{D}}(h^*)$, can be significantly different. Nevertheless, as we show later, the values of $L_S(h_S)$ and $(L_V(h_S)-L_S(h_S))$ still provide us with useful information.
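As a rough illustration (not part of the text), the sketch below computes the two observable quantities, the empirical risk $L_S(h_S)$ and the gap $L_V(h_S)-L_S(h_S)$, and flags overfitting or underfitting. The function name and threshold values are arbitrary choices for the example, not prescribed by the theory:

```python
def diagnose(train_err, val_err, overfit_gap=0.1, underfit_level=0.1):
    """Diagnose a learner from its train/validation errors.

    train_err approximates L_S(h_S); val_err approximates L_V(h_S).
    The two thresholds are illustrative, not dictated by the theory.
    """
    gap = val_err - train_err          # estimates L_V(h_S) - L_S(h_S)
    issues = []
    if gap > overfit_gap:              # large gap: "overfitting"
        issues.append("overfitting")
    if train_err > underfit_level:     # large empirical risk: "underfitting"
        issues.append("underfitting")
    return issues or ["no obvious problem"]

# Small training error but large validation error suggests overfitting.
print(diagnose(train_err=0.02, val_err=0.25))  # ['overfitting']
```

Note that both problems can be flagged at once: a large empirical risk together with a large gap calls for remedies on both fronts.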
Consider first the case in which $L_S(h_S)$ is large. We can write

$$L_S(h_S) = (L_S(h_S)-L_S(h^*)) + (L_S(h^*)-L_{\mathcal{D}}(h^*)) + L_{\mathcal{D}}(h^*).$$

When $h_S$ is an ERM$_{\mathcal{H}}$ hypothesis we have that $L_S(h_S)-L_S(h^*) \le 0$. In addition, since $h^*$ does not depend on $S$, the term $(L_S(h^*)-L_{\mathcal{D}}(h^*))$ can be bounded quite tightly (as in Theorem 11.1). The last term is the approximation error. It follows that if $L_S(h_S)$ is large then so is the approximation error, and the remedy to the failure of our algorithm should be tailored accordingly (as discussed previously).
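For concreteness, with made-up numbers (not from the text): suppose the bound of Theorem 11.1 gives $|L_S(h^*)-L_{\mathcal{D}}(h^*)| \le 0.05$ and we observe $L_S(h_S) = 0.30$. Rearranging the decomposition and using $L_S(h_S)-L_S(h^*) \le 0$,

$$L_{\mathcal{D}}(h^*) = L_S(h_S) - \big(L_S(h_S)-L_S(h^*)\big) - \big(L_S(h^*)-L_{\mathcal{D}}(h^*)\big) \ge 0.30 + 0 - 0.05 = 0.25,$$

so the approximation error must be at least $0.25$, and no amount of additional data will fix the problem within this class.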
Remark 11.1 It is possible that the approximation error of our class is small, yet the value of $L_S(h_S)$ is large. For example, maybe we had a bug in our ERM implementation, and the algorithm returns a hypothesis $h_S$ that is not an ERM.
It may also be the case that finding an ERM hypothesis is computationally hard,
and our algorithm applies some heuristic trying to find an approximate ERM. In
some cases, it is hard to know how good $h_S$ is relative to an ERM hypothesis. But,
sometimes it is possible at least to know whether there are better hypotheses.
For example, in the next chapter we will study convex learning problems in
which there are optimality conditions that can be checked to verify whether
our optimization algorithm converged to an ERM solution. In other cases, the
solution may depend on randomness in initializing the algorithm, so we can try
different randomly selected initial points to see whether better solutions pop out.
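The random-restart idea can be sketched as follows; `train_once` is a hypothetical placeholder for one run of a randomly initialized learner (here it fakes the empirical risk so the sketch is self-contained):

```python
import random

def train_once(data, seed):
    """Hypothetical stand-in for one randomly initialized training run.

    Returns (hypothesis, empirical_risk). The risk is faked with a seeded
    PRNG purely so the example runs without a real learner.
    """
    rng = random.Random(seed)
    empirical_risk = rng.uniform(0.0, 0.5)  # placeholder for L_S(h_S)
    return f"model-{seed}", empirical_risk

def best_of_restarts(data, n_restarts=10):
    """Rerun training from several random initializations and keep the
    hypothesis with the smallest empirical risk."""
    runs = (train_once(data, seed) for seed in range(n_restarts))
    return min(runs, key=lambda run: run[1])

model, risk = best_of_restarts(data=None, n_restarts=10)
print(model, risk)
```

If one restart attains a noticeably smaller empirical risk than the others, the earlier runs were likely stuck short of an ERM solution.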
Next consider the case in which $L_S(h_S)$ is small. As we argued before, this does not necessarily imply that the approximation error is small. Indeed, consider two scenarios, in both of which we are trying to learn a hypothesis class of VC-dimension $d$ using the ERM learning rule. In the first scenario, we have a training set of $m < d$ examples and the approximation error of the class is high. In the second scenario, we have a training set of $m > 2d$ examples and the