Understanding Machine Learning: From Theory to Algorithms


152 Model Selection and Validation


Instead, we give a different error decomposition, one that can be estimated from
the train and validation sets.

$$L_{\mathcal{D}}(h_S) = (L_{\mathcal{D}}(h_S)-L_V(h_S)) + (L_V(h_S)-L_S(h_S)) + L_S(h_S).$$

The first term, $(L_{\mathcal{D}}(h_S)-L_V(h_S))$, can be bounded quite tightly using Theorem 11.1. Intuitively, when the second term, $(L_V(h_S)-L_S(h_S))$, is large we say that our algorithm suffers from "overfitting," while when the empirical risk term, $L_S(h_S)$, is large we say that our algorithm suffers from "underfitting." Note that these two terms are not necessarily good estimates of the estimation and approximation errors. To illustrate this, consider the case in which $\mathcal{H}$ is a class of VC-dimension $d$, and $\mathcal{D}$ is a distribution such that the approximation error of $\mathcal{H}$ with respect to $\mathcal{D}$ is $1/4$. As long as the size of our training set is smaller than $d$ we will have $L_S(h_S) = 0$ for every ERM hypothesis. Therefore, the training risk, $L_S(h_S)$, and the approximation error, $L_{\mathcal{D}}(h^*)$, can be significantly different. Nevertheless, as we show later, the values of $L_S(h_S)$ and $(L_V(h_S)-L_S(h_S))$ still provide us with useful information.
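As a rough illustration (not part of the text), the sketch below computes the two observable quantities, the empirical risk $L_S(h_S)$ and the gap $L_V(h_S)-L_S(h_S)$, and flags overfitting or underfitting. The function name and threshold values are arbitrary choices for the example, not prescribed by the theory:

```python
def diagnose(train_err, val_err, overfit_gap=0.1, underfit_level=0.1):
    """Diagnose a learner from its train/validation errors.

    train_err approximates L_S(h_S); val_err approximates L_V(h_S).
    The two thresholds are illustrative, not dictated by the theory.
    """
    gap = val_err - train_err          # estimates L_V(h_S) - L_S(h_S)
    issues = []
    if gap > overfit_gap:              # large gap: "overfitting"
        issues.append("overfitting")
    if train_err > underfit_level:     # large empirical risk: "underfitting"
        issues.append("underfitting")
    return issues or ["no obvious problem"]

# Small training error but large validation error suggests overfitting.
print(diagnose(train_err=0.02, val_err=0.25))  # ['overfitting']
```

Note that both problems can be flagged at once: a large empirical risk together with a large gap calls for remedies on both fronts.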
Consider first the case in which $L_S(h_S)$ is large. We can write

$$L_S(h_S) = (L_S(h_S)-L_S(h^*)) + (L_S(h^*)-L_{\mathcal{D}}(h^*)) + L_{\mathcal{D}}(h^*).$$

When $h_S$ is an ERM$_{\mathcal{H}}$ hypothesis we have that $L_S(h_S)-L_S(h^*) \le 0$. In addition, since $h^*$ does not depend on $S$, the term $(L_S(h^*)-L_{\mathcal{D}}(h^*))$ can be bounded quite tightly (as in Theorem 11.1). The last term is the approximation error. It follows that if $L_S(h_S)$ is large then so is the approximation error, and the remedy to the failure of our algorithm should be tailored accordingly (as discussed previously).
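For concreteness, with made-up numbers (not from the text): suppose the bound of Theorem 11.1 gives $|L_S(h^*)-L_{\mathcal{D}}(h^*)| \le 0.05$ and we observe $L_S(h_S) = 0.30$. Rearranging the decomposition and using $L_S(h_S)-L_S(h^*) \le 0$,

$$L_{\mathcal{D}}(h^*) = L_S(h_S) - \big(L_S(h_S)-L_S(h^*)\big) - \big(L_S(h^*)-L_{\mathcal{D}}(h^*)\big) \ge 0.30 + 0 - 0.05 = 0.25,$$

so the approximation error must be at least $0.25$, and no amount of additional data will fix the problem within this class.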
Remark 11.1 It is possible that the approximation error of our class is small, yet the value of $L_S(h_S)$ is large. For example, maybe we had a bug in our ERM implementation, and the algorithm returns a hypothesis $h_S$ that is not an ERM.
It may also be the case that finding an ERM hypothesis is computationally hard,
and our algorithm applies some heuristic trying to find an approximate ERM. In
some cases, it is hard to know how good $h_S$ is relative to an ERM hypothesis. But,
sometimes it is possible at least to know whether there are better hypotheses.
For example, in the next chapter we will study convex learning problems in
which there are optimality conditions that can be checked to verify whether
our optimization algorithm converged to an ERM solution. In other cases, the
solution may depend on randomness in initializing the algorithm, so we can try
different randomly selected initial points to see whether better solutions pop out.
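The random-restart idea can be sketched as follows; `train_once` is a hypothetical placeholder for one run of a randomly initialized learner (here it fakes the empirical risk so the sketch is self-contained):

```python
import random

def train_once(data, seed):
    """Hypothetical stand-in for one randomly initialized training run.

    Returns (hypothesis, empirical_risk). The risk is faked with a seeded
    PRNG purely so the example runs without a real learner.
    """
    rng = random.Random(seed)
    empirical_risk = rng.uniform(0.0, 0.5)  # placeholder for L_S(h_S)
    return f"model-{seed}", empirical_risk

def best_of_restarts(data, n_restarts=10):
    """Rerun training from several random initializations and keep the
    hypothesis with the smallest empirical risk."""
    runs = (train_once(data, seed) for seed in range(n_restarts))
    return min(runs, key=lambda run: run[1])

model, risk = best_of_restarts(data=None, n_restarts=10)
print(model, risk)
```

If one restart attains a noticeably smaller empirical risk than the others, the earlier runs were likely stuck short of an ERM solution.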
Next consider the case in which $L_S(h_S)$ is small. As we argued before, this does not necessarily imply that the approximation error is small. Indeed, consider two scenarios, in both of which we are trying to learn a hypothesis class of VC-dimension $d$ using the ERM learning rule. In the first scenario, we have a training set of $m < d$ examples and the approximation error of the class is high. In the second scenario, we have a training set of $m > 2d$ examples and the