Understanding Machine Learning: From Theory to Algorithms


A Gentle Start


to the event $\forall i,\; h(x_i) = f(x_i)$. Since the examples in the training set are sampled i.i.d. we get that

$$\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) = \mathcal{D}^m(\{S|_x : \forall i,\, h(x_i) = f(x_i)\}) = \prod_{i=1}^{m} \mathcal{D}(\{x_i : h(x_i) = f(x_i)\}). \qquad (2.8)$$
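To make (2.8) concrete, here is a small numeric instance of ours (not from the text): if a fixed $h$ agrees with $f$ on a set of probability $0.9$ under $\mathcal{D}$ and $m = 3$, then

$$\mathcal{D}^3(\{S|_x : L_S(h) = 0\}) = 0.9 \times 0.9 \times 0.9 = 0.729,$$

since each of the three i.i.d. draws must independently land in the region where $h$ and $f$ agree.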

For each individual sampling of an element of the training set we have

$$\mathcal{D}(\{x_i : h(x_i) = y_i\}) = 1 - L_{(\mathcal{D},f)}(h) \le 1 - \epsilon,$$

where the last inequality follows from the fact that $h \in \mathcal{H}_B$. Combining the
previous equation with Equation (2.8) and using the inequality $1 - \epsilon \le e^{-\epsilon}$ we
obtain that for every $h \in \mathcal{H}_B$,

$$\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) \le (1-\epsilon)^m \le e^{-\epsilon m}. \qquad (2.9)$$
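The inequality $1-\epsilon \le e^{-\epsilon}$ is the standard fact that $e^{-x} \ge 1-x$ for every real $x$ (the tangent line of $e^{-x}$ at $x = 0$ lies below the curve). As a quick numeric check of (2.9), an illustration of ours rather than the book's: with $\epsilon = 0.1$ and $m = 50$,

$$(1-\epsilon)^m = 0.9^{50} \approx 0.0052 \;\le\; e^{-\epsilon m} = e^{-5} \approx 0.0067.$$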

Combining this equation with Equation (2.7) we conclude that

$$\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}) \;\le\; |\mathcal{H}_B|\, e^{-\epsilon m} \;\le\; |\mathcal{H}|\, e^{-\epsilon m}.$$
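To get a feel for the bound (a rearrangement of ours, anticipating Corollary 2.3 below): requiring $|\mathcal{H}|\, e^{-\epsilon m} \le \delta$ and solving for $m$ gives $m \ge \log(|\mathcal{H}|/\delta)/\epsilon$. For instance, with $|\mathcal{H}| = 1000$, $\delta = 0.01$, and $\epsilon = 0.1$,

$$m \;\ge\; \frac{\log(1000/0.01)}{0.1} = \frac{\log(10^5)}{0.1} \approx 115.1,$$

so any $m \ge 116$ already makes the failure probability at most $1\%$.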

A graphical illustration which explains how we used the union bound is given in
Figure 2.1.

Figure 2.1 Each point in the large circle represents a possible $m$-tuple of instances.
Each colored oval represents the set of "misleading" $m$-tuples of instances for some
"bad" predictor $h \in \mathcal{H}_B$. The ERM can potentially overfit whenever it gets a
misleading training set $S$; that is, for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$.
Equation (2.9) guarantees that for each individual bad hypothesis $h \in \mathcal{H}_B$, at most a
$(1-\epsilon)^m$-fraction of the training sets would be misleading. In particular, the larger $m$
is, the smaller each of these colored ovals becomes. The union bound formalizes the
fact that the area representing the training sets that are misleading with respect to
some $h \in \mathcal{H}_B$ (that is, the training sets in $M$) is at most the sum of the areas of the
colored ovals. Therefore, it is bounded by $|\mathcal{H}_B|$ times the maximum size of a colored
oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule to overfit.
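The picture in Figure 2.1 can also be checked empirically. The following Python sketch is our illustration, not the book's: the uniform domain, the threshold hypothesis class, and names such as `misleading_prob` are assumptions chosen for the example. It draws i.i.d. training sets, measures how often some bad hypothesis is consistent with the sample, and compares that frequency with the union bound $|\mathcal{H}_B|\, e^{-\epsilon m}$.

```python
import math
import random

# Toy setting: X = {0,...,9}, D uniform, H = thresholds h_t(x) = 1[x >= t],
# target f = h_5 (realizable), so L_(D,f)(h_t) = |t - 5| / 10.
DOMAIN = list(range(10))
THRESHOLDS = list(range(11))   # |H| = 11
TARGET_T = 5

def h(t, x):
    # Threshold predictor: label 1 iff x >= t.
    return int(x >= t)

def true_error(t):
    # True error of h_t with respect to the target h_5 under the uniform D.
    return sum(h(t, x) != h(TARGET_T, x) for x in DOMAIN) / len(DOMAIN)

def misleading_prob(m, eps, trials=20000):
    """Return (Monte Carlo estimate of D^m({S : some bad h has L_S(h) = 0}),
    the union bound |H_B| * exp(-eps * m))."""
    bad = [t for t in THRESHOLDS if true_error(t) > eps]   # H_B
    hits = 0
    for _ in range(trials):
        sample = [random.choice(DOMAIN) for _ in range(m)]
        # A sample is misleading if some bad hypothesis is consistent with it.
        if any(all(h(t, x) == h(TARGET_T, x) for x in sample) for t in bad):
            hits += 1
    return hits / trials, len(bad) * math.exp(-eps * m)

if __name__ == "__main__":
    for m in (5, 20, 50):
        est, bound = misleading_prob(m, eps=0.2)
        print(f"m={m:3d}  empirical={est:.4f}  union bound={bound:.4f}")
```

For small $m$ the bound can exceed 1 and is vacuous; as $m$ grows, both the empirical frequency and the bound shrink exponentially, which is exactly the shrinking-ovals picture of Figure 2.1.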

Corollary 2.3 Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$