Understanding Machine Learning: From Theory to Algorithms


A Gentle Start


to the event $\forall i,\; h(x_i) = f(x_i)$. Since the examples in the training set are sampled i.i.d. we get that

$$\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) = \mathcal{D}^m(\{S|_x : \forall i,\, h(x_i) = f(x_i)\}) = \prod_{i=1}^{m} \mathcal{D}(\{x_i : h(x_i) = f(x_i)\}). \qquad (2.8)$$
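To make (2.8) concrete, here is a small numeric instance of ours (not from the text): if a fixed $h$ agrees with $f$ on a set of probability $0.9$ under $\mathcal{D}$ and $m = 3$, then

$$\mathcal{D}^3(\{S|_x : L_S(h) = 0\}) = 0.9 \times 0.9 \times 0.9 = 0.729,$$

since each of the three i.i.d. draws must independently land in the region where $h$ and $f$ agree.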

For each individual sampling of an element of the training set we have

$$\mathcal{D}(\{x_i : h(x_i) = y_i\}) = 1 - L_{(\mathcal{D},f)}(h) \le 1 - \epsilon,$$

where the last inequality follows from the fact that $h \in \mathcal{H}_B$. Combining the
previous equation with Equation (2.8) and using the inequality $1 - \epsilon \le e^{-\epsilon}$ we
obtain that for every $h \in \mathcal{H}_B$,

$$\mathcal{D}^m(\{S|_x : L_S(h) = 0\}) \le (1-\epsilon)^m \le e^{-\epsilon m}. \qquad (2.9)$$
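The inequality $1-\epsilon \le e^{-\epsilon}$ is the standard fact that $e^{-x} \ge 1-x$ for every real $x$ (the tangent line of $e^{-x}$ at $x = 0$ lies below the curve). As a quick numeric check of (2.9), an illustration of ours rather than the book's: with $\epsilon = 0.1$ and $m = 50$,

$$(1-\epsilon)^m = 0.9^{50} \approx 0.0052 \;\le\; e^{-\epsilon m} = e^{-5} \approx 0.0067.$$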

Combining this equation with Equation (2.7) we conclude that

$$\mathcal{D}^m(\{S|_x : L_{(\mathcal{D},f)}(h_S) > \epsilon\}) \;\le\; |\mathcal{H}_B|\, e^{-\epsilon m} \;\le\; |\mathcal{H}|\, e^{-\epsilon m}.$$
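To get a feel for the bound (a rearrangement of ours, anticipating Corollary 2.3 below): requiring $|\mathcal{H}|\, e^{-\epsilon m} \le \delta$ and solving for $m$ gives $m \ge \log(|\mathcal{H}|/\delta)/\epsilon$. For instance, with $|\mathcal{H}| = 1000$, $\delta = 0.01$, and $\epsilon = 0.1$,

$$m \;\ge\; \frac{\log(1000/0.01)}{0.1} = \frac{\log(10^5)}{0.1} \approx 115.1,$$

so any $m \ge 116$ already makes the failure probability at most $1\%$.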

A graphical illustration which explains how we used the union bound is given in
Figure 2.1.

Figure 2.1 Each point in the large circle represents a possible $m$-tuple of instances.
Each colored oval represents the set of "misleading" $m$-tuples of instances for some
"bad" predictor $h \in \mathcal{H}_B$. The ERM can potentially overfit whenever it gets a
misleading training set $S$; that is, for some $h \in \mathcal{H}_B$ we have $L_S(h) = 0$.
Equation (2.9) guarantees that for each individual bad hypothesis $h \in \mathcal{H}_B$, at most a
$(1-\epsilon)^m$-fraction of the training sets would be misleading. In particular, the larger $m$
is, the smaller each of these colored ovals becomes. The union bound formalizes the
fact that the area representing the training sets that are misleading with respect to
some $h \in \mathcal{H}_B$ (that is, the training sets in $M$) is at most the sum of the areas of the
colored ovals. Therefore, it is bounded by $|\mathcal{H}_B|$ times the maximum size of a colored
oval. Any sample $S$ outside the colored ovals cannot cause the ERM rule to overfit.
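The picture in Figure 2.1 can also be checked empirically. The following Python sketch is our illustration, not the book's: the uniform domain, the threshold hypothesis class, and names such as `misleading_prob` are assumptions chosen for the example. It draws i.i.d. training sets, measures how often some bad hypothesis is consistent with the sample, and compares that frequency with the union bound $|\mathcal{H}_B|\, e^{-\epsilon m}$.

```python
import math
import random

# Toy setting: X = {0,...,9}, D uniform, H = thresholds h_t(x) = 1[x >= t],
# target f = h_5 (realizable), so L_(D,f)(h_t) = |t - 5| / 10.
DOMAIN = list(range(10))
THRESHOLDS = list(range(11))   # |H| = 11
TARGET_T = 5

def h(t, x):
    # Threshold predictor: label 1 iff x >= t.
    return int(x >= t)

def true_error(t):
    # True error of h_t with respect to the target h_5 under the uniform D.
    return sum(h(t, x) != h(TARGET_T, x) for x in DOMAIN) / len(DOMAIN)

def misleading_prob(m, eps, trials=20000):
    """Return (Monte Carlo estimate of D^m({S : some bad h has L_S(h) = 0}),
    the union bound |H_B| * exp(-eps * m))."""
    bad = [t for t in THRESHOLDS if true_error(t) > eps]   # H_B
    hits = 0
    for _ in range(trials):
        sample = [random.choice(DOMAIN) for _ in range(m)]
        # A sample is misleading if some bad hypothesis is consistent with it.
        if any(all(h(t, x) == h(TARGET_T, x) for x in sample) for t in bad):
            hits += 1
    return hits / trials, len(bad) * math.exp(-eps * m)

if __name__ == "__main__":
    for m in (5, 20, 50):
        est, bound = misleading_prob(m, eps=0.2)
        print(f"m={m:3d}  empirical={est:.4f}  union bound={bound:.4f}")
```

For small $m$ the bound can exceed 1 and is vacuous; as $m$ grows, both the empirical frequency and the bound shrink exponentially, which is exactly the shrinking-ovals picture of Figure 2.1.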

Corollary 2.3 Let $\mathcal{H}$ be a finite hypothesis class. Let $\delta \in (0,1)$ and $\epsilon > 0$