Understanding Machine Learning: From Theory to Algorithms

21.2 Online Classification in the Unrealizable Case

Theorem 21.11 tells us that the expected number of mistakes of Weighted-Majority is at most the number of mistakes of the best expert plus $\sqrt{2\log(d)\,T}$. We will next show that the number of mistakes of the best expert is at most the number of mistakes of the best hypothesis in $H$. The following key lemma shows that, on any sequence of instances, for each hypothesis $h \in H$ there exists an expert with the same behavior.
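To make the bound of Theorem 21.11 concrete, here is a minimal Python sketch of a randomized Weighted-Majority learner over $d$ experts, of the kind the theorem analyzes. The function name and signature are illustrative, and the learning rate $\eta = \sqrt{2\log(d)/T}$ is the standard choice behind a $\sqrt{2\log(d)\,T}$ regret term; treat this as a sketch rather than the book's exact pseudocode.

```python
import math
import random

def weighted_majority(expert_preds, labels, eta=None):
    """Randomized Weighted-Majority over d experts on a T-round binary sequence.

    expert_preds[t][i] in {0, 1} is expert i's prediction on round t,
    labels[t] in {0, 1} is the true label. Returns the learner's mistake count.
    """
    T, d = len(expert_preds), len(expert_preds[0])
    if eta is None:
        eta = math.sqrt(2.0 * math.log(d) / T)  # learning rate behind the sqrt(2 log(d) T) term
    w = [1.0] * d                               # unnormalized expert weights
    mistakes = 0
    for t in range(T):
        # Sample an expert with probability proportional to its current weight
        # and copy its prediction.
        chosen = random.choices(range(d), weights=w, k=1)[0]
        y_hat = expert_preds[t][chosen]
        mistakes += int(y_hat != labels[t])
        # Multiplicatively penalize each expert according to its loss on this round.
        w = [w[i] * math.exp(-eta * abs(expert_preds[t][i] - labels[t])) for i in range(d)]
    return mistakes
```

In the context of this section, the $d$ experts are the predictors $\mathrm{Expert}(i_1,\ldots,i_L)$ constructed below, one for each admissible choice of $L$ and indices.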


Lemma 21.13  Let $H$ be any hypothesis class with $\mathrm{Ldim}(H) < \infty$. Let $x_1, x_2, \ldots, x_T$ be any sequence of instances. For any $h \in H$, there exist $L \le \mathrm{Ldim}(H)$ and indices $1 \le i_1 < i_2 < \cdots < i_L \le T$ such that when running $\mathrm{Expert}(i_1, i_2, \ldots, i_L)$ on the sequence $x_1, x_2, \ldots, x_T$, the expert predicts $h(x_t)$ on each online round $t = 1, 2, \ldots, T$.


Proof  Fix $h \in H$ and the sequence $x_1, x_2, \ldots, x_T$. We must construct $L$ and the indices $i_1, i_2, \ldots, i_L$. Consider running SOA on the input $(x_1, h(x_1)), (x_2, h(x_2)), \ldots, (x_T, h(x_T))$. SOA makes at most $\mathrm{Ldim}(H)$ mistakes on such input. We define $L$ to be the number of mistakes made by SOA and we define $\{i_1, i_2, \ldots, i_L\}$ to be the set of rounds on which SOA made those mistakes.

Now consider $\mathrm{Expert}(i_1, i_2, \ldots, i_L)$ running on the sequence $x_1, x_2, \ldots, x_T$. By construction, the set $V_t$ maintained by $\mathrm{Expert}(i_1, i_2, \ldots, i_L)$ equals the set $V_t$ maintained by SOA when running on the sequence $(x_1, h(x_1)), \ldots, (x_T, h(x_T))$. The predictions of SOA differ from the predictions of $h$ if and only if the round is in $\{i_1, i_2, \ldots, i_L\}$. Since $\mathrm{Expert}(i_1, i_2, \ldots, i_L)$ predicts exactly like SOA when $t$ is not in $\{i_1, i_2, \ldots, i_L\}$, and the opposite of SOA's prediction when $t$ is in $\{i_1, i_2, \ldots, i_L\}$, we conclude that the predictions of the expert are always the same as the predictions of $h$.
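The construction in the proof is easy to spell out in code. The following Python sketch assumes $H$ is given as a finite list of hypotheses (Python callables into $\{0,1\}$) and that an Ldim oracle `ldim` for subclasses is available; the helper names and the tie-breaking rule are illustrative assumptions, since SOA and $\mathrm{Expert}$ are only referenced, not restated, in this excerpt.

```python
def soa_prediction(V, x, ldim):
    """SOA's prediction rule: predict the label whose consistent restriction of V
    has the larger Littlestone dimension (ties broken toward 0, arbitrarily)."""
    V0 = [h for h in V if h(x) == 0]
    V1 = [h for h in V if h(x) == 1]
    return 0 if ldim(V0) >= ldim(V1) else 1

def soa_mistake_rounds(H, xs, h_star, ldim):
    """Run SOA on (x_1, h*(x_1)), ..., (x_T, h*(x_T)) and return the rounds on which
    it errs; these are the indices i_1 < ... < i_L promised by Lemma 21.13."""
    V, rounds = list(H), []
    for t, x in enumerate(xs):
        y = h_star(x)
        if soa_prediction(V, x, ldim) != y:
            rounds.append(t)
        V = [h for h in V if h(x) == y]          # keep hypotheses consistent with the label
    return rounds

def expert_predictions(H, xs, flip_rounds, ldim):
    """Expert(i_1, ..., i_L): follow SOA, but flip its prediction on the given rounds,
    and update the version space with the expert's own prediction."""
    V, preds = list(H), []
    for t, x in enumerate(xs):
        y_hat = soa_prediction(V, x, ldim)
        if t in flip_rounds:
            y_hat = 1 - y_hat
        preds.append(y_hat)
        V = [h for h in V if h(x) == y_hat]      # same version space as the SOA run above
    return preds
```

With `flip_rounds = set(soa_mistake_rounds(H, xs, h_star, ldim))`, the list returned by `expert_predictions` coincides with `[h_star(x) for x in xs]`, which is exactly the statement of the lemma.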


The previous lemma holds, in particular, for the hypothesis in $H$ that makes the least number of mistakes on the sequence of examples, and we therefore obtain the following:


Corollary 21.14  Let $(x_1, y_1), (x_2, y_2), \ldots, (x_T, y_T)$ be a sequence of examples and let $H$ be a hypothesis class with $\mathrm{Ldim}(H) < \infty$. There exist $L \le \mathrm{Ldim}(H)$ and indices $1 \le i_1 < i_2 < \cdots < i_L \le T$ such that $\mathrm{Expert}(i_1, i_2, \ldots, i_L)$ makes at most as many mistakes as the best $h \in H$ does, namely,

$$\min_{h \in H} \sum_{t=1}^{T} |h(x_t) - y_t|$$

mistakes on the sequence of examples.


Combining this corollary with Theorem 21.11 proves the upper bound part of Theorem 21.10.
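Spelling the combination out, under the natural reading (not restated in this excerpt) that one expert is constructed for every choice of $L \le \mathrm{Ldim}(H)$ and of indices $i_1 < \cdots < i_L$:

$$\mathbb{E}\!\left[\sum_{t=1}^{T} |\hat{y}_t - y_t|\right] \;\le\; \min_{h \in H} \sum_{t=1}^{T} |h(x_t) - y_t| \;+\; \sqrt{2\log(d)\,T}, \qquad d \;=\; \sum_{L=0}^{\mathrm{Ldim}(H)} \binom{T}{L}.$$

Since $\sum_{L=0}^{k} \binom{T}{L} \le (eT/k)^k$ for $k \le T$, the additive term is at most $\sqrt{2\,\mathrm{Ldim}(H)\log\!\big(eT/\mathrm{Ldim}(H)\big)\,T}$, which is the order of the regret bound asserted in Theorem 21.10 (up to the exact constants stated there).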
