Understanding Machine Learning: From Theory to Algorithms

$w \in \mathbb{R}^d\}$. In Section 9.1.2 we have presented the batch version of the Perceptron, which aims to solve the ERM problem with respect to $\mathcal{H}$. We now present an online version of the Perceptron algorithm.

Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$. On round $t$, the learner receives a vector $x_t \in \mathbb{R}^d$. The learner maintains a weight vector $w^{(t)} \in \mathbb{R}^d$ and predicts $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$. Then, it receives $y_t \in \mathcal{Y}$ and pays 1 if $p_t \neq y_t$ and 0 otherwise.
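
To make the protocol concrete, here is a minimal sketch (not from the text) of the prediction/feedback loop, assuming a hypothetical learner object that exposes `predict(x)` and `update(x, y)` methods and an arbitrary stream of labeled examples.

```python
def run_online_rounds(learner, examples):
    """Online prediction protocol: on each round the learner predicts a label,
    the true label is then revealed, and the learner pays the 0-1 loss.

    `learner` is a hypothetical object with predict(x) -> {-1, +1} and
    update(x, y); `examples` is any iterable of (x_t, y_t) pairs.
    """
    mistakes = 0
    for x_t, y_t in examples:
        p_t = learner.predict(x_t)        # e.g. sign(<w^(t), x_t>) for a linear learner
        mistakes += int(p_t != y_t)       # pay 1 if p_t != y_t, and 0 otherwise
        learner.update(x_t, y_t)          # the learner may now revise its weight vector
    return mistakes
```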
The goal of the learner is to make as few prediction mistakes as possible. In Section 21.1 we characterized the optimal algorithm and showed that the best achievable mistake bound depends on the Littlestone dimension of the class. We now show that if $d \ge 2$ then $\mathrm{Ldim}(\mathcal{H}) = \infty$, which implies that we have no hope of making few prediction mistakes. Indeed, consider the tree for which $v_1 = (\tfrac{1}{2}, 1, 0, \ldots, 0)$, $v_2 = (\tfrac{1}{4}, 1, 0, \ldots, 0)$, $v_3 = (\tfrac{3}{4}, 1, 0, \ldots, 0)$, etc. Because of the density of the reals, this tree is shattered by the subset of $\mathcal{H}$ which contains all hypotheses parametrized by $w$ of the form $w = (-1, a, 0, \ldots, 0)$, for $a \in [0, 1]$: for such a $w$ and an instance $v = (\theta, 1, 0, \ldots, 0)$ we have $\langle w, v \rangle = a - \theta$, so these hypotheses realize every threshold prediction on the first coordinate. We conclude that indeed $\mathrm{Ldim}(\mathcal{H}) = \infty$.
To sidestep this impossibility result, the Perceptron algorithm relies on the technique of surrogate convex losses (see Section 12.3). This is also closely related to the notion of margin we studied in Chapter 15.
A weight vector $w$ makes a mistake on an example $(x, y)$ whenever the sign of $\langle w, x \rangle$ does not equal $y$. Therefore, we can write the 0-1 loss function as follows:

$$\ell(w, (x, y)) = \mathbb{1}_{[y \langle w, x \rangle \le 0]}.$$

On rounds on which the algorithm makes a prediction mistake, we shall use the hinge-loss as a surrogate convex loss function:

$$f_t(w) = \max\{0,\ 1 - y_t \langle w, x_t \rangle\}.$$
The hinge-loss satisfies the two conditions:

  • $f_t$ is a convex function.

  • For all $w$, $f_t(w) \ge \ell(w, (x_t, y_t))$. In particular, this holds for $w^{(t)}$.

On rounds on which the algorithm is correct, we shall define $f_t(w) = 0$. Clearly, $f_t$ is convex in this case as well. Furthermore, $f_t(w^{(t)}) = \ell(w^{(t)}, (x_t, y_t)) = 0$.
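
The following is a minimal sketch (not from the text, assuming NumPy) of this round-specific surrogate: it returns the hinge-loss on mistake rounds and the zero function on correct rounds, so in both cases it is convex and upper bounds the 0-1 loss of $w^{(t)}$ on $(x_t, y_t)$.

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0-1 loss: 1 if sign(<w, x>) disagrees with y (ties counted as mistakes)."""
    return 1.0 if y * np.dot(w, x) <= 0 else 0.0

def surrogate(w_t, x_t, y_t):
    """Round-specific surrogate f_t, defined relative to the current iterate w_t:
    the hinge-loss on mistake rounds, and the zero function on correct rounds."""
    if y_t * np.dot(w_t, x_t) <= 0:                             # mistake round
        return lambda w: max(0.0, 1.0 - y_t * np.dot(w, x_t))   # hinge-loss
    return lambda w: 0.0                                        # correct round: f_t = 0

# In both cases f_t is convex and f_t(w_t) >= zero_one_loss(w_t, x_t, y_t).
```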
Remark 21.5 In Section 12.3 we used the same surrogate loss function for all the examples. In the online model, we allow the surrogate to depend on the specific round. It can even depend on $w^{(t)}$. Our ability to use a round-specific surrogate stems from the worst-case type of analysis we employ in online learning.
Let us now run the Online Gradient Descent algorithm on the sequence of functions, $f_1, \ldots, f_T$, with the hypothesis class being all vectors in $\mathbb{R}^d$ (hence, the projection step is vacuous). Recall that the algorithm initializes $w^{(1)} = 0$ and its update rule is

$$w^{(t+1)} = w^{(t)} - \eta v_t$$

for some $v_t \in \partial f_t(w^{(t)})$. In our case, if $y_t \langle w^{(t)}, x_t \rangle > 0$ then $f_t$ is the zero function and we can take $v_t = 0$.
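
Putting the pieces together, here is a minimal sketch (not from the text) of Online Gradient Descent on these surrogates. It assumes that on mistake rounds one takes the hinge-loss subgradient $v_t = -y_t x_t$ at $w^{(t)}$ (a standard choice when $y_t \langle w^{(t)}, x_t \rangle \le 0$), so the update $w^{(t+1)} = w^{(t)} - \eta v_t$ reduces to the familiar Perceptron step $w^{(t+1)} = w^{(t)} + \eta y_t x_t$ on mistake rounds and leaves $w^{(t)}$ unchanged otherwise.

```python
import numpy as np

def online_perceptron_ogd(stream, d, eta=1.0):
    """Online Gradient Descent on the round-specific surrogates f_1, ..., f_T.

    `stream` is any iterable of (x_t, y_t) pairs with x_t in R^d and
    y_t in {-1, +1}; `d` is the dimension and `eta` the step size.
    Returns the final weight vector and the number of prediction mistakes.
    """
    w = np.zeros(d)                        # w^(1) = 0
    mistakes = 0
    for x_t, y_t in stream:
        if y_t * np.dot(w, x_t) <= 0:      # mistake round (ties counted as mistakes)
            mistakes += 1
            w = w + eta * y_t * x_t        # w - eta * v_t with v_t = -y_t * x_t
        # correct round: f_t is the zero function, so v_t = 0 and w is unchanged
    return w, mistakes
```

Note that with $\eta = 1$ this is exactly the classic Perceptron update; moreover, since $w^{(1)} = 0$, any $\eta > 0$ only rescales the iterates and leaves all predictions unchanged.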