Understanding Machine Learning: From Theory to Algorithms

$w \in \mathbb{R}^d\}$. In Section 9.1.2 we have presented the batch version of the Perceptron, which aims to solve the ERM problem with respect to $\mathcal{H}$. We now present an online version of the Perceptron algorithm.

Let $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{-1, 1\}$. On round $t$, the learner receives a vector $x_t \in \mathbb{R}^d$. The learner maintains a weight vector $w^{(t)} \in \mathbb{R}^d$ and predicts $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$. Then, it receives $y_t \in \mathcal{Y}$ and pays 1 if $p_t \neq y_t$ and 0 otherwise.
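
To make the protocol concrete, here is a minimal sketch (not from the text) of the prediction/feedback loop, assuming a hypothetical learner object that exposes `predict(x)` and `update(x, y)` methods and an arbitrary stream of labeled examples.

```python
def run_online_rounds(learner, examples):
    """Online prediction protocol: on each round the learner predicts a label,
    the true label is then revealed, and the learner pays the 0-1 loss.

    `learner` is a hypothetical object with predict(x) -> {-1, +1} and
    update(x, y); `examples` is any iterable of (x_t, y_t) pairs.
    """
    mistakes = 0
    for x_t, y_t in examples:
        p_t = learner.predict(x_t)        # e.g. sign(<w^(t), x_t>) for a linear learner
        mistakes += int(p_t != y_t)       # pay 1 if p_t != y_t, and 0 otherwise
        learner.update(x_t, y_t)          # the learner may now revise its weight vector
    return mistakes
```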
The goal of the learner is to make as few prediction mistakes as possible. In Section 21.1 we characterized the optimal algorithm and showed that the best achievable mistake bound depends on the Littlestone dimension of the class. We now show that if $d \ge 2$ then $\mathrm{Ldim}(\mathcal{H}) = \infty$, which implies that we have no hope of making few prediction mistakes. Indeed, consider the tree for which $v_1 = (\tfrac{1}{2}, 1, 0, \ldots, 0)$, $v_2 = (\tfrac{1}{4}, 1, 0, \ldots, 0)$, $v_3 = (\tfrac{3}{4}, 1, 0, \ldots, 0)$, etc. Because of the density of the reals, this tree is shattered by the subset of $\mathcal{H}$ which contains all hypotheses parametrized by $w$ of the form $w = (-1, a, 0, \ldots, 0)$, for $a \in [0, 1]$: for such a $w$ and an instance $v = (\theta, 1, 0, \ldots, 0)$ we have $\langle w, v \rangle = a - \theta$, so these hypotheses realize every threshold prediction on the first coordinate. We conclude that indeed $\mathrm{Ldim}(\mathcal{H}) = \infty$.
To sidestep this impossibility result, the Perceptron algorithm relies on the technique of surrogate convex losses (see Section 12.3). This is also closely related to the notion of margin we studied in Chapter 15.
A weight vector $w$ makes a mistake on an example $(x, y)$ whenever the sign of $\langle w, x \rangle$ does not equal $y$. Therefore, we can write the 0-1 loss function as follows:

$$\ell(w, (x, y)) = \mathbb{1}_{[y \langle w, x \rangle \le 0]}.$$

On rounds on which the algorithm makes a prediction mistake, we shall use the hinge-loss as a surrogate convex loss function:

$$f_t(w) = \max\{0,\ 1 - y_t \langle w, x_t \rangle\}.$$
The hinge-loss satisfies the two conditions:

  • $f_t$ is a convex function.

  • For all $w$, $f_t(w) \ge \ell(w, (x_t, y_t))$. In particular, this holds for $w^{(t)}$.

On rounds on which the algorithm is correct, we shall define $f_t(w) = 0$. Clearly, $f_t$ is convex in this case as well. Furthermore, $f_t(w^{(t)}) = \ell(w^{(t)}, (x_t, y_t)) = 0$.
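
The following is a minimal sketch (not from the text, assuming NumPy) of this round-specific surrogate: it returns the hinge-loss on mistake rounds and the zero function on correct rounds, so in both cases it is convex and upper bounds the 0-1 loss of $w^{(t)}$ on $(x_t, y_t)$.

```python
import numpy as np

def zero_one_loss(w, x, y):
    """0-1 loss: 1 if sign(<w, x>) disagrees with y (ties counted as mistakes)."""
    return 1.0 if y * np.dot(w, x) <= 0 else 0.0

def surrogate(w_t, x_t, y_t):
    """Round-specific surrogate f_t, defined relative to the current iterate w_t:
    the hinge-loss on mistake rounds, and the zero function on correct rounds."""
    if y_t * np.dot(w_t, x_t) <= 0:                             # mistake round
        return lambda w: max(0.0, 1.0 - y_t * np.dot(w, x_t))   # hinge-loss
    return lambda w: 0.0                                        # correct round: f_t = 0

# In both cases f_t is convex and f_t(w_t) >= zero_one_loss(w_t, x_t, y_t).
```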
Remark 21.5 In Section 12.3 we used the same surrogate loss function for all the examples. In the online model, we allow the surrogate to depend on the specific round. It can even depend on $w^{(t)}$. Our ability to use a round-specific surrogate stems from the worst-case type of analysis we employ in online learning.
Let us now run the Online Gradient Descent algorithm on the sequence of functions, $f_1, \ldots, f_T$, with the hypothesis class being all vectors in $\mathbb{R}^d$ (hence, the projection step is vacuous). Recall that the algorithm initializes $w^{(1)} = 0$ and its update rule is

$$w^{(t+1)} = w^{(t)} - \eta v_t$$

for some $v_t \in \partial f_t(w^{(t)})$. In our case, if $y_t \langle w^{(t)}, x_t \rangle > 0$ then $f_t$ is the zero function and we can take $v_t = 0$.
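
Putting the pieces together, here is a minimal sketch (not from the text) of Online Gradient Descent on these surrogates. It assumes that on mistake rounds one takes the hinge-loss subgradient $v_t = -y_t x_t$ at $w^{(t)}$ (a standard choice when $y_t \langle w^{(t)}, x_t \rangle \le 0$), so the update $w^{(t+1)} = w^{(t)} - \eta v_t$ reduces to the familiar Perceptron step $w^{(t+1)} = w^{(t)} + \eta y_t x_t$ on mistake rounds and leaves $w^{(t)}$ unchanged otherwise.

```python
import numpy as np

def online_perceptron_ogd(stream, d, eta=1.0):
    """Online Gradient Descent on the round-specific surrogates f_1, ..., f_T.

    `stream` is any iterable of (x_t, y_t) pairs with x_t in R^d and
    y_t in {-1, +1}; `d` is the dimension and `eta` the step size.
    Returns the final weight vector and the number of prediction mistakes.
    """
    w = np.zeros(d)                        # w^(1) = 0
    mistakes = 0
    for x_t, y_t in stream:
        if y_t * np.dot(w, x_t) <= 0:      # mistake round (ties counted as mistakes)
            mistakes += 1
            w = w + eta * y_t * x_t        # w - eta * v_t with v_t = -y_t * x_t
        # correct round: f_t is the zero function, so v_t = 0 and w is unchanged
    return w, mistakes
```

Note that with $\eta = 1$ this is exactly the classic Perceptron update; moreover, since $w^{(1)} = 0$, any $\eta > 0$ only rescales the iterates and leaves all predictions unchanged.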