Understanding Machine Learning: From Theory to Algorithms

21.4 The Online Perceptron Algorithm

function and we can take $v_t = 0$. Otherwise, it is easy to verify that $v_t = -y_t x_t$ is in $\partial f_t(w^{(t)})$. We therefore obtain the update rule

\[
w^{(t+1)} =
\begin{cases}
w^{(t)} & \text{if } y_t \langle w^{(t)}, x_t \rangle > 0 \\
w^{(t)} + \eta\, y_t x_t & \text{otherwise.}
\end{cases}
\]

Denote by $M$ the set of rounds in which $\mathrm{sign}(\langle w^{(t)}, x_t \rangle) \neq y_t$. Note that on round $t$, the prediction of the Perceptron can be rewritten as

\[
p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle) = \mathrm{sign}\left( \eta \sum_{i \in M : i < t} y_i \langle x_i, x_t \rangle \right).
\]

This form implies that the predictions of the Perceptron algorithm and the set
$M$ do not depend on the actual value of $\eta$ as long as $\eta > 0$. We have therefore
obtained the Perceptron algorithm:


Perceptron
initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$
  receive $x_t$
  predict $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$
  if $y_t \langle w^{(t)}, x_t \rangle \le 0$
    $w^{(t+1)} = w^{(t)} + y_t x_t$
  else
    $w^{(t+1)} = w^{(t)}$
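
In code, the algorithm is just the update rule above applied online. The following is a minimal NumPy sketch for illustration; the function name run_perceptron, the synthetic data, and the closing check of the $\eta$-independent, kernel-style form of the predictions are our additions and not part of the algorithm box.

import numpy as np

def run_perceptron(X, y):
    # Online Perceptron on a sequence of examples (x_t, y_t) with y_t in {+1, -1}.
    # Returns the per-round predictions, the mistake rounds M, and the final weights.
    T, d = X.shape
    w = np.zeros(d)                           # w^(1) = 0
    predictions, mistake_rounds = [], []
    for t in range(T):
        x_t, y_t = X[t], y[t]
        predictions.append(np.sign(w @ x_t))  # p_t = sign(<w^(t), x_t>)
        if y_t * (w @ x_t) <= 0:              # mistake (or zero margin): update
            w = w + y_t * x_t                 # w^(t+1) = w^(t) + y_t x_t
            mistake_rounds.append(t)
    return np.array(predictions), mistake_rounds, w

# Usage on synthetic linearly separable data (illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = np.sign(X @ w_true)

preds, M, w_final = run_perceptron(X, y)
print("number of mistakes |M| =", len(M))

# For any eta > 0, the prediction on round t equals
# sign( eta * sum_{i in M, i < t} y_i <x_i, x_t> ), as derived above.
eta = 3.7
for t in range(len(X)):
    score = eta * sum(y[i] * (X[i] @ X[t]) for i in M if i < t)
    assert np.sign(score) == preds[t]

Because the loop adds $y_t x_t$ only on mistake rounds, the weight vector held before round $t$ is exactly $\sum_{i \in M : i < t} y_i x_i$, which is why the final check passes for any $\eta > 0$.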

To analyze the Perceptron, we rely on the analysis of Online Gradient Descent given in the previous section. In our case, the subgradient of $f_t$ we use in the Perceptron is $v_t = -\mathbb{1}_{[y_t \langle w^{(t)}, x_t \rangle \le 0]}\, y_t x_t$. Indeed, the Perceptron's update is $w^{(t+1)} = w^{(t)} - v_t$, and as discussed before this is equivalent to $w^{(t+1)} = w^{(t)} - \eta v_t$ for every $\eta > 0$. Therefore, Theorem 21.15 tells us that
\[
\sum_{t=1}^{T} f_t(w^{(t)}) - \sum_{t=1}^{T} f_t(w^\star) \le \frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|_2^2 .
\]

Since $f_t(w^{(t)})$ is a surrogate for the $0{-}1$ loss (it is zero on rounds with no mistake, and at least $1$ on every round in $M$), we know that

\[
\sum_{t=1}^{T} f_t(w^{(t)}) \ge |M| .
\]

Denote $R = \max_t \|x_t\|$. Since $v_t = 0$ on rounds outside $M$ and $\|v_t\| = \|x_t\| \le R$ on rounds in $M$, we have $\sum_{t=1}^{T}\|v_t\|_2^2 \le |M| R^2$, and we obtain

\[
|M| - \sum_{t=1}^{T} f_t(w^\star) \le \frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}|M| R^2 .
\]

Setting $\eta = \frac{\|w^\star\|}{R\sqrt{|M|}}$ and rearranging, we obtain

\[
|M| - R\,\|w^\star\|\sqrt{|M|} - \sum_{t=1}^{T} f_t(w^\star) \le 0 . \tag{21.6}
\]
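
For completeness, one can verify this step directly: with $\eta = \frac{\|w^\star\|}{R\sqrt{|M|}}$, the right-hand side of the previous bound becomes

\[
\frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}|M|R^2
= \frac{R\sqrt{|M|}}{2\|w^\star\|}\,\|w^\star\|_2^2 + \frac{\|w^\star\|}{2R\sqrt{|M|}}\,|M|R^2
= R\,\|w^\star\|\sqrt{|M|},
\]

so the bound reads $|M| - \sum_{t=1}^{T} f_t(w^\star) \le R\,\|w^\star\|\sqrt{|M|}$, which is (21.6) after rearranging. This choice of $\eta$ is in fact the minimizer of the right-hand side, since an expression of the form $\frac{a}{2\eta} + \frac{\eta b}{2}$ is minimized at $\eta = \sqrt{a/b}$; here $a = \|w^\star\|_2^2$ and $b = |M|R^2$.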

This inequality implies
