Understanding Machine Learning: From Theory to Algorithms

21.4 The Online Perceptron Algorithm

function and we can take $v_t = 0$. Otherwise, it is easy to verify that $v_t = -y_t x_t$ is in $\partial f_t(w^{(t)})$. We therefore obtain the update rule

\[
w^{(t+1)} =
\begin{cases}
w^{(t)} & \text{if } y_t \langle w^{(t)}, x_t \rangle > 0 \\
w^{(t)} + \eta\, y_t x_t & \text{otherwise.}
\end{cases}
\]

Denote by $M$ the set of rounds in which $\mathrm{sign}(\langle w^{(t)}, x_t \rangle) \neq y_t$. Note that on round $t$, the prediction of the Perceptron can be rewritten as

\[
p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle) = \mathrm{sign}\left( \eta \sum_{i \in M : i < t} y_i \langle x_i, x_t \rangle \right).
\]

This form implies that the predictions of the Perceptron algorithm and the set
$M$ do not depend on the actual value of $\eta$ as long as $\eta > 0$. We have therefore
obtained the Perceptron algorithm:


Perceptron
initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$
  receive $x_t$
  predict $p_t = \mathrm{sign}(\langle w^{(t)}, x_t \rangle)$
  if $y_t \langle w^{(t)}, x_t \rangle \le 0$
    $w^{(t+1)} = w^{(t)} + y_t x_t$
  else
    $w^{(t+1)} = w^{(t)}$
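
In code, the algorithm is just the update rule above applied online. The following is a minimal NumPy sketch for illustration; the function name run_perceptron, the synthetic data, and the closing check of the $\eta$-independent, kernel-style form of the predictions are our additions and not part of the algorithm box.

import numpy as np

def run_perceptron(X, y):
    # Online Perceptron on a sequence of examples (x_t, y_t) with y_t in {+1, -1}.
    # Returns the per-round predictions, the mistake rounds M, and the final weights.
    T, d = X.shape
    w = np.zeros(d)                           # w^(1) = 0
    predictions, mistake_rounds = [], []
    for t in range(T):
        x_t, y_t = X[t], y[t]
        predictions.append(np.sign(w @ x_t))  # p_t = sign(<w^(t), x_t>)
        if y_t * (w @ x_t) <= 0:              # mistake (or zero margin): update
            w = w + y_t * x_t                 # w^(t+1) = w^(t) + y_t x_t
            mistake_rounds.append(t)
    return np.array(predictions), mistake_rounds, w

# Usage on synthetic linearly separable data (illustrative only).
rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(200, 3))
y = np.sign(X @ w_true)

preds, M, w_final = run_perceptron(X, y)
print("number of mistakes |M| =", len(M))

# For any eta > 0, the prediction on round t equals
# sign( eta * sum_{i in M, i < t} y_i <x_i, x_t> ), as derived above.
eta = 3.7
for t in range(len(X)):
    score = eta * sum(y[i] * (X[i] @ X[t]) for i in M if i < t)
    assert np.sign(score) == preds[t]

Because the loop adds $y_t x_t$ only on mistake rounds, the weight vector held before round $t$ is exactly $\sum_{i \in M : i < t} y_i x_i$, which is why the final check passes for any $\eta > 0$.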

To analyze the Perceptron, we rely on the analysis of Online Gradient Descent given in the previous section. In our case, the subgradient of $f_t$ we use in the Perceptron is $v_t = -\mathbb{1}_{[y_t \langle w^{(t)}, x_t \rangle \le 0]}\, y_t x_t$. Indeed, the Perceptron's update is $w^{(t+1)} = w^{(t)} - v_t$, and as discussed before this is equivalent to $w^{(t+1)} = w^{(t)} - \eta v_t$ for every $\eta > 0$. Therefore, Theorem 21.15 tells us that
\[
\sum_{t=1}^{T} f_t(w^{(t)}) - \sum_{t=1}^{T} f_t(w^\star) \le \frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}\sum_{t=1}^{T}\|v_t\|_2^2 .
\]

Since $f_t(w^{(t)})$ is a surrogate for the $0{-}1$ loss (it is zero on rounds with no mistake, and at least $1$ on every round in $M$), we know that

\[
\sum_{t=1}^{T} f_t(w^{(t)}) \ge |M| .
\]

Denote $R = \max_t \|x_t\|$. Since $v_t = 0$ on rounds outside $M$ and $\|v_t\| = \|x_t\| \le R$ on rounds in $M$, we have $\sum_{t=1}^{T}\|v_t\|_2^2 \le |M| R^2$, and we obtain

\[
|M| - \sum_{t=1}^{T} f_t(w^\star) \le \frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}|M| R^2 .
\]

Setting $\eta = \frac{\|w^\star\|}{R\sqrt{|M|}}$ and rearranging, we obtain

\[
|M| - R\,\|w^\star\|\sqrt{|M|} - \sum_{t=1}^{T} f_t(w^\star) \le 0 . \tag{21.6}
\]
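
For completeness, one can verify this step directly: with $\eta = \frac{\|w^\star\|}{R\sqrt{|M|}}$, the right-hand side of the previous bound becomes

\[
\frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}|M|R^2
= \frac{R\sqrt{|M|}}{2\|w^\star\|}\,\|w^\star\|_2^2 + \frac{\|w^\star\|}{2R\sqrt{|M|}}\,|M|R^2
= R\,\|w^\star\|\sqrt{|M|},
\]

so the bound reads $|M| - \sum_{t=1}^{T} f_t(w^\star) \le R\,\|w^\star\|\sqrt{|M|}$, which is (21.6) after rearranging. This choice of $\eta$ is in fact the minimizer of the right-hand side, since an expression of the form $\frac{a}{2\eta} + \frac{\eta b}{2}$ is minimized at $\eta = \sqrt{a/b}$; here $a = \|w^\star\|_2^2$ and $b = |M|R^2$.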

This inequality implies
