21.4 The Online Perceptron Algorithm
function and we can take $v_t = 0$. Otherwise, it is easy to verify that $v_t = -y_t x_t$
is in $\partial f_t(w^{(t)})$. We therefore obtain the update rule
$$
w^{(t+1)} =
\begin{cases}
w^{(t)} & \text{if } y_t \langle w^{(t)}, x_t \rangle > 0 \\
w^{(t)} + \eta\, y_t x_t & \text{otherwise.}
\end{cases}
$$
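As a concrete illustration, here is a minimal Python sketch of this subgradient choice and the resulting update. The function names are illustrative, and it assumes the surrogate $f_t$ defined on the previous page (identically zero when the margin is positive, a hinge otherwise):

```python
import numpy as np

def perceptron_subgradient(w, x, y):
    """Subgradient v_t of the surrogate f_t at w: zero when the
    margin y * <w, x> is positive, and -y * x otherwise."""
    if y * np.dot(w, x) > 0:
        return np.zeros_like(w)
    return -y * x

def perceptron_update(w, x, y, eta=1.0):
    """One step of the update rule: w^(t+1) = w^(t) - eta * v_t."""
    return w - eta * perceptron_subgradient(w, x, y)
```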
Denote by $M$ the set of rounds in which $\operatorname{sign}(\langle w^{(t)}, x_t \rangle) \neq y_t$. Note that on
round $t$, the prediction of the Perceptron can be rewritten as
$$
p_t = \operatorname{sign}(\langle w^{(t)}, x_t \rangle)
    = \operatorname{sign}\Bigl( \eta \sum_{i \in M : i < t} y_i \langle x_i, x_t \rangle \Bigr).
$$
This form implies that the predictions of the Perceptron algorithm and the set
$M$ do not depend on the actual value of $\eta$ as long as $\eta > 0$. We have therefore
obtained the Perceptron algorithm:
Perceptron
initialize: $w^{(1)} = 0$
for $t = 1, 2, \ldots, T$
  receive $x_t$
  predict $p_t = \operatorname{sign}(\langle w^{(t)}, x_t \rangle)$
  if $y_t \langle w^{(t)}, x_t \rangle \le 0$
    $w^{(t+1)} = w^{(t)} + y_t x_t$
  else
    $w^{(t+1)} = w^{(t)}$
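The pseudocode translates directly into Python. The sketch below (illustrative names, single pass over the data) also includes the rewritten "dual" form of the prediction from above, which can be used to confirm that the predictions are identical for every $\eta > 0$:

```python
import numpy as np

def perceptron(X, y):
    """Online Perceptron over a stream (x_t, y_t) with y_t in {-1, +1}.
    Returns the final weight vector and the list M of mistake rounds."""
    T, d = X.shape
    w = np.zeros(d)
    M = []
    for t in range(T):
        if y[t] * np.dot(w, X[t]) <= 0:   # mistake (or zero margin)
            w = w + y[t] * X[t]           # w^(t+1) = w^(t) + y_t x_t
            M.append(t)
    return w, M

def dual_predict(X, y, M, t, eta=1.0):
    """Prediction at round t in the rewritten form
    sign(eta * sum_{i in M, i < t} y_i <x_i, x_t>);
    the sign is unchanged for every eta > 0."""
    s = sum(y[i] * np.dot(X[i], X[t]) for i in M if i < t)
    return np.sign(eta * s)
```

Note that the rewritten form accesses the examples only through the inner products $\langle x_i, x_t \rangle$, which is the observation underlying kernelized variants of the Perceptron.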
To analyze the Perceptron, we rely on the analysis of Online Gradient Descent
given in the previous section. In our case, the subgradient of $f_t$ we use
in the Perceptron is $v_t = -\mathbb{1}_{[y_t \langle w^{(t)}, x_t \rangle \le 0]}\, y_t x_t$. Indeed, the Perceptron's update
is $w^{(t+1)} = w^{(t)} - v_t$, and as discussed before this is equivalent to
$w^{(t+1)} = w^{(t)} - \eta v_t$ for every $\eta > 0$. Therefore, Theorem 21.15 tells us that
$$
\sum_{t=1}^{T} f_t(w^{(t)}) - \sum_{t=1}^{T} f_t(w^\star)
\;\le\; \frac{1}{2\eta}\,\|w^\star\|_2^2 + \frac{\eta}{2} \sum_{t=1}^{T} \|v_t\|_2^2 .
$$
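As an empirical sanity check of this inequality, the following hedged sketch runs the scaled updates for a given $\eta$ and compares both sides numerically. It assumes the surrogate from the previous page is the hinge loss restricted to mistake rounds, i.e. $f_t(w) = \mathbb{1}_{[y_t \langle w^{(t)}, x_t \rangle \le 0]} \max\{0,\, 1 - y_t \langle w, x_t \rangle\}$, and takes an arbitrary fixed competitor $w^\star$ (the bound holds for every choice):

```python
import numpy as np

def regret_check(X, y, eta=1.0, w_star=None):
    """Verify  sum_t f_t(w^(t)) - sum_t f_t(w_star)
       <= ||w_star||^2 / (2*eta) + (eta/2) * sum_t ||v_t||^2."""
    T, d = X.shape
    if w_star is None:
        w_star = np.ones(d)              # any fixed competitor works
    w = np.zeros(d)
    lhs_alg = lhs_star = sq_norms = 0.0
    for t in range(T):
        if y[t] * np.dot(w, X[t]) <= 0:  # f_t is nonzero only on mistake rounds
            lhs_alg += max(0.0, 1 - y[t] * np.dot(w, X[t]))
            lhs_star += max(0.0, 1 - y[t] * np.dot(w_star, X[t]))
            v = -y[t] * X[t]             # the subgradient used above
            sq_norms += np.dot(v, v)
            w = w - eta * v              # OGD step; eta = 1 recovers the Perceptron
    rhs = np.dot(w_star, w_star) / (2 * eta) + (eta / 2) * sq_norms
    return (lhs_alg - lhs_star) <= rhs + 1e-9

# Example run on random linearly separable data:
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X @ rng.normal(size=5))
print(regret_check(X, y, eta=0.5))       # expected: True
```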
Since $f_t(w^{(t)})$ is a surrogate for the 0−1 loss (on every round in $M$ the margin $y_t \langle w^{(t)}, x_t \rangle$ is nonpositive, so $f_t(w^{(t)}) \ge 1$), we know that
$$
\sum_{t=1}^{T} f_t(w^{(t)}) \ge |M| .
$$
Denote $R = \max_t \|x_t\|$; since $v_t = 0$ on rounds outside $M$, while $\|v_t\| = \|x_t\| \le R$ on rounds in $M$, we have $\sum_{t=1}^{T} \|v_t\|_2^2 \le |M| R^2$, and we obtain
$$
|M| - \sum_{t=1}^{T} f_t(w^\star) \;\le\; \frac{1}{2\eta}\,\|w^\star\|_2^2 + \frac{\eta}{2}\,|M| R^2 .
$$
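The value of $\eta$ chosen next is the minimizer of the right-hand side; the calculation is standard and is spelled out here only for convenience. Writing $g(\eta) = \frac{1}{2\eta}\|w^\star\|_2^2 + \frac{\eta}{2}|M|R^2$, we have
$$
g'(\eta) = -\frac{\|w^\star\|_2^2}{2\eta^2} + \frac{|M| R^2}{2} = 0
\;\Longrightarrow\;
\eta = \frac{\|w^\star\|}{R \sqrt{|M|}} ,
$$
and at this $\eta$ both terms of $g$ equal $\frac{1}{2} R \|w^\star\| \sqrt{|M|}$, so the right-hand side becomes $R \|w^\star\| \sqrt{|M|}$.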
Setting $\eta = \frac{\|w^\star\|}{R \sqrt{|M|}}$ and rearranging, we obtain
$$
|M| - R \|w^\star\| \sqrt{|M|} - \sum_{t=1}^{T} f_t(w^\star) \;\le\; 0 . \tag{21.6}
$$
This inequality implies