17.2 Linear Multiclass Predictors
Consider running Multiclass SVM with $\lambda = \sqrt{\frac{2\rho^2}{B^2 m}}$ on a training set $S \sim \mathcal{D}^m$, and let $h_{\mathbf{w}}$ be the output of Multiclass SVM. Then,
\[
\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\Delta}(h_{\mathbf{w}}) \right]
\;\le\;
\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{w}) \right]
\;\le\;
\min_{\mathbf{u} : \|\mathbf{u}\| \le B} L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{u}) + \sqrt{\frac{8 \rho^2 B^2}{m}},
\]
where $L_{\mathcal{D}}^{\Delta}(h) = \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}[\Delta(h(\mathbf{x}), y)]$ and $L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{w}) = \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}[\ell^{\text{g-hinge}}(\mathbf{w}, (\mathbf{x}, y))]$, with $\ell^{\text{g-hinge}}$ being the generalized hinge loss as defined in Equation (17.3).
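For a concrete sense of these quantities (an illustrative instantiation; the numbers are not from the text), take $\rho = 1$, $B = 10$, and $m = 10{,}000$:
\[
\lambda = \sqrt{\frac{2 \rho^2}{B^2 m}} = \sqrt{\frac{2}{10^6}} \approx 1.4 \times 10^{-3},
\qquad
\sqrt{\frac{8 \rho^2 B^2}{m}} = \sqrt{\frac{800}{10^4}} \approx 0.28,
\]
so the excess generalized hinge loss shrinks at the rate $1/\sqrt{m}$.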
We can also apply the SGD learning framework for minimizing $L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{w})$, as described in Chapter 14. Recall Claim 14.6, which dealt with subgradients of max functions. In light of this claim, in order to find a subgradient of the generalized hinge loss all we need to do is find $y \in \mathcal{Y}$ that achieves the maximum in the definition of the generalized hinge loss. This yields the following algorithm:
SGD for Multiclass Learning
parameters:
    scalar $\eta > 0$, integer $T > 0$
    loss function $\Delta : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}_+$
    class-sensitive feature mapping $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$
initialize: $\mathbf{w}^{(1)} = \mathbf{0} \in \mathbb{R}^d$
for $t = 1, 2, \ldots, T$
    sample $(\mathbf{x}, y) \sim \mathcal{D}$
    find $\hat{y} \in \operatorname{argmax}_{y' \in \mathcal{Y}} \left( \Delta(y', y) + \langle \mathbf{w}^{(t)}, \Psi(\mathbf{x}, y') - \Psi(\mathbf{x}, y) \rangle \right)$
    set $\mathbf{v}_t = \Psi(\mathbf{x}, \hat{y}) - \Psi(\mathbf{x}, y)$
    update $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta \mathbf{v}_t$
output: $\bar{\mathbf{w}} = \frac{1}{T} \sum_{t=1}^{T} \mathbf{w}^{(t)}$
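As a minimal sketch of how this procedure might look in code (Python here; the callables sample, Psi, Delta and the label set labels are hypothetical user-supplied pieces, not from the text):

    import numpy as np

    def sgd_multiclass(sample, Psi, Delta, labels, d, eta, T):
        # sample() -> (x, y): draws one example (x, y) ~ D
        # Psi(x, y) -> length-d numpy array (class-sensitive features)
        # Delta(y_pred, y) -> nonnegative label loss
        # labels: the finite label set Y
        w = np.zeros(d)              # w^(1) = 0
        w_sum = np.zeros(d)          # accumulates w^(1) + ... + w^(T)
        for t in range(T):
            x, y = sample()
            # y_hat attains the max in the generalized hinge loss;
            # by Claim 14.6 this yields a subgradient at w^(t)
            y_hat = max(labels, key=lambda yp:
                        Delta(yp, y) + w @ (Psi(x, yp) - Psi(x, y)))
            v = Psi(x, y_hat) - Psi(x, y)   # subgradient v_t
            w_sum += w                      # record w^(t) before the update
            w = w - eta * v                 # w^(t+1) = w^(t) - eta * v_t
        return w_sum / T                    # averaged output w_bar

Prediction with the output then follows the usual linear rule $h_{\bar{\mathbf{w}}}(\mathbf{x}) = \operatorname{argmax}_{y \in \mathcal{Y}} \langle \bar{\mathbf{w}}, \Psi(\mathbf{x}, y) \rangle$.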
Our general analysis of SGD given in Corollary 14.12 immediately implies:
corollary 17.2  Let $\mathcal{D}$ be a distribution over $\mathcal{X} \times \mathcal{Y}$, let $\Psi : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}^d$, and assume that for all $\mathbf{x} \in \mathcal{X}$ and $y \in \mathcal{Y}$ we have $\|\Psi(\mathbf{x}, y)\| \le \rho/2$. Let $B > 0$. Then, for every $\epsilon > 0$, if we run SGD for multiclass learning with a number of iterations (i.e., number of examples)
\[
T \ge \frac{B^2 \rho^2}{\epsilon^2}
\]
and with $\eta = \sqrt{\frac{B^2}{\rho^2 T}}$, then the output of SGD satisfies
\[
\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\Delta}(h_{\bar{\mathbf{w}}}) \right]
\;\le\;
\mathbb{E}_{S \sim \mathcal{D}^m}\!\left[ L_{\mathcal{D}}^{\text{g-hinge}}(\bar{\mathbf{w}}) \right]
\;\le\;
\min_{\mathbf{u} : \|\mathbf{u}\| \le B} L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{u}) + \epsilon .
\]
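To illustrate the parameter choices (again with illustrative numbers not from the text): for $\rho = 1$, $B = 10$, and accuracy $\epsilon = 0.1$, the corollary calls for
\[
T \ge \frac{B^2 \rho^2}{\epsilon^2} = \frac{100}{0.01} = 10^4
\qquad \text{and} \qquad
\eta = \sqrt{\frac{B^2}{\rho^2 T}} = \sqrt{\frac{100}{10^4}} = 0.1 .
\]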
Remark 17.3  It is interesting to note that the risk bounds given in Corollary 17.1 and Corollary 17.2 do not depend explicitly on the size of the label set $\mathcal{Y}$, a fact we will rely on in the next section. However, the bounds may depend implicitly on the size of $\mathcal{Y}$ via the norm of $\Psi(\mathbf{x}, y)$ and the fact that the bounds are meaningful only when there exists some vector $\mathbf{u}$, $\|\mathbf{u}\| \le B$, for which $L_{\mathcal{D}}^{\text{g-hinge}}(\mathbf{u})$ is not excessively large.