Understanding Machine Learning: From Theory to Algorithms

(Jeff_L) #1
17.2 Linear Multiclass Predictors 235

Consider running Multiclass SVM withλ=



2 ρ^2
B^2 mon a training setS∼ D

m

and lethwbe the output of Multiclass SVM. Then,


E
S∼Dm

[L∆D(hw)] ≤ E
S∼Dm

[LgD−hinge(w)] ≤ min
u:‖u‖≤B

LgD−hinge(u) +


8 ρ^2 B^2
m

,

whereL∆D(h) =E(x,y)∼D[∆(h(x),y)]andLgD−hinge(w) =E(x,y)∼D[(w,(x,y))] withbeing the generalized hinge-loss as defined in Equation (17.3).


We can also apply the SGD learning framework for minimizingLgD−hinge(w) as
described in Chapter 14. Recall Claim 14.6, which dealt with subgradients of max
functions. In light of this claim, in order to find a subgradient of the generalized
hinge loss all we need to do is to findy∈ Ythat achieves the maximum in the
definition of the generalized hinge loss. This yields the following algorithm:


SGD for Multiclass Learning
parameters:
Scalarη >0, integerT > 0
loss function ∆ :Y ×Y →R+
class-sensitive feature mapping Ψ :X ×Y →Rd
initialize: w(1)= 0 ∈Rd
fort= 1, 2 ,...,T
sample (x, y)∼D
find ˆy∈argmaxy′∈Y

(

∆(y′,y) +〈w(t),Ψ(x,y′)−Ψ(x,y)〉

)

setvt= Ψ(x,yˆ)−Ψ(x,y)
updatew(t+1)=w(t)−ηvt
outputw ̄=T^1

∑T

t=1w(t)
Our general analysis of SGD given in Corollary 14.12 immediately implies:

corollary17.2 LetDbe a distribution overX ×Y, letΨ :X ×Y →Rd,
and assume that for allx∈ Xandy∈ Ywe have‖Ψ(x,y)‖ ≤ρ/ 2. LetB > 0.
Then, for every > 0 , if we run SGD for multiclass learning with a number of
iterations (i.e., number of examples)


T≥

B^2 ρ^2
^2

and withη=



B^2
ρ^2 T, then the output of SGD satisfies

S∼DEm[L∆D(hw ̄)] ≤ S∼DEm[LgD−hinge(w ̄)] ≤ min
u:‖u‖≤B

LgD−hinge(u) +.

Remark 17.3 It is interesting to note that the risk bounds given in Corol-
lary 17.1 and Corollary 17.2 do not depend explicitly on the size of the label
setY, a fact we will rely on in the next section. However, the bounds may de-
pend implicitly on the size ofYvia the norm of Ψ(x,y) and the fact that the
bounds are meaningful only when there exists some vectoru,‖u‖≤B, for which
LgD−hinge(u) is not excessively large.

Free download pdf