Understanding Machine Learning: From Theory to Algorithms


Multiclass, Ranking, and Complex Prediction Problems



  • For each y′ ≠ y, the difference between 〈w, Ψ(x,y)〉 and 〈w, Ψ(x,y′)〉 is larger
    than the loss of predicting y′ instead of y. The difference 〈w, Ψ(x,y)〉 −
    〈w, Ψ(x,y′)〉 is also referred to as the “margin” (see Section 15.1).
    This is illustrated in the following figure (and restated symbolically just after it):
    [Figure: the weight vector w together with the feature vectors Ψ(x,y), Ψ(x,y′),
    and Ψ(x,y′′); the gap between the projections of Ψ(x,y) and Ψ(x,y′) onto w is at
    least ∆(y,y′), and the gap to Ψ(x,y′′) is at least ∆(y,y′′).]
17.2.5 Multiclass SVM and SGD


Once we have defined the generalized hinge loss, we obtain a convex-Lipschitz
learning problem and we can apply our general techniques for solving such
problems. In particular, the RLM technique we have studied in Chapter 13 yields
the multiclass SVM rule:
Multiclass SVM
input: (x₁, y₁), ..., (xₘ, yₘ)
parameters:
    regularization parameter λ > 0
    loss function ∆ : Y × Y → R+
    class-sensitive feature mapping Ψ : X × Y → R^d
solve:
    min_{w ∈ R^d}  [ λ‖w‖² + (1/m) ∑_{i=1}^{m} max_{y′ ∈ Y} ( ∆(y′, yᵢ) + 〈w, Ψ(xᵢ, y′) − Ψ(xᵢ, yᵢ)〉 ) ]
output: the predictor h_w(x) = argmax_{y ∈ Y} 〈w, Ψ(x, y)〉
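To make the rule above concrete, here is a minimal Python sketch of the predictor h_w and of the regularized objective being minimized. The block feature map psi below (which stacks x into the coordinate block of its label, so that d = nk) is only one hypothetical choice of Ψ, and the names psi, predict, and multiclass_svm_objective are illustrative rather than taken from the text.

import numpy as np

# Hypothetical class-sensitive feature map: labels are 0, ..., k-1 and
# Psi(x, y) places x in the y-th block of an (n*k)-dimensional vector.
def psi(x, y, k):
    n = len(x)
    out = np.zeros(n * k)
    out[y * n:(y + 1) * n] = x
    return out

def predict(w, x, k):
    # h_w(x) = argmax_{y in Y} <w, Psi(x, y)>
    return max(range(k), key=lambda y: w @ psi(x, y, k))

def multiclass_svm_objective(w, X, Y, delta, k, lam):
    # lam*||w||^2 + (1/m) * sum_i max_{y'} ( delta(y', y_i)
    #                                        + <w, Psi(x_i, y') - Psi(x_i, y_i)> )
    m = len(X)
    hinge_sum = 0.0
    for x, y in zip(X, Y):
        hinge_sum += max(delta(yp, y) + w @ (psi(x, yp, k) - psi(x, y, k))
                         for yp in range(k))
    return lam * (w @ w) + hinge_sum / m

For instance, taking delta = lambda yp, y: float(yp != y) recovers the 0-1 loss, in which case the inner maximum is the usual multiclass hinge loss with a margin of 1.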

We can solve the optimization problem associated with multiclass SVM using
generic convex optimization algorithms (or using the method described in
Section 15.5).
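For concreteness, one such method is stochastic subgradient descent on the regularized objective: at each step, pick an example, find the label attaining the inner maximum, and move against the corresponding subgradient. The sketch below reuses the hypothetical psi feature map from the previous snippet; the constant step size and the epoch-based loop are placeholder choices, not a procedure prescribed by the text.

import numpy as np  # psi(x, y, k) is the hypothetical block feature map defined earlier

def sgd_step(w, x, y, delta, k, lam, eta):
    # Label attaining the maximum inside the generalized hinge loss at the current w.
    y_hat = max(range(k),
                key=lambda yp: delta(yp, y) + w @ (psi(x, yp, k) - psi(x, y, k)))
    # Subgradient of  lam*||w||^2 + max_{y'}( delta(y', y) + <w, Psi(x, y') - Psi(x, y)> ).
    g = 2 * lam * w + (psi(x, y_hat, k) - psi(x, y, k))
    return w - eta * g

def train(X, Y, delta, k, n, lam=0.1, eta=0.01, epochs=20):
    # Simple passes over the data; a decaying step size or averaging of the
    # iterates could be used instead (again, illustrative choices only).
    w = np.zeros(n * k)
    for _ in range(epochs):
        for x, y in zip(X, Y):
            w = sgd_step(w, x, y, delta, k, lam, eta)
    return w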
Let us analyze the risk of the resulting hypothesis. The analysis seamlessly
follows from our general analysis for convex-Lipschitz problems given in
Chapter 13. In particular, applying Corollary 13.8 and using the fact that the
generalized hinge loss upper bounds the ∆ loss, we immediately obtain an analog
of Corollary 15.7:

Corollary 17.1  Let D be a distribution over X × Y, let Ψ : X × Y → R^d, and
assume that for all x ∈ X and y ∈ Y we have ‖Ψ(x,y)‖ ≤ ρ/2. Let B > 0.