Understanding Machine Learning: From Theory to Algorithms


230 Multiclass, Ranking, and Complex Prediction Problems


that even though the approximation error of the class of predictors of the form
h(x) = argmax_i 〈w_i, x〉 is zero, the One-versus-All approach might fail to find a
good predictor from this class.

17.2 Linear Multiclass Predictors


In light of the inadequacy of reduction methods, in this section we study a more
direct approach for learning multiclass predictors. We describe the family of
linear multiclass predictors. To motivate the construction of this family, recall
that a linear predictor for binary classification (i.e., a halfspace) takes the form
h(x) = sign(〈w,x〉).

An equivalent way to express the prediction is as follows:

h(x) = argmax_{y ∈ {±1}} 〈w, yx〉,

where yx is the vector obtained by multiplying each element of x by y.
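This equivalence is easy to check numerically. The sketch below compares the two forms on one arbitrary example pair (the particular vectors w and x are illustrative choices, not from the text):

```python
def dot(u, v):
    """Inner product 〈u, v〉 of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Arbitrary example weight vector and instance.
w = [0.5, -1.2, 0.3]
x = [2.0, 0.4, -1.0]

# Standard halfspace form: h(x) = sign(〈w, x〉).
h_sign = 1 if dot(w, x) >= 0 else -1

# Equivalent argmax form: h(x) = argmax over y in {+1, -1} of 〈w, yx〉.
h_argmax = max([1, -1], key=lambda y: dot(w, [y * xi for xi in x]))

print(h_sign == h_argmax)  # True
```

The two forms agree because 〈w, yx〉 = y〈w, x〉, so the maximizing y is exactly the sign of 〈w, x〉 (when 〈w, x〉 = 0, either convention for breaking the tie is consistent with sign conventions).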
This representation leads to a natural generalization of halfspaces to multiclass
problems as follows. Let Ψ : X × Y → R^d be a class-sensitive feature mapping.
That is, Ψ takes as input a pair (x, y) and maps it into a d-dimensional feature
vector. Intuitively, we can think of the elements of Ψ(x, y) as score functions that
assess how well the label y fits the instance x. We will elaborate on Ψ later on.
Given Ψ and a vector w ∈ R^d, we can define a multiclass predictor, h : X → Y,
as follows:
h(x) = argmax_{y ∈ Y} 〈w, Ψ(x, y)〉.
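A minimal sketch of such a predictor, assuming for illustration a simple hypothetical Ψ that places x into the coordinate block associated with the label y (so that 〈w, Ψ(x, y)〉 reduces to the score of label y's own weight block):

```python
def dot(u, v):
    """Inner product 〈u, v〉 of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Assumed sizes for the example: 3 labels, instances in R^2,
# so Psi maps into R^(3*2) = R^6.
N_LABELS, DIM = 3, 2

def psi(x, y):
    """Hypothetical class-sensitive feature mapping:
    copy x into the block of coordinates belonging to label y."""
    out = [0.0] * (N_LABELS * DIM)
    out[y * DIM:(y + 1) * DIM] = x
    return out

def h(w, x):
    """Linear multiclass predictor: the label with the highest
    weighted score 〈w, Psi(x, y)〉."""
    return max(range(N_LABELS), key=lambda y: dot(w, psi(x, y)))

w = [0.2, -0.1,  1.0, 0.5,  -0.3, 0.8]  # one weight block per label
x = [1.0, 2.0]
print(h(w, x))  # prints 1: label 1's block scores 1.0*1 + 0.5*2 = 2.0, the maximum
```

With this block-structured Ψ, the predictor amounts to keeping one weight vector per label and picking the label whose weight vector scores x highest; other choices of Ψ give genuinely different predictors, which is the subject of the next subsection.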

That is, the prediction of h for the input x is the label that achieves the highest
weighted score, where weighting is according to the vector w.
Let W be some set of vectors in R^d, for example, W = {w ∈ R^d : ‖w‖ ≤ B}
for some scalar B > 0. Each pair (Ψ, W) defines a hypothesis class of multiclass
predictors:
H_{Ψ,W} = {x ↦ argmax_{y ∈ Y} 〈w, Ψ(x, y)〉 : w ∈ W}.

Of course, the immediate question, which we discuss in the sequel, is how to
construct a good Ψ. Note that if Y = {±1} and we set Ψ(x, y) = yx and
W = R^d, then H_{Ψ,W} becomes the hypothesis class of homogeneous halfspace
predictors for binary classification.

17.2.1 How to Construct Ψ


As mentioned before, we can think of the elements of Ψ(x, y) as score functions
that assess how well the label y fits the instance x. Naturally, designing a good Ψ
is similar to the problem of designing a good feature mapping (as we discussed in