Understanding Machine Learning: From Theory to Algorithms


230 Multiclass, Ranking, and Complex Prediction Problems


that even though the approximation error of the class of predictors of the form
h(x) = argmax_i 〈w_i, x〉 is zero, the One-versus-All approach might fail to find a
good predictor from this class.

17.2 Linear Multiclass Predictors


In light of the inadequacy of reduction methods, in this section we study a more
direct approach for learning multiclass predictors. We describe the family of
linear multiclass predictors. To motivate the construction of this family, recall
that a linear predictor for binary classification (i.e., a halfspace) takes the form
h(x) = sign(〈w,x〉).

An equivalent way to express the prediction is as follows:

h(x) = argmax_{y ∈ {±1}} 〈w, yx〉,

where yx is the vector obtained by multiplying each element of x by y.
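This equivalence is easy to check numerically. The sketch below compares the two forms on one arbitrary example pair (the particular vectors w and x are illustrative choices, not from the text):

```python
def dot(u, v):
    """Inner product 〈u, v〉 of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Arbitrary example weight vector and instance.
w = [0.5, -1.2, 0.3]
x = [2.0, 0.4, -1.0]

# Standard halfspace form: h(x) = sign(〈w, x〉).
h_sign = 1 if dot(w, x) >= 0 else -1

# Equivalent argmax form: h(x) = argmax over y in {+1, -1} of 〈w, yx〉.
h_argmax = max([1, -1], key=lambda y: dot(w, [y * xi for xi in x]))

print(h_sign == h_argmax)  # True
```

The two forms agree because 〈w, yx〉 = y〈w, x〉, so the maximizing y is exactly the sign of 〈w, x〉 (when 〈w, x〉 = 0, either convention for breaking the tie is consistent with sign conventions).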
This representation leads to a natural generalization of halfspaces to multiclass
problems as follows. Let Ψ : X × Y → R^d be a class-sensitive feature mapping.
That is, Ψ takes as input a pair (x, y) and maps it into a d-dimensional feature
vector. Intuitively, we can think of the elements of Ψ(x, y) as score functions that
assess how well the label y fits the instance x. We will elaborate on Ψ later on.
Given Ψ and a vector w ∈ R^d, we can define a multiclass predictor, h : X → Y,
as follows:
h(x) = argmax_{y ∈ Y} 〈w, Ψ(x, y)〉.
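A minimal sketch of such a predictor, assuming for illustration a simple hypothetical Ψ that places x into the coordinate block associated with the label y (so that 〈w, Ψ(x, y)〉 reduces to the score of label y's own weight block):

```python
def dot(u, v):
    """Inner product 〈u, v〉 of two equal-length vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))

# Assumed sizes for the example: 3 labels, instances in R^2,
# so Psi maps into R^(3*2) = R^6.
N_LABELS, DIM = 3, 2

def psi(x, y):
    """Hypothetical class-sensitive feature mapping:
    copy x into the block of coordinates belonging to label y."""
    out = [0.0] * (N_LABELS * DIM)
    out[y * DIM:(y + 1) * DIM] = x
    return out

def h(w, x):
    """Linear multiclass predictor: the label with the highest
    weighted score 〈w, Psi(x, y)〉."""
    return max(range(N_LABELS), key=lambda y: dot(w, psi(x, y)))

w = [0.2, -0.1,  1.0, 0.5,  -0.3, 0.8]  # one weight block per label
x = [1.0, 2.0]
print(h(w, x))  # prints 1: label 1's block scores 1.0*1 + 0.5*2 = 2.0, the maximum
```

With this block-structured Ψ, the predictor amounts to keeping one weight vector per label and picking the label whose weight vector scores x highest; other choices of Ψ give genuinely different predictors, which is the subject of the next subsection.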

That is, the prediction of h for the input x is the label that achieves the highest
weighted score, where weighting is according to the vector w.
Let W be some set of vectors in R^d, for example, W = {w ∈ R^d : ‖w‖ ≤ B}
for some scalar B > 0. Each pair (Ψ, W) defines a hypothesis class of multiclass
predictors:
H_{Ψ,W} = {x ↦ argmax_{y ∈ Y} 〈w, Ψ(x, y)〉 : w ∈ W}.

Of course, the immediate question, which we discuss in the sequel, is how to
construct a good Ψ. Note that if Y = {±1} and we set Ψ(x, y) = yx and
W = R^d, then H_{Ψ,W} becomes the hypothesis class of homogeneous halfspace
predictors for binary classification.

17.2.1 How to Construct Ψ


As mentioned before, we can think of the elements of Ψ(x, y) as score functions
that assess how well the label y fits the instance x. Naturally, designing a good Ψ
is similar to the problem of designing a good feature mapping (as we discussed in