
Once we have defined b as in Equation (17.13), we can easily derive a convex
surrogate loss as follows. Assuming that y ∈ V, we have that

\[
\begin{aligned}
\Delta(h_{w}(\bar{x}), y) &= \Delta\big(b(h_{w}(\bar{x})), y\big) \\
&\le \Delta\big(b(h_{w}(\bar{x})), y\big) + \sum_{i=1}^{r} \big(b_i(h_{w}(\bar{x})) - y_i\big)\,\langle w, x_i\rangle \\
&\le \max_{v \in V}\Big[\Delta(v, y) + \sum_{i=1}^{r} (v_i - y_i)\,\langle w, x_i\rangle\Big]. \qquad (17.14)
\end{aligned}
\]

The right-hand side is a convex function with respect to w.
We can now solve the learning problem using SGD as described in Section 17.2.5.
The main computational bottleneck is calculating a subgradient of the loss
function, which is equivalent to finding v that achieves the maximum in
Equation (17.14) (see Claim 14.6).
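In particular (this restatement is ours, based on Claim 14.6, which allows taking
a subgradient of a pointwise maximum from any term attaining it), if v⋆ attains
the maximum in Equation (17.14), then

\[
\sum_{i=1}^{r} \big(v^{\star}_i - y_i\big)\, x_i
\]

is a subgradient of the right-hand side of Equation (17.14) at w, which can then
be used in the SGD update.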
In the following we describe how to find this maximizer efficiently for any
performance measure that can be written as a function of the numbers a, b, c, d
given in Equation (17.11), and for which the set V contains all elements in
{±1}^r for which the values of a, b satisfy some constraints. For example, for
“recall at k” the set V is all vectors for which a + b ≤ k.
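A concrete, purely illustrative rendering of this family in code: the sketch
below computes the counts a, b, c, d from a prediction vector v and the true
labels y, and implements a recall-at-k style Δ together with its constraint
a + b ≤ k. The function names (error_counts, recall_at_k_loss,
recall_at_k_constraint) and the exact formula for Δ are assumptions made here
for illustration, since Equation (17.11) is not reproduced in this excerpt; the
same interface (a Δ of the counts plus a predicate on a, b) is reused in the
search sketch at the end of the section.

    # Illustrative only: error counts and a "recall at k"-style measure.
    # The identification a = true positives, b = false positives,
    # c = false negatives, d = true negatives is an assumption consistent
    # with the definition of Y-bar_{a,b} below.

    def error_counts(v, y):
        """Return (a, b, c, d) for prediction v and labels y, both in {+1, -1}^r."""
        a = sum(1 for vi, yi in zip(v, y) if vi == 1 and yi == 1)    # true positives
        b = sum(1 for vi, yi in zip(v, y) if vi == 1 and yi == -1)   # false positives
        c = sum(1 for vi, yi in zip(v, y) if vi == -1 and yi == 1)   # false negatives
        d = sum(1 for vi, yi in zip(v, y) if vi == -1 and yi == -1)  # true negatives
        return a, b, c, d

    def recall_at_k_loss(a, b, c, d):
        """Hypothetical Delta: one minus recall, i.e. 1 - a / (a + c)."""
        return 0.0 if a + c == 0 else 1.0 - a / (a + c)

    def recall_at_k_constraint(a, b, k=5):
        """Membership test for V: at most k examples may be predicted positive."""
        return a + b <= k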
The idea is as follows. For any a, b ∈ [r], let

\[
\bar{Y}_{a,b} = \big\{ v : \ |\{ i : v_i = 1 \wedge y_i = 1 \}| = a \ \wedge\ |\{ i : v_i = 1 \wedge y_i = -1 \}| = b \big\}.
\]

Any vector v ∈ V falls into Ȳ_{a,b} for some a, b ∈ [r]. Furthermore, if Ȳ_{a,b} ∩ V
is not empty for some a, b ∈ [r], then Ȳ_{a,b} ∩ V = Ȳ_{a,b}. Therefore, we can search
within each Ȳ_{a,b} that has a nonempty intersection with V separately, and then
take the optimal value. The key observation is that once we are searching only
within Ȳ_{a,b}, the value of ∆ is fixed, so we only need to maximize the expression

\[
\max_{v \in \bar{Y}_{a,b}} \ \sum_{i=1}^{r} v_i \,\langle w, x_i\rangle .
\]

Suppose the examples are sorted so that ⟨w, x_1⟩ ≥ ⋯ ≥ ⟨w, x_r⟩. Then, it is
easy to verify that we would like to set v_i to be positive for the smallest indices
i. Doing this, with the constraint on a, b, amounts to setting v_i = 1 for the a
top-ranked positive examples and for the b top-ranked negative examples. This
yields the following procedure.
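The procedure itself is not included in this excerpt. As a substitute, here is a
minimal Python sketch of the search just described, under the assumptions that
Δ is supplied as a function of the counts (a, b, c, d) and that membership in V
is a predicate on (a, b); the names find_maximizer, delta_fn, and in_V are ours,
not the book's.

    import numpy as np

    def find_maximizer(w, X, y, delta_fn, in_V):
        """Return v in V maximizing Delta(v, y) + sum_i (v_i - y_i) * <w, x_i>.

        w        -- weight vector, shape (d,)
        X        -- instance matrix, shape (r, d), with rows x_1, ..., x_r
        y        -- true labels, a vector in {+1, -1}^r
        delta_fn -- Delta written as a function of the counts (a, b, c, d)
        in_V     -- membership test for V, a predicate on (a, b)
        """
        y = np.asarray(y)
        scores = X @ w                                  # <w, x_i> for each i
        pos = np.where(y == 1)[0]                       # positive examples
        neg = np.where(y == -1)[0]                      # negative examples
        # Sort each group by decreasing score, so "top-ranked" means first.
        pos = pos[np.argsort(-scores[pos])]
        neg = neg[np.argsort(-scores[neg])]
        # Prefix sums of the sorted scores: prefix[a] = sum of the a largest.
        pos_prefix = np.concatenate(([0.0], np.cumsum(scores[pos])))
        neg_prefix = np.concatenate(([0.0], np.cumsum(scores[neg])))
        total = scores.sum()
        const = -float(y @ scores)                      # -sum_i y_i <w, x_i>, constant in v

        best_val, best_ab = -np.inf, None
        for a in range(len(pos) + 1):                   # a = number of true positives
            c = len(pos) - a                            # false negatives
            for b in range(len(neg) + 1):               # b = number of false positives
                if not in_V(a, b):
                    continue                            # Y-bar_{a,b} does not intersect V
                d = len(neg) - b                        # true negatives
                # Best v in Y-bar_{a,b}: v_i = 1 on the a top-ranked positives and
                # the b top-ranked negatives, v_i = -1 elsewhere, hence
                # sum_i v_i <w, x_i> = 2 * (pos_prefix[a] + neg_prefix[b]) - total.
                val = (delta_fn(a, b, c, d)
                       + 2.0 * (pos_prefix[a] + neg_prefix[b]) - total + const)
                if val > best_val:
                    best_val, best_ab = val, (a, b)

        a, b = best_ab
        v = -np.ones(len(y))
        v[pos[:a]] = 1.0
        v[neg[:b]] = 1.0
        return v, best_val

With delta_fn and in_V instantiated as in the earlier sketch (for example, the
recall-at-k loss with the constraint a + b ≤ k), the returned v can be plugged
into the subgradient ∑_i (v_i − y_i) x_i to perform the SGD update.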