Once we have defined $b$ as in Equation (17.13), we can easily derive a convex surrogate loss as follows. Assuming that $y \in V$, we have that
$$
\begin{aligned}
\Delta(h_{\mathbf{w}}(\bar{\mathbf{x}}), y) &= \Delta\bigl(b(h_{\mathbf{w}}(\bar{\mathbf{x}})), y\bigr) \\
&\le \Delta\bigl(b(h_{\mathbf{w}}(\bar{\mathbf{x}})), y\bigr) + \sum_{i=1}^{r} \bigl(b_i(h_{\mathbf{w}}(\bar{\mathbf{x}})) - y_i\bigr)\langle \mathbf{w}, \mathbf{x}_i\rangle \\
&\le \max_{v \in V}\left[\Delta(v, y) + \sum_{i=1}^{r} (v_i - y_i)\langle \mathbf{w}, \mathbf{x}_i\rangle\right],
\end{aligned}
\tag{17.14}
$$
where the first inequality holds because, by the definition of $b$ in Equation (17.13), $b(h_{\mathbf{w}}(\bar{\mathbf{x}}))$ maximizes $\sum_{i=1}^{r} v_i \langle \mathbf{w}, \mathbf{x}_i\rangle$ over $v \in V$ and $y \in V$, so the added sum is nonnegative. The right-hand side of Equation (17.14) is a maximum of linear functions of $\mathbf{w}$ and is therefore a convex function with respect to $\mathbf{w}$.
We can now solve the learning problem using SGD, as described in Section 17.2.5. The main computational bottleneck is calculating a subgradient of the loss function, which is equivalent to finding the $v$ that achieves the maximum in Equation (17.14) (see Claim 14.6).
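As an illustration, here is a minimal NumPy sketch of this step: by Claim 14.6, any $v$ attaining the maximum in Equation (17.14) yields a subgradient with respect to $\mathbf{w}$. The name `find_maximizer` is a placeholder for a routine that returns such a maximizing $v$ together with $\Delta(v, y)$; a sketch of one such routine is given after the procedure below.

```python
import numpy as np

def surrogate_loss_and_subgradient(w, X, y, find_maximizer):
    """Sketch: value and a subgradient (w.r.t. w) of the loss in Equation (17.14).

    X is an (r, d) matrix whose rows are x_1, ..., x_r; y is a vector in {+1, -1}^r;
    find_maximizer(scores, y) is assumed to return a maximizing v in V and Delta(v, y).
    """
    scores = X @ w                              # scores[i] = <w, x_i>
    v, delta_v = find_maximizer(scores, y)      # v attaining the max in (17.14)
    loss = delta_v + np.dot(v - y, scores)      # Delta(v, y) + sum_i (v_i - y_i) <w, x_i>
    subgrad = X.T @ (v - y)                     # sum_i (v_i - y_i) x_i  (Claim 14.6)
    return loss, subgrad
```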
In the following, we describe how to find this maximizer efficiently for any performance measure that can be written as a function of the numbers $a, b, c, d$ given in Equation (17.11), and for which the set $V$ contains all elements of $\{\pm 1\}^r$ whose values of $a, b$ satisfy some constraints. For example, for "recall at $k$," the set $V$ consists of all vectors for which $a + b \le k$.
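For instance, the constraint defining $V$ for "recall at $k$" can be expressed as a simple predicate on $(a, b)$; a hypothetical helper of this form is used in the search sketch given after the procedure below.

```python
def feasible_recall_at_k(a, b, k):
    """V for "recall at k": at most k labels may be predicted positive, i.e. a + b <= k."""
    return a + b <= k
```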
The idea is as follows. For any $a, b \in [r]$, let
$$
\bar{Y}_{a,b} = \bigl\{ v : |\{i : v_i = 1 \wedge y_i = 1\}| = a \ \wedge\ |\{i : v_i = 1 \wedge y_i = -1\}| = b \bigr\}.
$$
Any vector $v \in V$ falls into $\bar{Y}_{a,b}$ for some $a, b \in [r]$. Furthermore, if $\bar{Y}_{a,b} \cap V$ is not empty for some $a, b \in [r]$, then $\bar{Y}_{a,b} \cap V = \bar{Y}_{a,b}$. Therefore, we can search within each $\bar{Y}_{a,b}$ that has a nonempty intersection with $V$ separately, and then take the optimal value. The key observation is that once we are searching only within $\bar{Y}_{a,b}$, the value of $\Delta$ is fixed, so we only need to maximize the expression
$$
\max_{v \in \bar{Y}_{a,b}} \sum_{i=1}^{r} v_i \langle \mathbf{w}, \mathbf{x}_i \rangle .
$$
Suppose the examples are sorted so that $\langle \mathbf{w}, \mathbf{x}_1 \rangle \ge \cdots \ge \langle \mathbf{w}, \mathbf{x}_r \rangle$. Then it is easy to verify that we would like to set $v_i$ to be positive for the smallest indices $i$. Doing this, subject to the constraints on $a$ and $b$, amounts to setting $v_i = 1$ for the $a$ top-ranked positive examples and for the $b$ top-ranked negative examples. This yields the following procedure.
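A minimal Python sketch of such a search (not the exact pseudocode of the procedure, but the same idea) is given below. It assumes the performance measure is supplied as a function `delta(a, b, c, d)` of the quantities in Equation (17.11) and the constraints defining $V$ as a predicate `feasible(a, b)`; both names are illustrative.

```python
import numpy as np

def find_maximizer(scores, y, delta, feasible):
    """Search over (a, b) for a v in V maximizing Delta(v, y) + sum_i (v_i - y_i) <w, x_i>.

    scores[i] = <w, x_i>; y is a vector in {+1, -1}^r; delta(a, b, c, d) evaluates
    the performance measure; feasible(a, b) encodes the constraints defining V.
    Returns the maximizing v together with Delta(v, y).
    """
    scores = np.asarray(scores, dtype=float)
    y = np.asarray(y, dtype=float)
    order = np.argsort(-scores)                    # indices sorted by <w, x_i>, descending
    pos = [i for i in order if y[i] == 1]          # positive examples, top-ranked first
    neg = [i for i in order if y[i] == -1]         # negative examples, top-ranked first
    P, N = len(pos), len(neg)

    best_val, best_v, best_delta = -np.inf, None, None
    for a in range(P + 1):                         # a = number of true positives in v
        for b in range(N + 1):                     # b = number of false positives in v
            if not feasible(a, b):
                continue                           # Y_{a,b} has empty intersection with V
            v = -np.ones_like(y)
            v[pos[:a]] = 1.0                       # the a top-ranked positive examples
            v[neg[:b]] = 1.0                       # the b top-ranked negative examples
            dv = delta(a, b, P - a, N - b)         # c = P - a, d = N - b as in (17.11)
            val = dv + np.dot(v - y, scores)
            if val > best_val:
                best_val, best_v, best_delta = val, v, dv
    return best_v, best_delta
```

For example, fixing `delta` and `feasible` (say, `feasible_recall_at_k` from above with a chosen $k$, e.g. via `functools.partial`) turns this into the `find_maximizer` routine assumed in the SGD sketch given earlier. As written, the sketch evaluates $O(r^2)$ candidate pairs $(a, b)$, each in $O(r)$ time; incremental updates could speed this up but are omitted for clarity.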