Understanding Machine Learning: From Theory to Algorithms


The following claim states that every doubly stochastic matrix is a convex
combination of permutation matrices.
Claim 17.3 (Birkhoff 1946; Von Neumann 1953) The set of doubly stochastic matrices in ℝ^{r,r} is the convex hull of the set of permutation matrices in ℝ^{r,r}.
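As a concrete illustration of Claim 17.3 (not part of the text), here is a minimal sketch of a greedy Birkhoff decomposition: repeatedly find a permutation supported on the positive entries, subtract off as much of it as possible, and repeat. The function name `birkhoff_decompose` and the brute-force permutation search are this sketch's own choices, suitable only for small r.

```python
import itertools
import numpy as np

def birkhoff_decompose(B, tol=1e-9):
    """Greedily write a doubly stochastic matrix B as sum_k gamma_k * P_k,
    where each P_k is a permutation matrix and the gamma_k sum to 1.
    Brute-forces the permutation search, so only suitable for small r."""
    B = B.astype(float).copy()
    r = B.shape[0]
    terms = []
    while B.max() > tol:
        # Find a permutation supported on the positive entries of B;
        # one must exist by Birkhoff's theorem (via Hall's condition).
        sigma = next(p for p in itertools.permutations(range(r))
                     if all(B[i, p[i]] > tol for i in range(r)))
        gamma = min(B[i, sigma[i]] for i in range(r))
        P = np.zeros((r, r))
        P[range(r), sigma] = 1.0
        terms.append((gamma, P))
        B -= gamma * P  # residual row/column sums drop by gamma each round
    return terms

# Example: an r = 3 doubly stochastic matrix.
B = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.5],
              [0.0, 0.5, 0.5]])
terms = birkhoff_decompose(B)
recovered = sum(g * P for g, P in terms)
```

Each round zeroes at least one entry of the residual, so the loop makes at most r² − r + 1 passes before the matrix is exhausted.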
On the basis of the claim, we easily obtain the following:

Lemma 17.4 There exists an optimal solution of Equation (17.10) that is also an optimal solution of Equation (17.9).
Proof Let B be a solution of Equation (17.10). Then, by Claim 17.3, we can write B = ∑_i γ_i C_i, where each C_i is a permutation matrix, each γ_i > 0, and ∑_i γ_i = 1. Since all the C_i are also doubly stochastic, we clearly have that ⟨A, B⟩ ≤ ⟨A, C_i⟩ for every i. We claim that there is some i for which ⟨A, B⟩ = ⟨A, C_i⟩. This must be true since otherwise, if for every i, ⟨A, B⟩ < ⟨A, C_i⟩, we would have that

⟨A, B⟩ = ⟨A, ∑_i γ_i C_i⟩ = ∑_i γ_i ⟨A, C_i⟩ > ∑_i γ_i ⟨A, B⟩ = ⟨A, B⟩,

which cannot hold. We have thus shown that some permutation matrix, C_i, satisfies ⟨A, B⟩ = ⟨A, C_i⟩. But, since for every other permutation matrix C we have ⟨A, B⟩ ≤ ⟨A, C⟩, we conclude that C_i is an optimal solution of both Equation (17.9) and Equation (17.10).
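Lemma 17.4 can be checked numerically (a sketch of this section's own making, assuming, as in the proof, that Equations (17.9) and (17.10) minimize ⟨A, B⟩ over permutation matrices and doubly stochastic matrices respectively): any convex combination of permutation matrices scores at least as high as the best permutation, so relaxing to the Birkhoff polytope cannot improve on the permutation optimum.

```python
import itertools
import random

def inner(A, B):
    """Frobenius inner product <A, B> = sum_ij A_ij * B_ij."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def perm_matrix(sigma, r):
    """Permutation matrix with a 1 in position (i, sigma(i))."""
    return [[1.0 if sigma[i] == j else 0.0 for j in range(r)] for i in range(r)]

random.seed(0)
r = 4
A = [[random.random() for _ in range(r)] for _ in range(r)]

# Minimum of <A, C> over all permutation matrices C.
perms = list(itertools.permutations(range(r)))
Pmats = [perm_matrix(s, r) for s in perms]
best = min(inner(A, P) for P in Pmats)

# A random doubly stochastic B, built as a convex combination of
# permutation matrices with weights gamma_i > 0 summing to 1.
weights = [random.random() + 1e-6 for _ in perms]
total = sum(weights)
gammas = [w / total for w in weights]
B = [[sum(g * P[i][j] for g, P in zip(gammas, Pmats)) for j in range(r)]
     for i in range(r)]

# <A, B> = sum_i gamma_i <A, C_i> >= min_i <A, C_i> >= best,
# exactly the inequality used in the proof of Lemma 17.4.
assert inner(A, B) >= best - 1e-12
```

Brute-forcing all r! permutations is only feasible for tiny r; the point here is the inequality, not an efficient solver.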

17.5 Bipartite Ranking and Multivariate Performance Measures


In the previous section we described the problem of ranking. We used a vector y ∈ ℝ^r for representing an order over the elements x_1, …, x_r. If all elements in y are different from each other, then y specifies a full order over [r]. However, if two elements of y attain the same value, y_i = y_j for i ≠ j, then y can only specify a partial order over [r]. In such a case, we say that x_i and x_j are of equal relevance according to y. In the extreme case, y ∈ {±1}^r, which means that each x_i is either relevant or nonrelevant. This setting is often called “bipartite ranking.” For example, in the fraud detection application mentioned in the previous section, each transaction is labeled as either fraudulent (y_i = 1) or benign (y_i = −1).
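The distinction between full and partial orders can be made concrete with a small sketch (the helper `preferred_pairs` is this illustration's own name): y prefers element i over element j exactly when y_i > y_j, and ties contribute no preference.

```python
def preferred_pairs(y):
    """Pairs (i, j) such that y ranks element i strictly above element j.
    Ties (y[i] == y[j]) induce no preference, so the order may be partial."""
    r = len(y)
    return {(i, j) for i in range(r) for j in range(r) if y[i] > y[j]}

# A full order: all values distinct, so every pair is comparable.
assert len(preferred_pairs([3, 1, 2])) == 3

# Bipartite ranking: y in {+1, -1}^r; the only preferences are
# "relevant above nonrelevant", and same-label pairs are incomparable.
pairs = preferred_pairs([1, -1, -1, 1])
assert pairs == {(0, 1), (0, 2), (3, 1), (3, 2)}
```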
Seemingly, we can solve the bipartite ranking problem by learning a binary
classifier, applying it on each instance, and putting the positive ones at the top
of the ranked list. However, this may lead to poor results as the goal of a binary
learner is usually to minimize the zero-one loss (or some surrogate of it), while the
goal of a ranker might be significantly different. To illustrate this, consider again
the problem of fraud detection. Usually, most of the transactions are benign (say 99.9%). Therefore, a binary classifier that predicts “benign” on all transactions
will have a zero-one error of 0.1%. While this is a very small number, the resulting
predictor is meaningless for the fraud detection application. The crux of the