

where $i$ is the correct class. When the test set contains several instances, the loss
function is summed over them all.
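
As a concrete illustration (not from the book, which gives no code here), the following minimal Python sketch computes the quadratic loss and sums it over a test set; the function names and the three-class example data are invented for this example, and $a$ denotes the 0/1 indicator vector of the actual class:

```python
import numpy as np

def quadratic_loss(p, actual):
    # Loss for one instance: sum_j (p_j - a_j)^2, where a_j = 1
    # for the actual class and 0 for every other class.
    a = np.zeros_like(p)
    a[actual] = 1.0
    return np.sum((p - a) ** 2)

def test_set_loss(preds, actuals):
    # When the test set contains several instances, the loss
    # function is summed over them all.
    return sum(quadratic_loss(p, i) for p, i in zip(preds, actuals))

preds = [np.array([0.7, 0.2, 0.1]),   # confident and correct
         np.array([0.1, 0.8, 0.1])]   # confident and wrong
actuals = [0, 2]
print(test_set_loss(preds, actuals))  # 0.14 + 1.46 = 1.60
```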
It is an interesting theoretical fact that if you seek to minimize the value of
the quadratic loss function in a situation in which the actual class is generated
probabilistically, the best strategy is to choose for the $p$ vector the actual
probabilities of the different outcomes, that is, $p_i = \Pr[\text{class} = i]$. If the true
probabilities are known, they will be the best values for $p$. If they are not, a system
that strives to minimize the quadratic loss function will be encouraged to use
its best estimate of $\Pr[\text{class} = i]$ as the value for $p_i$.
This is quite easy to see. Denote the true probabilities by $p_1^*, p_2^*, \ldots, p_k^*$ so that
$p_i^* = \Pr[\text{class} = i]$. The expected value of the quadratic loss function for a test
instance can be rewritten as follows:

$$
\begin{aligned}
E\Bigl[\sum_j (p_j - a_j)^2\Bigr] &= \sum_j \bigl(E[p_j^2] - 2E[p_j a_j] + E[a_j^2]\bigr) \\
&= \sum_j \bigl(p_j^2 - 2 p_j p_j^* + p_j^*\bigr) \\
&= \sum_j \bigl((p_j - p_j^*)^2 + p_j^*(1 - p_j^*)\bigr).
\end{aligned}
$$

The first stage just involves bringing the expectation inside the sum and expanding
the square. For the second, $p_j$ is just a constant and the expected value of $a_j$
is simply $p_j^*$; moreover, because $a_j$ is either 0 or 1, $a_j^2 = a_j$ and its expected value
is $p_j^*$ too. The third stage is straightforward algebra. To minimize the resulting
sum, it is clear that it is best to choose $p_j = p_j^*$ so that the squared term disappears
and all that is left is a term that is just the variance of the true distribution
governing the actual class.
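
To make the result tangible, here is an illustrative Python sketch (the three-class distribution is invented for this example) that estimates the expected quadratic loss by simulation and compares it with the closed form derived above; the loss is smallest when the predicted vector equals the true probabilities:

```python
import numpy as np

rng = np.random.default_rng(0)
p_true = np.array([0.5, 0.3, 0.2])            # true probabilities p*_j
classes = rng.choice(3, size=100_000, p=p_true)
onehot = np.eye(3)[classes]                   # indicator vectors a

def simulated_loss(p):
    # Monte Carlo estimate of E[sum_j (p_j - a_j)^2]
    return np.mean(np.sum((p - onehot) ** 2, axis=1))

def expected_loss(p):
    # Closed form: sum_j (p_j - p*_j)^2 + p*_j (1 - p*_j)
    return np.sum((p - p_true) ** 2 + p_true * (1 - p_true))

for p in [p_true, np.array([0.6, 0.3, 0.1]), np.full(3, 1 / 3)]:
    print(p, simulated_loss(p), expected_loss(p))
# The loss is minimized at p = p_true (about 0.62 here), where only
# the variance term sum_j p*_j (1 - p*_j) remains.
```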
Minimizing the squared error has a long history in prediction problems. In
the present context, the quadratic loss function forces the predictor to be honest
about choosing its best estimate of the probabilities—or, rather, it gives prefer-
ence to predictors that are able to make the best guess at the true probabilities.
Moreover, the quadratic loss function has some useful theoretical properties that
we will not go into here. For all these reasons it is frequently used as the crite-
rion of success in probabilistic prediction situations.


Informational loss function

Another popular criterion for the evaluation of probabilistic prediction is the
informational loss function:

$$-\log_2 p_i,$$

where the $i$th prediction is the correct one. This is in fact identical to the negative
of the log-likelihood function that is optimized by logistic regression,
described in Section 4.6. It represents the information (in bits) required to
express the actual class $i$ with respect to the probability distribution $p_1, p_2, \ldots, p_k$.
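
A small Python sketch (illustrative only; the distribution below is invented) shows the informational loss for a confident correct prediction and for a case where an unlikely class occurs; summed over a test set, it gives the negative base-2 log-likelihood:

```python
import numpy as np

def informational_loss(p, actual):
    # -log2(p_i): bits needed to express the actual class i
    # with respect to the predicted distribution p
    return -np.log2(p[actual])

p = np.array([0.7, 0.2, 0.1])
print(informational_loss(p, 0))   # ~0.51 bits: the likely class occurred
print(informational_loss(p, 2))   # ~3.32 bits: an unlikely class occurred
```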


