$p_k$. In other words, if you were given the probability distribution and someone had to communicate to you which class was the one that actually occurred, this is the number of bits that person would need to encode the information if they did it as effectively as possible. (Of course, it is always possible to use more bits.) Because probabilities are always less than one, their logarithms are negative, and the minus sign makes the outcome positive. For example, in a two-class situation—heads or tails—with an equal probability of each class, the occurrence of a head would take 1 bit to transmit, because $-\log_2 \tfrac{1}{2}$ is 1.
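To make this concrete, here is a minimal sketch in Python (the function name and the dictionary layout are our own illustration, not from the book) that computes the informational loss for one predicted distribution and the class that actually occurred:

```python
import math

def informational_loss(probs, actual):
    """Minus the base-2 logarithm of the probability assigned
    to the class that actually occurred."""
    return -math.log2(probs[actual])

# Two-class example: heads or tails, each predicted with probability 1/2.
# Transmitting that a head occurred takes -log2(1/2) = 1 bit.
print(informational_loss({"heads": 0.5, "tails": 0.5}, "heads"))  # 1.0
```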
The expected value of the informational loss function, if the true probabilities are $p_1^*, p_2^*, \ldots, p_k^*$, is

$$-p_1^* \log_2 p_1 - p_2^* \log_2 p_2 - \cdots - p_k^* \log_2 p_k.$$

Like the quadratic loss function, this expression is minimized by choosing $p_j = p_j^*$, in which case the expression becomes the entropy of the true distribution:

$$-p_1^* \log_2 p_1^* - p_2^* \log_2 p_2^* - \cdots - p_k^* \log_2 p_k^*.$$

Thus the informational loss function also rewards honesty in predictors that know the true probabilities, and encourages predictors that do not know them to put forward their best guess.
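As a rough illustration of this property (the distributions below are invented for the example, not taken from the book), the expected loss can be evaluated directly and compared for an honest and a dishonest predictor:

```python
import math

def expected_info_loss(predicted, true):
    """Expected informational loss -sum_j p*_j * log2(p_j) when the true
    class probabilities are `true` and the predictor announces `predicted`."""
    return -sum(t * math.log2(p) for p, t in zip(predicted, true))

true = [0.7, 0.2, 0.1]

# Predicting the true probabilities gives the smallest expected loss,
# which is exactly the entropy of the true distribution (about 1.16 bits).
print(expected_info_loss(true, true))

# Any other prediction gives a larger expected loss (about 1.28 bits here).
print(expected_info_loss([0.5, 0.3, 0.2], true))
```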
The informational loss function also has a gambling interpretation in which you imagine gambling on the outcome, placing odds on each possible class and winning according to the class that comes up. Successive instances are like successive bets: you carry wins (or losses) over from one to the next. The logarithm of the total amount of money you win over the whole test set is the value of the informational loss function. In gambling, it pays to be able to predict the odds as accurately as possible; in that sense, honesty pays, too.
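One way to make the gambling picture concrete is a Kelly-style betting scheme, sketched below; the scheme and the fair-odds assumption are our own illustration of the idea rather than a construction given in the text. Betting the whole bankroll each round, split in proportion to the predicted probabilities, makes the logarithm of the winnings differ from the negated total informational loss only by a constant:

```python
import math

def final_bankroll(predictions, outcomes, start=1.0):
    """Bet the whole bankroll each round, split across the k classes in
    proportion to the predicted probabilities.  The stake on the class
    that comes up is multiplied by k (fair odds for k equally likely
    classes); every other stake is lost."""
    bankroll = start
    for probs, outcome in zip(predictions, outcomes):
        bankroll *= len(probs) * probs[outcome]
    return bankroll

preds = [{"a": 0.7, "b": 0.2, "c": 0.1},
         {"a": 0.6, "b": 0.3, "c": 0.1}]
actual = ["a", "b"]

total_loss = -sum(math.log2(p[o]) for p, o in zip(preds, actual))

# log2 of the winnings and the negated total loss differ only by the
# constant len(preds) * log2(3); both print about 0.92 here.
print(math.log2(final_bankroll(preds, actual)))
print(len(preds) * math.log2(3) - total_loss)
```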
One problem with the informational loss function is that if you assign a probability of zero to an event that actually occurs, the function's value is minus infinity. This corresponds to losing your shirt when gambling. Prudent punters never bet everything on a particular event, no matter how certain it appears. Likewise, prudent predictors operating under the informational loss function do not assign zero probability to any outcome. This leads to a problem when no information is available about that outcome on which to base a prediction: this is called the zero-frequency problem, and various plausible solutions have been proposed, such as the Laplace estimator discussed for Naïve Bayes on page 91.
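A minimal sketch of one such remedy, the Laplace estimator, is shown below (the helper name and the counts are our own illustration): adding one to every class count ensures that even a class never seen in the training data receives a small nonzero probability, so the informational loss stays finite.

```python
import math

def laplace_estimate(counts, num_classes):
    """Add one to every class count so that no class is assigned
    zero probability."""
    total = sum(counts.values()) + num_classes
    return [(counts.get(c, 0) + 1) / total for c in range(num_classes)]

counts = {0: 7, 1: 3}            # class 2 never observed in the training data
probs = laplace_estimate(counts, 3)
print(probs)                     # class 2 gets 1/13 rather than 0
print(-math.log2(probs[2]))      # finite loss (about 3.7 bits) if class 2 occurs
```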

Discussion

If you are in the business of evaluating predictions of probabilities, which of the two loss functions should you use? That's a good question, and there is no universally agreed-upon answer—it's really a matter of taste. Both do the funda-

