$p_k$. In other words, if you were given the probability distribution and someone had to communicate to you which class was the one that actually occurred, this is the number of bits that person would need to encode the information if they did it as effectively as possible. (Of course, it is always possible to use more bits.) Because probabilities are always less than one, their logarithms are negative, and the minus sign makes the outcome positive. For example, in a two-class situation—heads or tails—with an equal probability of each class, the occurrence of a head would take 1 bit to transmit, because $-\log_2 \tfrac{1}{2}$ is 1.
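To make this concrete, here is a minimal sketch in Python (the function name and the dictionary layout are our own illustration, not from the book) that computes the informational loss for one predicted distribution and the class that actually occurred:

```python
import math

def informational_loss(probs, actual):
    """Minus the base-2 logarithm of the probability assigned
    to the class that actually occurred."""
    return -math.log2(probs[actual])

# Two-class example: heads or tails, each predicted with probability 1/2.
# Transmitting that a head occurred takes -log2(1/2) = 1 bit.
print(informational_loss({"heads": 0.5, "tails": 0.5}, "heads"))  # 1.0
```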
The expected value of the informational loss function, if the true probabilities are $p_1^*, p_2^*, \ldots, p_k^*$, is

$$-p_1^* \log_2 p_1 - p_2^* \log_2 p_2 - \cdots - p_k^* \log_2 p_k.$$

Like the quadratic loss function, this expression is minimized by choosing $p_j = p_j^*$, in which case the expression becomes the entropy of the true distribution:

$$-p_1^* \log_2 p_1^* - p_2^* \log_2 p_2^* - \cdots - p_k^* \log_2 p_k^*.$$

Thus the informational loss function also rewards honesty in predictors that know the true probabilities, and encourages predictors that do not know them to put forward their best guess.
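As a rough illustration of this property (the distributions below are invented for the example, not taken from the book), the expected loss can be evaluated directly and compared for an honest and a dishonest predictor:

```python
import math

def expected_info_loss(predicted, true):
    """Expected informational loss -sum_j p*_j * log2(p_j) when the true
    class probabilities are `true` and the predictor announces `predicted`."""
    return -sum(t * math.log2(p) for p, t in zip(predicted, true))

true = [0.7, 0.2, 0.1]

# Predicting the true probabilities gives the smallest expected loss,
# which is exactly the entropy of the true distribution (about 1.16 bits).
print(expected_info_loss(true, true))

# Any other prediction gives a larger expected loss (about 1.28 bits here).
print(expected_info_loss([0.5, 0.3, 0.2], true))
```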
The informational loss function also has a gambling interpretation in which you imagine gambling on the outcome, placing odds on each possible class and winning according to the class that comes up. Successive instances are like successive bets: you carry wins (or losses) over from one to the next. The logarithm of the total amount of money you win over the whole test set is the value of the informational loss function. In gambling, it pays to be able to predict the odds as accurately as possible; in that sense, honesty pays, too.
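One way to make the gambling picture concrete is a Kelly-style betting scheme, sketched below; the scheme and the fair-odds assumption are our own illustration of the idea rather than a construction given in the text. Betting the whole bankroll each round, split in proportion to the predicted probabilities, makes the logarithm of the winnings differ from the negated total informational loss only by a constant:

```python
import math

def final_bankroll(predictions, outcomes, start=1.0):
    """Bet the whole bankroll each round, split across the k classes in
    proportion to the predicted probabilities.  The stake on the class
    that comes up is multiplied by k (fair odds for k equally likely
    classes); every other stake is lost."""
    bankroll = start
    for probs, outcome in zip(predictions, outcomes):
        bankroll *= len(probs) * probs[outcome]
    return bankroll

preds = [{"a": 0.7, "b": 0.2, "c": 0.1},
         {"a": 0.6, "b": 0.3, "c": 0.1}]
actual = ["a", "b"]

total_loss = -sum(math.log2(p[o]) for p, o in zip(preds, actual))

# log2 of the winnings and the negated total loss differ only by the
# constant len(preds) * log2(3); both print about 0.92 here.
print(math.log2(final_bankroll(preds, actual)))
print(len(preds) * math.log2(3) - total_loss)
```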
One problem with the informational loss function is that if you assign a probability of zero to an event that actually occurs, the function's value is minus infinity. This corresponds to losing your shirt when gambling. Prudent punters never bet everything on a particular event, no matter how certain it appears. Likewise, prudent predictors operating under the informational loss function do not assign zero probability to any outcome. This leads to a problem when no information is available about that outcome on which to base a prediction: this is called the zero-frequency problem, and various plausible solutions have been proposed, such as the Laplace estimator discussed for Naïve Bayes on page 91.
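A minimal sketch of one such remedy, the Laplace estimator, is shown below (the helper name and the counts are our own illustration): adding one to every class count ensures that even a class never seen in the training data receives a small nonzero probability, so the informational loss stays finite.

```python
import math

def laplace_estimate(counts, num_classes):
    """Add one to every class count so that no class is assigned
    zero probability."""
    total = sum(counts.values()) + num_classes
    return [(counts.get(c, 0) + 1) / total for c in range(num_classes)]

counts = {0: 7, 1: 3}            # class 2 never observed in the training data
probs = laplace_estimate(counts, 3)
print(probs)                     # class 2 gets 1/13 rather than 0
print(-math.log2(probs[2]))      # finite loss (about 3.7 bits) if class 2 occurs
```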

Discussion

If you are in the business of evaluating predictions of probabilities, which of the two loss functions should you use? That's a good question, and there is no universally agreed-upon answer—it's really a matter of taste. Both do the funda-

