
\prod_i \bigl( p_A \Pr[x_i \mid A] + p_B \Pr[x_i \mid B] \bigr)
where the probabilities given the clusters A and B are determined from the normal distribution function f(x; μ, σ). This overall likelihood is a measure of the “goodness” of the clustering and increases at each iteration of the EM algorithm. Again, there is a technical difficulty with equating the probability of a particular value of x with f(x; μ, σ), and in this case the effect does not disappear because no probability normalization operation is applied. The upshot is that the preceding likelihood expression is not a probability and does not necessarily lie between zero and one; nevertheless, its magnitude still reflects the quality of the clustering. In practical implementations its logarithm is calculated instead: this is done by summing the logarithms of the individual components, avoiding all the multiplications. But the overall conclusion still holds: you should iterate until the increase in log-likelihood becomes negligible. For example, a practical implementation might iterate until the difference between successive values of the log-likelihood is less than 10^-10 for 10 successive iterations. Typically, the log-likelihood will increase very sharply over the first few iterations and then converge rather quickly to a point that is virtually stationary.
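To make the stopping rule concrete, here is a minimal sketch (in Python, not the book's Weka implementation) of the log-likelihood calculation and the convergence test for the two-cluster, single-attribute case; the parameter names, the em_step helper, and the default thresholds are illustrative assumptions rather than anything prescribed above.

import math

def density(x, mu, sigma):
    # Normal distribution function f(x; mu, sigma)
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def log_likelihood(data, p_a, mu_a, sigma_a, mu_b, sigma_b):
    # Sum of log(p_A Pr[x_i|A] + p_B Pr[x_i|B]) over all instances:
    # the logarithm of the likelihood expression given earlier
    p_b = 1.0 - p_a
    return sum(math.log(p_a * density(x, mu_a, sigma_a) +
                        p_b * density(x, mu_b, sigma_b)) for x in data)

def em_step(data, p_a, mu_a, sigma_a, mu_b, sigma_b):
    # One EM iteration: E-step (cluster weights for each instance)
    # followed by M-step (weighted re-estimates of the parameters).
    # No guard against degenerate clusters (zero weight or variance).
    p_b = 1.0 - p_a
    resp = [p_a * density(x, mu_a, sigma_a) /
            (p_a * density(x, mu_a, sigma_a) + p_b * density(x, mu_b, sigma_b))
            for x in data]
    w_a, w_b = sum(resp), len(data) - sum(resp)
    mu_a = sum(r * x for r, x in zip(resp, data)) / w_a
    mu_b = sum((1 - r) * x for r, x in zip(resp, data)) / w_b
    sigma_a = math.sqrt(sum(r * (x - mu_a) ** 2 for r, x in zip(resp, data)) / w_a)
    sigma_b = math.sqrt(sum((1 - r) * (x - mu_b) ** 2 for r, x in zip(resp, data)) / w_b)
    return w_a / len(data), mu_a, sigma_a, mu_b, sigma_b

def run_em(data, params, tol=1e-10, patience=10, max_iter=1000):
    # Iterate until the log-likelihood improves by less than tol
    # for `patience` successive iterations
    prev = log_likelihood(data, *params)
    stalled = 0
    for _ in range(max_iter):
        params = em_step(data, *params)
        curr = log_likelihood(data, *params)
        stalled = stalled + 1 if curr - prev < tol else 0
        prev = curr
        if stalled >= patience:
            break
    return params, prev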
Although the EM algorithm is guaranteed to converge to a maximum, this is a local maximum and may not necessarily be the same as the global maximum. For a better chance of obtaining the global maximum, the whole procedure should be repeated several times, with different initial guesses for the parameter values. The overall log-likelihood figure can be used to compare the different final configurations obtained: just choose the largest of the local maxima.
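Reusing run_em and the parameter tuple from the sketch above (again an assumption for illustration, not the book's code), the repeated restarts might be written as follows: each run starts from different random initial guesses, and the configuration with the largest final log-likelihood is kept.

import random

def best_of_restarts(data, restarts=10):
    best_params, best_ll = None, float("-inf")
    lo, hi = min(data), max(data)
    for _ in range(restarts):
        # Different initial guesses on each restart: means drawn from the
        # range of the data, equal mixing weights, a broad standard deviation
        spread = (hi - lo) / 4 or 1.0
        init = (0.5,
                random.uniform(lo, hi), spread,
                random.uniform(lo, hi), spread)
        params, ll = run_em(data, init)
        if ll > best_ll:
            # keep the largest of the local maxima
            best_params, best_ll = params, ll
    return best_params, best_ll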

Extending the mixture model


Now that we have seen the Gaussian mixture model for two distributions, let’s
consider how to extend it to more realistic situations. The basic method is just
the same, but because the mathematical notation becomes formidable we will
not develop it in full detail.
Changing the algorithm from two-class problems to multiclass problems is completely straightforward as long as the number k of normal distributions is given in advance.
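As a rough illustration of the multiclass case, assuming the k mixing weights, means, and standard deviations are held in parallel lists (names chosen here purely for illustration), the E-step simply assigns each instance one responsibility per cluster instead of a single Pr[A|x]:

import math

def normal_density(x, mu, sigma):
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (math.sqrt(2 * math.pi) * sigma)

def responsibilities(x, weights, means, sigmas):
    # Pr[cluster j | x] for j = 1..k, obtained by normalizing the k
    # products of mixing weight and normal density
    joint = [w * normal_density(x, m, s)
             for w, m, s in zip(weights, means, sigmas)]
    total = sum(joint)
    return [j / total for j in joint]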
The model can be extended from a single numeric attribute per instance to multiple attributes as long as independence between attributes is assumed. The probabilities for each attribute are multiplied together to obtain the joint probability for the instance, just as in the Naïve Bayes method.
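A minimal sketch of that independence assumption, using hypothetical parallel lists of per-attribute means and standard deviations for one cluster, sums the logarithms of the per-attribute densities, which is equivalent to multiplying the densities themselves:

import math

def log_normal_density(x, mu, sigma):
    return (-((x - mu) ** 2) / (2 * sigma ** 2)
            - math.log(math.sqrt(2 * math.pi) * sigma))

def log_joint_density(instance, means, sigmas):
    # instance, means, and sigmas are parallel lists, one entry per attribute;
    # summing log densities corresponds to multiplying the densities,
    # just as Naive Bayes multiplies per-attribute probabilities
    return sum(log_normal_density(x, m, s)
               for x, m, s in zip(instance, means, sigmas))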
