Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


6.6 CLUSTERING 267


When the dataset is known in advance to contain correlated attributes, the
independence assumption no longer holds. Instead, two attributes can be
modeled jointly using a bivariate normal distribution, in which each has its own
mean value but the two standard deviations are replaced by a “covariance
matrix” with four numeric parameters. There are standard statistical techniques
for estimating the class probabilities of instances and for estimating the
means and covariance matrix given the instances and their class probabilities.
Several correlated attributes can be handled using a multivariate distribution.
The number of parameters increases with the square of the number of jointly
varying attributes. With n independent attributes, there are 2n parameters, a
mean and a standard deviation for each. With n covariant attributes, there are
n + n(n+1)/2 parameters: a mean for each, plus an n × n covariance matrix that
is symmetric and therefore involves n(n+1)/2 different quantities. This escalation
in the number of parameters has serious consequences for overfitting, as
we will explain later.
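The maximization step described above can be sketched in a few lines. This is not the book's code, just a minimal illustration using NumPy; the function name and the data layout (one row per instance, one weight per instance giving its cluster-membership probability) are assumptions:

```python
import numpy as np

def weighted_gaussian_params(X, w):
    """Estimate the mean vector and covariance matrix of one cluster
    from instances X (m rows, n attributes), each weighted by its
    cluster-membership probability w (length m)."""
    w = w / w.sum()                  # normalize the weights
    mean = w @ X                     # weighted mean of each attribute
    D = X - mean                     # deviations from the mean
    cov = (w[:, None] * D).T @ D     # weighted n x n covariance matrix
    return mean, cov

# With n covariant attributes the model has n + n*(n+1)/2 parameters:
# n means plus the distinct entries of the symmetric covariance matrix.
n = 3
print(n + n * (n + 1) // 2)  # 9
```

With equal weights this reduces to the ordinary maximum-likelihood mean and covariance; the probabilistic weights are what make it the EM maximization step rather than plain estimation from labeled data.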
To cater for nominal attributes, the normal distribution must be abandoned.
Instead, a nominal attribute with v possible values is characterized by v numbers
representing the probability of each one. A different set of numbers is needed
for every class, making kv parameters in all, where k is the number of classes.
The situation is very similar to the
Naïve Bayes method. The two steps of expectation and maximization corre-
spond exactly to operations we have studied before. Expectation—estimating
the cluster to which each instance belongs given the distribution parameters—
is just like determining the class of an unknown instance. Maximization—
estimating the parameters from the classified instances—is just like determin-
ing the attribute–value probabilities from the training instances, with the
small difference that in the EM algorithm instances are assigned to classes
probabilistically rather than categorically. In Section 4.2 we encountered
the problem that probability estimates can turn out to be zero, and the same
problem occurs here too. Fortunately, the solution is just as simple—use the
Laplace estimator.
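The maximization step for a nominal attribute, with Laplace smoothing, can be sketched as follows. This is an illustrative fragment rather than the book's implementation; it assumes attribute values are coded as integers 0..v-1 and that each instance carries its cluster-membership probability as a weight:

```python
def laplace_estimates(values, weights, v):
    """Estimate the probabilities of the v values of a nominal attribute
    for one cluster, weighting each instance by its cluster-membership
    probability and applying the Laplace estimator so that no estimate
    can be zero."""
    counts = [1.0] * v              # Laplace estimator: start every count at 1
    for val, w in zip(values, weights):
        counts[val] += w            # add probabilistic, not 0/1, counts
    total = sum(counts)
    return [c / total for c in counts]

# A value never observed in this cluster still gets a small nonzero probability:
probs = laplace_estimates([0, 0, 1], [0.9, 0.8, 0.5], v=3)
print(probs)
```

The only difference from the Naïve Bayes case is that the counts accumulated are the fractional weights rather than whole instances.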
Naïve Bayes assumes that attributes are independent—that is why it is called
“naïve.” A pair of correlated nominal attributes with v1 and v2 possible values,
respectively, can be replaced with a single covariant attribute with v1v2 possible
values. Again, the number of parameters escalates as the number of dependent
attributes increases, and this has implications for probability estimates and over-
fitting that we will come to shortly.
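The replacement of a correlated pair by a single joint attribute amounts to a simple recoding. A minimal sketch, again assuming integer-coded values (the function name is invented for illustration):

```python
def join_attributes(a1, a2, v2):
    """Replace a pair of correlated nominal attributes, coded 0..v1-1
    and 0..v2-1, by a single covariant attribute with v1*v2 values."""
    return [x * v2 + y for x, y in zip(a1, a2)]

# Two attributes with 2 and 3 values become one attribute with 6 values:
print(join_attributes([0, 1, 1], [2, 0, 2], v2=3))  # [2, 3, 5]
```

Each of the v1v2 joint values then gets its own probability estimate, which is exactly where the escalation in parameters comes from.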
The presence of both numeric and nominal attributes in the data to be clus-
tered presents no particular problem. Covariant numeric and nominal attrib-
utes are more difficult to handle, and we will not describe them here.
Missing values can be accommodated in various ways. Missing
values of nominal attributes can simply be left out of the probability calcula-
