

violent religious–scientific upheavals and painful reexamination of humankind's role in God's universe.
Back to the dataset. Just as in these examples, there is nothing to stop you transforming all the data points into a different coordinate system. But unlike these examples, in data mining there often is a preferred coordinate system, defined not by some external convention but by the very data itself. Whatever coordinates you use, the cloud of points has a certain variance in each direction, indicating the degree of spread around the mean value in that direction. It is a curious fact that if you add up the variances along each axis and then transform the points into a different coordinate system and do the same there, you get the same total variance in both cases. This is always true provided that the coordinate systems are orthogonal, that is, each axis is at right angles to the others.
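You can check this invariance with a small sketch (ours, in Python with NumPy; the point cloud and the random rotation are made up purely for illustration): sum the per-axis variances of a cloud of points, rotate the points into a different orthogonal coordinate system, and sum again.

    import numpy as np

    rng = np.random.default_rng(0)
    # A made-up cloud of 500 points in three dimensions, stretched so that
    # the variance differs from direction to direction.
    points = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5])

    total_before = points.var(axis=0).sum()   # sum of variances along each axis

    # A random orthogonal matrix: the Q factor of a QR decomposition is
    # orthogonal, so multiplying by it is a rigid change of coordinates.
    q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
    rotated = points @ q                      # the same points, new coordinates

    total_after = rotated.var(axis=0).sum()
    print(total_before, total_after)          # the two totals agree to rounding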
The idea of principal components analysis is to use a special coordinate
system that depends on the cloud of points as follows: place the first axis in the
direction of greatest variance of the points to maximize the variance along that
axis. The second axis is perpendicular to it. In two dimensions there is no
choice—its direction is determined by the first axis—but in three dimensions
it can lie anywhere in the plane perpendicular to the first axis, and in higher
dimensions there is even more choice, although it is always constrained to be
perpendicular to the first axis. Subject to this constraint, choose the second axis
in the way that maximizes the variance along it. Continue, choosing each axis
to maximize its share of the remaining variance.
How do you do this? It's not hard, given an appropriate computer program, and it's not hard to understand, given the appropriate mathematical tools. Technically, for those who understand the terms, you calculate the covariance matrix of the original coordinates of the points and diagonalize it to find the eigenvectors. These are the axes of the transformed space, sorted in order of eigenvalue, because each eigenvalue gives the variance along its axis.
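For the computationally minded, here is a minimal sketch of that recipe in Python with NumPy (an illustration of ours, not the implementation behind Figure 7.5): center the points, form the covariance matrix, and diagonalize it.

    import numpy as np

    def principal_components(points):
        """Return (eigenvalues, eigenvectors) sorted by decreasing variance."""
        centered = points - points.mean(axis=0)     # measure from the mean
        cov = np.cov(centered, rowvar=False)        # covariance matrix
        values, vectors = np.linalg.eigh(cov)       # diagonalize (symmetric)
        order = np.argsort(values)[::-1]            # largest eigenvalue first
        return values[order], vectors[:, order]     # axes of the new space

Because a covariance matrix is symmetric, its eigenvalues are real and its eigenvectors mutually orthogonal, which is why the symmetric routine eigh is the right tool here.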
Figure 7.5 shows the result of transforming a particular dataset with 10
numeric attributes, corresponding to points in 10-dimensional space. Imagine
the original dataset as a cloud of points in 10 dimensions—we can’t draw it!
Choose the first axis along the direction of greatest variance, the second per-
pendicular to it along the direction of next greatest variance, and so on. The
table gives the variance along each new coordinate axis in the order in which
the axes were chosen. Because the sum of the variances is constant regardless of
the coordinate system, they are expressed as percentages of that total. We call the axes components and say that each one "accounts for" its share of the variance.
Figure 7.5(b) plots the variance that each component accounts for against the
component’s number. You can use all the components as new attributes for data
mining, or you might want to choose just the first few, the principal components, and discard the rest.
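Continuing the sketch above (the function name project and the 95 percent coverage figure are assumptions for illustration, not from the original experiment), here is one way to express each component's share of the variance as a percentage and keep just enough leading components to account for a chosen fraction of the total:

    import numpy as np

    def project(points, coverage=0.95):
        """Keep enough leading components to account for the given
        fraction of the total variance; return the transformed data."""
        values, vectors = principal_components(points)
        percent = 100.0 * values / values.sum()     # share of total variance
        cumulative = np.cumsum(percent)             # running total of shares
        k = int(np.searchsorted(cumulative, 100.0 * coverage)) + 1
        centered = points - points.mean(axis=0)
        return centered @ vectors[:, :k]            # first k components only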
