Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

and discard the rest. In this case, three principal components account for 84%
of the variance in the dataset; seven account for more than 95%.
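The figures quoted above can be checked with a short calculation. The following is a minimal sketch, assuming the per-component variance percentages listed in Figure 7.5(a); the helper name components_needed is hypothetical and introduced here only for illustration. It accumulates the percentages and reports the smallest number of leading components that reaches a chosen threshold.

```python
from itertools import accumulate

# Per-component variance percentages, as listed in Figure 7.5(a).
variance = [61.2, 18.0, 4.7, 4.0, 3.2, 2.9, 2.0, 1.7, 1.4, 0.9]


def components_needed(variance_pct, target_pct):
    """Return the smallest number of leading components whose
    cumulative variance percentage reaches target_pct.

    Hypothetical helper for illustration; a small tolerance guards
    against floating-point rounding in the running total.
    """
    for count, total in enumerate(accumulate(variance_pct), start=1):
        if total >= target_pct - 1e-9:
            return count
    raise ValueError("target exceeds the total variance")


# Running totals, rounded to one decimal place like the table.
cumulative = [round(c, 1) for c in accumulate(variance)]

print(cumulative)
print(components_needed(variance, 95))
```

With these values the running total after three components is 83.9% (the "84%" quoted in the text), and seven components are the first to exceed 95%.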
On numeric datasets it is common to use principal components analysis
before data mining as a form of data cleanup and attribute generation. For
example, you might want to replace the numeric attributes with the principal
component axes or with a subset of them that accounts for a given proportion—
say, 95%—of the variance. Note that the scale of the attributes affects the

308 CHAPTER 7 | TRANSFORMATIONS: ENGINEERING THE INPUT AND OUTPUT


Axis   Variance   Cumulative
1      61.2%      61.2%
2      18.0%      79.2%
3       4.7%      83.9%
4       4.0%      87.9%
5       3.2%      91.1%
6       2.9%      94.0%
7       2.0%      96.0%
8       1.7%      97.7%
9       1.4%      99.1%
10      0.9%      100%

(a)

[Plot: percentage of variance (vertical axis, 0% to 70%) against component number (horizontal axis, 1 to 10)]

(b)
Figure 7.5 Principal components transform of a dataset: (a) variance of each component and (b) variance plot.
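The preprocessing step described in the text, replacing numeric attributes with the subset of principal component axes that accounts for a given proportion of the variance, might be sketched as follows. This is an illustration under stated assumptions, not the book's implementation: it assumes NumPy is available, the function name pca_reduce and its parameters are hypothetical, and the optional standardization step is a common practice added here because attribute scale influences the result. The demonstration data is synthetic and unrelated to the dataset in Figure 7.5.

```python
import numpy as np


def pca_reduce(X, proportion=0.95, standardize=True):
    """Project X onto the leading principal components that together
    account for at least `proportion` of the variance.

    Hypothetical helper for illustration. Assumes X has at least two
    numeric attributes (columns). Returns the transformed data and the
    per-component variance ratios, largest first.
    """
    X = np.asarray(X, dtype=float)
    centered = X - X.mean(axis=0)
    if standardize:
        # Assumption: rescale each attribute to unit variance first.
        std = centered.std(axis=0, ddof=1)
        std[std == 0] = 1.0  # leave constant attributes unscaled
        centered = centered / std

    # Eigen-decomposition of the (symmetric) covariance matrix.
    cov = np.atleast_2d(np.cov(centered, rowvar=False))
    eigvals, eigvecs = np.linalg.eigh(cov)

    # eigh returns eigenvalues in ascending order; reverse them.
    order = np.argsort(eigvals)[::-1]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]

    ratios = eigvals / eigvals.sum()
    # Smallest k whose cumulative ratio reaches the requested proportion.
    k = int(np.searchsorted(np.cumsum(ratios), proportion - 1e-12)) + 1
    k = min(k, len(ratios))
    return centered @ eigvecs[:, :k], ratios


# Synthetic demonstration: the third attribute nearly duplicates the first.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
X = np.column_stack(
    [base[:, 0], base[:, 1], base[:, 0] + 0.01 * rng.normal(size=200)]
)
Z, ratios = pca_reduce(X, proportion=0.95)
print(Z.shape, np.round(100 * ratios, 1))
```

Because one attribute in the synthetic data is almost redundant, fewer axes than original attributes are retained. On real data the number kept depends on how the variance is distributed, as in the table of Figure 7.5(a).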