Data Mining: Practical Machine Learning Tools and Techniques, Second Edition




method. Both are chosen in the usual way and configured with the object editor.
You must also decide which attribute to use as the class. Attribute selection can
be performed using the full training set or using cross-validation. In the latter
case it is done separately for each fold, and the output shows how many times—
that is, in how many of the folds—each attribute was selected. The results are
stored in the history list. When you right-click an entry here you can visualize
the dataset in terms of the selected attributes (choose Visualize reduced data).
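
The same choices can be made programmatically. The following is a minimal sketch using the Weka Java API; the evaluator, search method, and file name are illustrative choices, not the panel's only options.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.attributeSelection.AttributeSelection;
    import weka.attributeSelection.BestFirst;
    import weka.attributeSelection.CfsSubsetEval;
    import weka.core.Instances;

    public class AttributeSelectionSketch {
      public static void main(String[] args) throws Exception {
        // Load the data and nominate the class attribute (here, the last one)
        Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Choose an attribute evaluator and a search method, as in the panel
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval());
        selector.setSearch(new BestFirst());

        // Evaluate on the full training set and print the selected attributes
        selector.SelectAttributes(data);
        System.out.println(selector.toResultsString());
      }
    }

The panel's cross-validation mode, which reports how often each attribute is selected across folds, is also available through the selector's cross-validation settings in the API.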

Visualization

The Visualize panel helps you visualize the dataset itself, rather than the result of a classification or clustering model. It displays a matrix of two-dimensional scatter plots of every pair of attributes. Figure 10.16(a) shows the iris dataset. You can select an attribute, normally the class, for coloring the data points using the controls at the bottom. If it is nominal, the coloring is discrete; if it is numeric, the color spectrum ranges continuously from blue (low values) to orange (high values). Data points with no class value are shown in black. You can change the size of each plot, the size of the points, and the amount of jitter, a random displacement applied to the X and Y values to separate points that lie on top of one another. Without jitter, 1000 instances at the same data point would look just the same as one instance. You can reduce the size of the matrix of plots by selecting certain attributes, and you can subsample the data for efficiency. Changes in the controls do not take effect until the Update button is clicked.
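
To make the jitter idea concrete, here is a minimal sketch; the scale factor and method names are hypothetical and are not Weka's internal implementation.

    import java.util.Random;

    public class JitterSketch {
      // Displace a coordinate by a small random amount, expressed as a fraction
      // of the axis range, so that coincident points become visible when plotted
      static double jitter(double value, double axisRange, double fraction, Random rnd) {
        return value + (rnd.nextDouble() - 0.5) * axisRange * fraction;
      }

      public static void main(String[] args) {
        Random rnd = new Random(1);
        double axisRange = 7.9 - 4.3;  // e.g., the range of sepallength in the iris data
        for (int i = 0; i < 3; i++) {
          // three instances with the identical value 5.0 plot at slightly different positions
          System.out.println(jitter(5.0, axisRange, 0.05, rnd));
        }
      }
    }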
Click one of the plots in the matrix to enlarge it. For example, clicking on
the top left plot brings up the panel in Figure 10.16(b). You can zoom in on any
area of this panel by choosing Rectangle from the menu near the top right and
dragging out a rectangle on the viewing area like that shown. The Submit button
near the top left rescales the rectangle into the viewing area.

10.3 Filtering algorithms


Now we take a detailed look at the filtering algorithms implemented within
Weka. These are accessible from the Explorer, and also from the Knowledge Flow
and Experimenter interfaces described in Chapters 11 and 12. All filters transform
the input dataset in some way. When a filter is selected using the Choose
button, its name appears in the line beside that button. Click that line to get a
generic object editor to specify its properties. What appears in the line is the
command-line version of the filter, and the parameters are specified with minus
signs. This is a good way of learning how to use the Weka commands directly.
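As an illustration, the option string that appears beside the Choose button can be passed straight to the corresponding filter class. The following is a minimal sketch using the unsupervised Remove filter; the option string -R 1,2 and the file name are illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import weka.core.Instances;
    import weka.core.Utils;
    import weka.filters.Filter;
    import weka.filters.unsupervised.attribute.Remove;

    public class FilterOptionsSketch {
      public static void main(String[] args) throws Exception {
        Instances data = new Instances(new BufferedReader(new FileReader("iris.arff")));

        // "-R 1,2" is the same minus-sign option string shown in the Explorer:
        // it tells Remove to delete the first two attributes
        Remove remove = new Remove();
        remove.setOptions(Utils.splitOptions("-R 1,2"));
        remove.setInputFormat(data);  // must be called before the filter is applied

        Instances reduced = Filter.useFilter(data, remove);
        System.out.println(reduced.numAttributes() + " attributes remain");
      }
    }

The same option string works when the filter is run directly from the command line, for example java weka.filters.unsupervised.attribute.Remove -R 1,2 -i iris.arff -o reduced.arff, where the -i and -o file names here are again illustrative.
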
There are two kinds of filter: unsupervised and supervised (Section 7.2). This
seemingly innocuous distinction masks a rather fundamental issue. Filters are