Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


10.3 FILTERING ALGORITHMS


it into a given number of cross-validation folds and reduce it to just one of them.
If a random number seed is provided, the dataset will be shuffled before the
subset is extracted. RemovePercentage removes a given percentage of instances,
and RemoveRange removes a certain range of instance numbers. To remove all
instances that have certain values for nominal attributes, or numeric values
above or below a certain threshold, use RemoveWithValues. By default all
instances are deleted that exhibit one of a given set of nominal attribute values
(if the specified attribute is nominal) or a numeric value below a given
threshold (if it is numeric). However, the matching criterion can be inverted.
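
The default and inverted matching rules described above can be sketched in a few lines of Python. This is only an illustration of the behavior the text describes, not Weka's implementation: the function name, its parameters (nominal_values, split_point, invert), and the sample rows are all hypothetical.

```python
def remove_with_values(rows, attribute, *, nominal_values=None,
                       split_point=None, invert=False):
    """Return the rows that survive removal (hypothetical helper, not Weka code).

    A row "matches" if its nominal value is in nominal_values, or if its
    numeric value is below split_point. Matching rows are removed by
    default; with invert=True only the matching rows are kept.
    """
    if (nominal_values is None) == (split_point is None):
        raise ValueError("give exactly one of nominal_values or split_point")

    def matches(row):
        value = row[attribute]
        if nominal_values is not None:
            return value in nominal_values
        return value < split_point

    # matches(row) == invert keeps non-matching rows by default
    # and matching rows when the criterion is inverted.
    return [row for row in rows if matches(row) == invert]


# Made-up sample data for illustration only.
rows = [
    {"outlook": "sunny", "temperature": 85},
    {"outlook": "rainy", "temperature": 65},
    {"outlook": "overcast", "temperature": 72},
]

kept_default = remove_with_values(rows, "temperature", split_point=70)
kept_inverted = remove_with_values(rows, "temperature", split_point=70, invert=True)
kept_nominal = remove_with_values(rows, "outlook", nominal_values={"sunny"})
```

Here kept_default holds the two rows at or above the threshold, kept_inverted holds the single row below it, and kept_nominal drops the "sunny" row.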
You can remove outliers by applying a classification method to the dataset
(specifying it just as the clustering method was specified previously for
AddCluster) and then using RemoveMisclassified to delete the instances that it
misclassifies.
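
The idea of treating misclassified instances as outliers can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the RemoveMisclassified filter itself: the classifier is a toy nearest-class-mean rule on a single numeric attribute, it is trained and evaluated on the same data, and all names and values are hypothetical.

```python
from collections import defaultdict


def fit_nearest_mean(xs, ys):
    """Toy classifier: predict the class whose mean value is closest."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for x, y in zip(xs, ys):
        sums[y] += x
        counts[y] += 1
    means = {label: sums[label] / counts[label] for label in sums}

    def predict(x):
        return min(means, key=lambda label: abs(x - means[label]))

    return predict


def remove_misclassified(xs, ys, fit):
    """Train on the data, then keep only the instances it classifies correctly."""
    predict = fit(xs, ys)
    return [(x, y) for x, y in zip(xs, ys) if predict(x) == y]


# Made-up data: the last instance carries class "b" but sits among the "a" values.
xs = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 1.1]
ys = ["a", "a", "a", "b", "b", "b", "b"]

cleaned = remove_misclassified(xs, ys, fit_nearest_mean)
```

In this example the instance (1.1, "b") is the only one the classifier gets wrong, so it is the only one removed.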


Sparse instances
The NonSparseToSparse and SparseToNonSparse filters convert between the
regular representation of a dataset and its sparse representation (see Section
2.4).
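
The conversion can be illustrated with a small sketch, assuming that the sparse representation stores only the nonzero values together with their attribute indexes. The function names are hypothetical and the code is not taken from Weka.

```python
def to_sparse(dense_row):
    """Keep only nonzero values, keyed by attribute index."""
    return {index: value for index, value in enumerate(dense_row) if value != 0}


def to_dense(sparse_row, num_attributes):
    """Rebuild the full row, filling missing indexes with zero."""
    return [sparse_row.get(index, 0) for index in range(num_attributes)]


dense = [0, 3, 0, 0, 7]
sparse = to_sparse(dense)
restored = to_dense(sparse, len(dense))
```

The number of attributes has to be supplied when converting back, because the sparse form alone does not record trailing zeros.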


Supervised filters

Supervised filters are available from the Explorer’s Preprocess panel, just as
unsupervised ones are. You need to be careful with them because, despite
appearances, they are not really preprocessing operations. We noted this
previously with regard to discretization (the splits applied to the test data
must not be determined from the test data’s class values, because these are
supposed to be unknown), and it is true for supervised filters in general.
Because of popular demand, Weka allows you to invoke supervised filters as
a preprocessing operation, just like unsupervised filters. However, if you intend
to use them for classification you should adopt a different methodology. A
metalearner is provided that invokes a filter in a way that wraps the learning
algorithm into the filtering mechanism. This filters the test data using the
filter that has been created from the training data. It is also useful for some
unsupervised filters. For example, in StringToWordVector the dictionary will be
created from the training data alone: words that are novel in the test data will
be discarded. To use a supervised filter in this way, invoke the
FilteredClassifier metalearning scheme from the meta section of the menu
displayed by the Classify panel’s Choose button. Figure 10.17(a) shows the
object editor for this metalearning scheme. With it you choose a classifier and
a filter. Figure 10.17(b) shows the menu of filters.
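
The methodology of building the filter from the training data alone and then reusing it on the test data can be sketched as follows. This is a simplified illustration, not Weka's FilteredClassifier or StringToWordVector: the class names, the whitespace tokenization, the toy nearest-neighbour classifier, and the sample documents are all assumptions made for the example.

```python
class WordVectorFilter:
    """Hypothetical word-count filter whose dictionary comes from fit() only."""

    def fit(self, documents):
        words = {w for doc in documents for w in doc.lower().split()}
        self.vocabulary = sorted(words)
        return self

    def transform(self, documents):
        index = {word: i for i, word in enumerate(self.vocabulary)}
        vectors = []
        for doc in documents:
            vector = [0] * len(index)
            for word in doc.lower().split():
                if word in index:  # words not seen in training are discarded
                    vector[index[word]] += 1
            vectors.append(vector)
        return vectors


def fit_nearest_neighbour(vectors, labels):
    """Toy classifier: label of the training vector with the largest overlap."""
    def predict_one(vector):
        overlaps = [sum(a * b for a, b in zip(vector, train)) for train in vectors]
        return labels[overlaps.index(max(overlaps))]
    return predict_one


class FilteredModel:
    """Wraps a filter and a learner so the filter is fitted on training data only."""

    def __init__(self, data_filter, fit_classifier):
        self.data_filter = data_filter
        self.fit_classifier = fit_classifier

    def fit(self, documents, labels):
        self.data_filter.fit(documents)  # the dictionary is built here, once
        features = self.data_filter.transform(documents)
        self.predict_one = self.fit_classifier(features, labels)
        return self

    def predict(self, documents):
        # Test data passes through the filter created from the training data.
        return [self.predict_one(v) for v in self.data_filter.transform(documents)]


# Made-up training and test documents.
train_docs = ["cheap pills now", "meeting at noon"]
train_labels = ["spam", "ham"]

model = FilteredModel(WordVectorFilter(), fit_nearest_neighbour).fit(train_docs, train_labels)
test_vector = model.data_filter.transform(["cheap pills today"])[0]
prediction = model.predict(["cheap pills today"])
```

The word "today" never occurs in the training documents, so it has no slot in test_vector; the test document is represented only by the words the training dictionary already contains.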
Supervised filters, like unsupervised ones, are divided into attribute and
instance filters, listed in Table 10.3 and Table 10.4.
