Data Mining: Practical Machine Learning Tools and Techniques, Second Edition

in some other (previous or future) instance. TimeSeriesDelta replaces attribute
values in the current instance with the difference between the current value and
the value in some other instance. In both cases, instances in which the
time-shifted value is unknown may be removed, or missing values may be used.
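
The delta operation described above can be sketched in a few lines of Python. This is an illustration only, not Weka's implementation: the function name time_series_delta, the shift parameter, and the fill_missing flag are assumptions made for the example, and None stands in for a missing value.

```python
def time_series_delta(values, shift=-1, fill_missing=True):
    """Replace each value with its difference from the value `shift` instances away.

    A negative shift refers to a previous instance, a positive shift to a
    future one. Hypothetical helper: names and defaults are illustrative.
    """
    result = []
    for i, current in enumerate(values):
        j = i + shift
        known = 0 <= j < len(values) and current is not None and values[j] is not None
        if known:
            result.append(current - values[j])
        elif fill_missing:
            result.append(None)  # keep the instance, mark its value as missing
        # otherwise the instance is dropped from the output
    return result


series = [10, 12, 15, 11]
print(time_series_delta(series))                      # [None, 2, 3, -4]
print(time_series_delta(series, fill_missing=False))  # [2, 3, -4]
```

The two calls correspond to the two options in the text: keeping the instance with a missing value, or removing it.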

Randomizing
Other attribute filters degrade the data. AddNoise takes a nominal attribute and
changes a given percentage of its values. Missing values can be retained or
changed along with the rest. Obfuscate anonymizes data by renaming the relation,
attribute names, and nominal and string attribute values. RandomProjection
projects the dataset onto a lower-dimensional subspace using a random
matrix with columns of unit length (Section 7.3). The class attribute is not
included in the projection.
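
As a rough illustration of projecting with a random matrix whose columns have unit length, the sketch below uses only the Python standard library. It is not Weka's code: the function name random_projection, the seed parameter, and the choice of Gaussian matrix entries are assumptions, and the caller is expected to pass only the non-class attributes.

```python
import math
import random


def random_projection(rows, k, seed=0):
    """Project each row (a list of d numbers) onto k dimensions.

    Assumption for this sketch: matrix entries are drawn from a standard
    Gaussian, then each column is scaled to unit Euclidean length.
    """
    rng = random.Random(seed)
    d = len(rows[0])
    columns = []
    for _ in range(k):
        col = [rng.gauss(0.0, 1.0) for _ in range(d)]
        norm = math.sqrt(sum(x * x for x in col))
        columns.append([x / norm for x in col])
    # Each projected value is the dot product of a row with one unit column.
    return [[sum(a * b for a, b in zip(row, col)) for col in columns]
            for row in rows]


data = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0]]
projected = random_projection(data, k=2)
print(len(projected), len(projected[0]))  # 2 2
```

Because every column has unit length, no projected value can exceed the Euclidean length of the original row in absolute value.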

Unsupervised instance filters

Weka’s instance filters, listed in Table 10.2, affect all instances in a dataset rather
than all values of a particular attribute or attributes.

Randomizing and subsampling
You can Randomize the order of instances in the dataset. Normalize treats all
numeric attributes (excluding the class) as a vector and normalizes it to a given
length. You can specify the vector length and the norm to be used.
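
A minimal sketch of normalizing an instance to a given length under a chosen norm follows. The function name normalize_instance and its parameter names are hypothetical and do not correspond to Weka option names; the input is assumed to hold only the numeric, non-class attribute values of one instance.

```python
def normalize_instance(values, length=1.0, norm=2.0):
    """Scale a list of numbers so that its L-`norm` norm equals `length`."""
    size = sum(abs(v) ** norm for v in values) ** (1.0 / norm)
    if size == 0:
        return list(values)  # an all-zero vector cannot be rescaled
    return [v * length / size for v in values]


print(normalize_instance([3.0, 4.0]))              # Euclidean norm, length 1
print(normalize_instance([3.0, 4.0], length=10.0)) # Euclidean norm, length 10
print(normalize_instance([3.0, 4.0], norm=1.0))    # L1 norm, length 1
```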
There are various ways of generating subsets of the data. Use Resample to
produce a random sample by sampling with replacement or RemoveFolds to split


Table 10.2 Unsupervised instance filters.

Name                 Function
-------------------  -------------------------------------------------------------
NonSparseToSparse    Convert all incoming instances to sparse format (Section 2.4)
Normalize            Treat numeric attributes as a vector and normalize it to a
                     given length
Randomize            Randomize the order of instances in a dataset
RemoveFolds          Output a specified cross-validation fold for the dataset
RemoveMisclassified  Remove instances incorrectly classified according to a
                     specified classifier (useful for removing outliers)
RemovePercentage     Remove a given percentage of a dataset
RemoveRange          Remove a given range of instances from a dataset
RemoveWithValues     Filter out instances with certain attribute values
Resample             Produce a random subsample of a dataset, sampling with
                     replacement
SparseToNonSparse    Convert all incoming sparse instances into nonsparse format
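
Two of the subsampling filters in Table 10.2 can be illustrated with a short Python sketch. It is a simplification rather than Weka's implementation: the function names, the percent and seed parameters, and the round-robin assignment of instances to folds are all assumptions made for the example.

```python
import random


def resample(instances, percent=100.0, seed=1):
    """Draw a random sample with replacement, sized as a percentage of the input."""
    rng = random.Random(seed)
    n = int(len(instances) * percent / 100.0)
    return [rng.choice(instances) for _ in range(n)]


def select_fold(instances, num_folds, fold, invert=False):
    """Return one cross-validation fold (1-based), or everything except that fold.

    Assumption: instances are assigned to folds round-robin by position.
    """
    in_fold = [x for i, x in enumerate(instances) if i % num_folds == fold - 1]
    rest = [x for i, x in enumerate(instances) if i % num_folds != fold - 1]
    return rest if invert else in_fold


data = list(range(10))
print(len(resample(data, percent=50.0)))       # 5
print(select_fold(data, num_folds=5, fold=1))  # [0, 5]
```

Calling select_fold with invert=True yields the complementary training portion, so the two calls together split the data into a test fold and the remainder.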