Data Mining: Practical Machine Learning Tools and Techniques, Second Edition


10.3 FILTERING ALGORITHMS 399


attributes in a dataset into binary ones, replacing each attribute with k values
by k binary attributes using a simple one-per-value encoding. Attributes that are
already binary are left untouched. NumericToBinary converts all numeric
attributes into nominal binary ones (except the class, if set). If the value of the
numeric attribute is exactly 0, the new attribute will be 0, and if it is missing,
the new attribute will be missing; otherwise, the value of the new attribute will
be 1. These filters also skip the class attribute.
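
The two conversions just described can be sketched in a few lines of Python. This is an illustration of the encoding rules only, not Weka's implementation; the function names one_per_value and numeric_to_binary are hypothetical, and a missing value is represented here by None.

```python
from typing import List, Optional


def one_per_value(value: str, values: List[str]) -> List[int]:
    """Replace one nominal value by k binary indicators, one per possible value."""
    return [1 if value == v else 0 for v in values]


def numeric_to_binary(x: Optional[float]) -> Optional[int]:
    """Map exactly 0 to 0, missing (None) to missing, and anything else to 1."""
    if x is None:
        return None
    return 0 if x == 0 else 1


outlook_values = ["sunny", "overcast", "rainy"]
encoded = one_per_value("rainy", outlook_values)  # [0, 0, 1]
flags = [numeric_to_binary(x) for x in [0.0, -2.5, None, 7]]  # [0, 1, None, 1]
```

Note that any nonzero value, including a negative one, maps to 1.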
FirstOrder takes a range of N numeric attributes and replaces them with
N - 1 numeric attributes whose values are the differences between consecutive
attribute values from the original instances. For example, if the original
attribute values were 3, 2, and 1, the new ones will be -1 and -1.
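
As a minimal sketch of this differencing step (the function name first_order is hypothetical and the code is not taken from Weka), each new value is the next original value minus the current one:

```python
from typing import List


def first_order(values: List[float]) -> List[float]:
    """Return the N - 1 differences between consecutive attribute values."""
    return [later - earlier for earlier, later in zip(values, values[1:])]


diffs = first_order([3, 2, 1])  # [-1, -1], matching the example in the text
```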


String conversion
A string attribute has an unspecified number of values. StringToNominal
converts it to nominal with a set number of values. You should ensure that all string
values that will appear in potential test data are represented in the dataset.
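
The warning about test data can be made concrete with a small sketch. It assumes, purely for illustration, that the nominal values are the distinct strings in order of first appearance; the function name string_to_nominal is hypothetical and this is not Weka's code.

```python
from typing import Dict, List, Tuple


def string_to_nominal(strings: List[str]) -> Tuple[List[str], Dict[str, int]]:
    """Fix the set of nominal values to the distinct strings seen in the data."""
    labels: List[str] = []
    index: Dict[str, int] = {}
    for s in strings:
        if s not in index:
            index[s] = len(labels)
            labels.append(s)
    return labels, index


labels, index = string_to_nominal(["red", "green", "red", "blue"])
encoded = [index[s] for s in ["red", "green", "red", "blue"]]
unseen_ok = "yellow" in index  # False: a test value absent from the data has no code
```

Once the value set is fixed, a string that first appears in the test data cannot be represented, which is why every such value should already be present when the filter is applied.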
StringToWordVector produces attributes that represent the frequency of each
word in the string. The set of words (that is, the new attribute set) is
determined from the dataset. By default each word becomes an attribute whose value
is 1 or 0, reflecting that word’s presence in the string. The new attributes can be
named with a user-determined prefix to keep attributes derived from different
string attributes distinct.
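
The default behavior can be illustrated with a short sketch. It assumes that words are runs of alphabetic characters and that the dictionary is kept in sorted order; both choices, and the name word_presence, are assumptions made for the example rather than details of Weka's filter.

```python
import re
from typing import List, Tuple


def word_presence(docs: List[str], prefix: str = "") -> Tuple[List[str], List[List[int]]]:
    """Build one 0/1 attribute per word found anywhere in the dataset."""
    tokenized = [set(re.findall(r"[A-Za-z]+", d)) for d in docs]
    vocabulary = sorted(set().union(*tokenized))
    names = [prefix + word for word in vocabulary]
    rows = [[1 if word in tokens else 0 for word in vocabulary] for tokens in tokenized]
    return names, rows


names, rows = word_presence(["the cat sat", "the dog"], prefix="txt_")
# names: ['txt_cat', 'txt_dog', 'txt_sat', 'txt_the']
# rows:  [[1, 0, 1, 1], [0, 1, 0, 1]]
```

The prefix keeps these attributes apart from those produced by a second string attribute that happens to contain the same words.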
There are many options that affect tokenization. Words can be formed from
contiguous alphabetic sequences or separated by a given set of delimiter
characters. They can be converted to lowercase before being added to the
dictionary, and all words on a predetermined list of English stopwords can be ignored.
Words that are not among the top k words ranked by frequency can be discarded
(slightly more than k words will be retained if there are ties at the kth position).
If a class attribute has been assigned, the top k words for each class will be kept.
The value of each word attribute reflects its presence or absence in the string,
but this can be changed. A count of the number of times the word appears in
the string can be used instead. Word frequencies can be normalized to give each
document’s attribute vector the same Euclidean length. This length is not
chosen to be 1, to avoid the very small numbers that would entail, but to be the
average length of all documents that appear as values of the original string
attribute. Alternatively, the frequencies $f_{ij}$ for word i in document j can be
transformed using $\log(1 + f_{ij})$ or the TF × IDF measure (Section 7.3).
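
Three of these options can be sketched briefly: keeping the top k words with ties, the log transform, and normalization to the average document length. The function names are hypothetical, and ranking words by their total count across the dataset is an assumption made for the example; none of this is Weka's own code.

```python
import math
from collections import Counter
from typing import List, Set


def top_k_words(counts: Counter, k: int) -> Set[str]:
    """Keep the k most frequent words, plus any words tied with the kth."""
    if k <= 0 or not counts:
        return set()
    ranked = sorted(counts.values(), reverse=True)
    threshold = ranked[min(k, len(ranked)) - 1]
    return {word for word, count in counts.items() if count >= threshold}


def log_transform(f: float) -> float:
    """Dampen a raw word frequency f using log(1 + f)."""
    return math.log(1 + f)


def normalize_to_average_length(vectors: List[List[float]]) -> List[List[float]]:
    """Rescale each vector so its Euclidean length equals the average length."""
    lengths = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    average = sum(lengths) / len(lengths)
    return [
        [x * average / length if length > 0 else 0.0 for x in v]
        for v, length in zip(vectors, lengths)
    ]


kept = top_k_words(Counter({"a": 5, "b": 3, "c": 3, "d": 1}), k=2)
scaled = normalize_to_average_length([[3, 4], [0, 10]])
```

With k = 2, three words are kept because "b" and "c" tie at the second position. The two document vectors have lengths 5 and 10, so both are rescaled to the average length of 7.5.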


Time series
Two filters work with time series data. TimeSeriesTranslate replaces the values
of an attribute (or attributes) in the current instance with the equivalent value
