Ordinal. Ordinal features lay data on an ordinal scale. In other words,
the feature values have an intrinsic order to them. In our example,
Money Spent is an ordinal feature because a High value for Money
Spent is more than a Low one.
Interval. In interval features, in addition to their intrinsic ordering,
differences are meaningful whereas ratios are meaningless. For interval
features, addition and subtraction are allowed, whereas multiplication
and division are not. Consider two time readings: 6:16 PM
and 3:08 PM. The difference between these two time readings is
meaningful (3 hours and 8 minutes); however, there is no meaning to
$\frac{6{:}16\ \text{PM}}{3{:}08\ \text{PM}} \neq 2$.
Ratio. Ratio features, as the name suggests, add the properties of
multiplication and division. An individual’s income is an
example of a ratio feature, where not only differences and additions
are meaningful but ratios also have meaning (e.g., an individual’s
income can be twice as much as John’s income).
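The operations each feature type permits can be sketched in a few lines of Python. This is only an illustration: the numeric ranks assigned to Low and High, the calendar date attached to the two time readings, and the income figures are all assumptions made for the example, not values from the text.

```python
from datetime import datetime

# Ordinal: values can be ordered, but differences carry no meaning.
# The rank numbers below are an illustrative assumption.
MONEY_SPENT_RANK = {"Low": 0, "High": 1}

def spent_more(a: str, b: str) -> bool:
    """Return True if ordinal level a ranks above ordinal level b."""
    return MONEY_SPENT_RANK[a] > MONEY_SPENT_RANK[b]

# Interval: subtraction is meaningful, division is not.
t1 = datetime(2014, 1, 13, 18, 16)  # 6:16 PM (the date is arbitrary)
t2 = datetime(2014, 1, 13, 15, 8)   # 3:08 PM
difference = t1 - t2                # a timedelta of 3 hours 8 minutes

# Ratio: a true zero exists, so division is meaningful.
john_income = 40_000                # hypothetical figure
other_income = 80_000               # hypothetical figure
income_ratio = other_income / john_income
```

Python's `datetime` type happens to mirror the interval scale here: `t1 - t2` is defined, whereas `t1 / t2` raises a `TypeError`.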
In social media, individuals generate many types of nontabular data, such
as text, voice, or video. These types of data are first converted to tabular
data and then processed using data mining algorithms. For instance, voice
can be converted to feature values using approximation techniques such
as the fast Fourier transform (FFT) and then processed using data mining
algorithms. To convert text into the tabular format, we can use a process
called vectorization. A variety of vectorization methods exist. A well-
known method for vectorization is the vector space model introduced by
Salton, Wong, and Yang [1975].
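As a rough illustration of converting text to tabular form, the sketch below gives each document a row of 0/1 values indicating which vocabulary words it contains. The function name `vectorize`, the whitespace tokenization, the lowercasing, and the two sample documents are assumptions made for the example; binary presence is only one of several possible weighting schemes.

```python
def vectorize(documents):
    """Map each document to a 0/1 vector over a shared vocabulary."""
    # Whitespace tokenization and lowercasing are simplifying assumptions.
    tokenized = [set(doc.lower().split()) for doc in documents]
    # The vocabulary is every distinct word, sorted for a stable column order.
    vocabulary = sorted(set().union(*tokenized))
    # One row per document, one column per vocabulary word.
    vectors = [[1 if word in tokens else 0 for word in vocabulary]
               for tokens in tokenized]
    return vocabulary, vectors

docs = ["social media mining", "data mining essentials"]  # hypothetical documents
vocabulary, vectors = vectorize(docs)
# vocabulary: ['data', 'essentials', 'media', 'mining', 'social']
# vectors:    [[0, 0, 1, 1, 1], [1, 1, 0, 1, 0]]
```

Each row is a tabular representation of one document, with one column per vocabulary word, so standard data mining algorithms can consume it directly.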
Vector Space Model
In the vector space model, we are given a set of documents $D$. Each document
is a set of words. The goal is to convert these textual documents to
[feature] vectors. We can represent document $i$ with vector $d_i$,
$$d_i = (w_{1,i}, w_{2,i}, \ldots, w_{N,i}), \tag{5.1}$$
where $w_{j,i}$ represents the weight for word $j$ that occurs in document $i$ and
$N$ is the number of words used for vectorization.^2 To compute $w_{j,i}$, we
can set it to 1 when the word $j$ exists in document $i$ and 0 when it does