Social Media Mining: An Introduction


Data Mining Essentials

Ordinal. Ordinal features lay data on an ordinal scale. In other words, the feature values have an intrinsic order to them. In our example, Money Spent is an ordinal feature because a High value for Money Spent is more than a Low one.

Interval. In interval features, in addition to their intrinsic ordering, differences are meaningful whereas ratios are meaningless. For interval features, addition and subtraction are allowed, whereas multiplication and division are not. Consider two time readings: 6:16 PM and 3:08 PM. The difference between these two time readings is meaningful (3 hours and 8 minutes); however, the ratio 6:16 PM / 3:08 PM ≠ 2 has no meaning.

Ratio. Ratio features, as the name suggests, add the additional properties of multiplication and division. An individual's income is an example of a ratio feature where not only differences and additions are meaningful but ratios also have meaning (e.g., an individual's income can be twice as much as John's income).
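The operations each feature type permits can be illustrated with a minimal Python sketch. The integer ranks for Money Spent, the calendar date attached to the two time readings, and the income figures are assumptions made only for illustration; the helper names `spent_more` and `ratio_is_defined` are hypothetical.

```python
from datetime import datetime

# Ordinal: only order comparisons are meaningful.
# The Low/High levels come from the text; the integer ranks are an assumed encoding.
MONEY_SPENT_RANK = {"Low": 0, "High": 1}

def spent_more(a, b):
    """Return True when ordinal value a ranks above ordinal value b."""
    return MONEY_SPENT_RANK[a] > MONEY_SPENT_RANK[b]

# Interval: differences are meaningful, ratios are not.
# The date is arbitrary; only the clock times come from the text.
later = datetime(2014, 1, 13, 18, 16)    # 6:16 PM
earlier = datetime(2014, 1, 13, 15, 8)   # 3:08 PM
gap = later - earlier                    # timedelta of 3 hours and 8 minutes

def ratio_is_defined(a, b):
    """Return False when the types of a and b do not support division."""
    try:
        a / b
    except TypeError:
        return False
    return True

# Ratio: a true zero exists, so division is meaningful.
john_income = 40_000      # hypothetical figure
other_income = 80_000     # hypothetical figure
income_ratio = other_income / john_income   # "twice as much as John's income"
```

Python's `datetime` type happens to mirror the interval scale: subtracting two readings yields a `timedelta`, while dividing one reading by another raises a `TypeError`, so `ratio_is_defined(later, earlier)` is `False`.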

In social media, individuals generate many types of nontabular data, such as text, voice, or video. These types of data are first converted to tabular data and then processed using data mining algorithms. For instance, voice can be converted to feature values using approximation techniques such as the fast Fourier transform (FFT) and then processed using data mining algorithms. To convert text into the tabular format, we can use a process denoted as vectorization. A variety of vectorization methods exist. A well-known method for vectorization is the vector-space model introduced by Salton, Wong, and Yang [1975].
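As a rough illustration of turning a signal into feature values, the sketch below computes the magnitude spectrum of a short sampled tone. It uses a naive discrete Fourier transform rather than an FFT (the FFT produces the same coefficients, only faster), and the synthetic 16-sample sine wave stands in for real voice data; the function name `dft_magnitudes` and the choice to keep the first half of the spectrum are assumptions for this example.

```python
import cmath
import math

def dft_magnitudes(samples):
    """Magnitude of each DFT coefficient of a real-valued signal (naive O(N^2))."""
    n = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(samples)))
        for k in range(n)
    ]

# Hypothetical "voice" signal: a pure tone with 3 cycles across 16 samples.
N_SAMPLES = 16
signal = [math.sin(2 * math.pi * 3 * t / N_SAMPLES) for t in range(N_SAMPLES)]

# Keep the first half of the spectrum as one row of tabular feature values.
features = dft_magnitudes(signal)[: N_SAMPLES // 2]

# Index of the strongest frequency component.
dominant_bin = max(range(len(features)), key=features.__getitem__)
```

Each recording processed this way yields a fixed-length list of numbers, which is the tabular form that data mining algorithms expect.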

Vector Space Model

In the vector space model, we are given a set of documents $D$. Each document is a set of words. The goal is to convert these textual documents to [feature] vectors. We can represent document $i$ with vector $d_i$,

$$d_i = (w_{1,i}, w_{2,i}, \ldots, w_{N,i}), \qquad (5.1)$$

where $w_{j,i}$ represents the weight for word $j$ that occurs in document $i$ and $N$ is the number of words used for vectorization. To compute $w_{j,i}$, we can set it to 1 when the word $j$ exists in document $i$ and 0 when it does
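The binary weighting just described (1 when word j exists in document i, 0 otherwise) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the two toy documents, the whitespace tokenization, the alphabetical ordering of the vocabulary, and the function names `build_vocabulary` and `binary_vector` are all assumptions.

```python
def build_vocabulary(documents):
    """Sorted list of the distinct words across all documents (the N words)."""
    return sorted({word for doc in documents for word in doc})

def binary_vector(document, vocabulary):
    """Vector d_i whose j-th entry is 1 if word j occurs in the document, else 0."""
    words = set(document)
    return [1 if word in words else 0 for word in vocabulary]

# Two toy documents, tokenized by whitespace (an assumed preprocessing step).
documents = [
    "social media mining".split(),
    "data mining essentials".split(),
]

vocabulary = build_vocabulary(documents)
vectors = [binary_vector(doc, vocabulary) for doc in documents]
# vocabulary: ['data', 'essentials', 'media', 'mining', 'social']
# vectors[0]: [0, 0, 1, 1, 1]
# vectors[1]: [1, 1, 0, 1, 0]
```

Every document maps to a vector of the same length N, so the collection can be stacked into a table with one row per document and one column per word.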
