Social Media Mining: An Introduction


CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


Data Mining Essentials

[Figure: Data -> Selection -> Target Data -> Preprocessing -> Preprocessed Data -> Transformation -> Transformed Data -> Data Mining -> Patterns -> Interpretation or Evaluation -> Knowledge]

Figure 5.1. Knowledge Discovery in Databases (KDD) process.

To analyze social media, we can either collect this raw data or use
available repositories that host collected data from social media sites.^1
When collecting data, we can either use APIs provided by social media sites
for data collection or scrape the information from those sites. In either case,
these sites are often networks of individuals, on which one can run graph
traversal algorithms to collect information. In other words, we
can start collecting information from a subset of nodes on a social network,
subsequently collect information from their neighbors, and so on. The data
collected this way needs to be represented in a unified format for analysis.
For instance, consider a set of tweets in which we are looking for common
patterns. To find patterns in these tweets, we first need to represent them
in a consistent data format. In the next section, we discuss data, its
representation, and its types.
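
The neighbor-by-neighbor collection described above can be sketched as a breadth-first traversal. This is only a minimal illustration under stated assumptions: the toy network, the get_neighbors stand-in for a site's API or scraper, and the record fields are all hypothetical and do not come from any real service.

```python
from collections import deque

# Hypothetical toy network standing in for a real site's API or scraped pages.
TOY_NETWORK = {
    "alice": ["bob", "carol"],
    "bob": ["alice", "dave"],
    "carol": ["alice", "dave"],
    "dave": ["bob", "carol", "erin"],
    "erin": ["dave"],
}


def get_neighbors(user):
    """Stand-in (assumption) for an API call or scraper returning a user's connections."""
    return TOY_NETWORK.get(user, [])


def crawl_bfs(seeds, fetch_neighbors, max_nodes=100):
    """Breadth-first crawl from seed nodes; every record uses the same fields."""
    visited = set(seeds)
    queue = deque(seeds)
    records = []
    while queue and len(records) < max_nodes:
        user = queue.popleft()
        neighbors = fetch_neighbors(user)
        # Unified format: one dictionary per collected node.
        records.append({"user": user, "num_neighbors": len(neighbors)})
        for neighbor in neighbors:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append(neighbor)
    return records


records = crawl_bfs(["alice"], get_neighbors)
print([r["user"] for r in records])
```

The max_nodes cap is a design choice for the sketch: real networks are far too large to traverse exhaustively, so a crawl usually needs some stopping rule.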

5.1 Data
In the KDD process, data is represented in a tabular format. Consider the
example of predicting whether an individual who visits an online book
seller is going to buy a specific book. This prediction can be performed
by analyzing the individual’s interests and previous purchase history. For
instance, when John has spent a lot of money on the site, has bought similar
books, and visits the site frequently, John is likely to buy that specific
book. John is an example of an instance. Instances are also called points,
data points, or observations. A dataset consists of one or more instances:
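
As a rough sketch of this tabular view, each instance can be held as a record and the dataset as a list of such records. The feature names, their values, and the second customer are hypothetical, chosen only to mirror the bookseller example.

```python
# Hypothetical feature names and values, chosen only for illustration.
FEATURES = ["money_spent", "bought_similar", "visits"]

# Each dictionary is one instance (data point); the list is the dataset.
dataset = [
    {"name": "John", "money_spent": "High", "bought_similar": True, "visits": "Frequently"},
    {"name": "Mary", "money_spent": "Low", "bought_similar": False, "visits": "Rarely"},
]


def to_rows(instances, features):
    """Flatten instances into table rows with one column per feature."""
    return [[instance[f] for f in features] for instance in instances]


rows = to_rows(dataset, FEATURES)
for instance, row in zip(dataset, rows):
    print(instance["name"], row)
```

Keeping the feature list separate from the instances fixes the column order, so every row of the table lines up with the same features.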