Social Media Mining: An Introduction

(Axel Boer) #1

CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


114 Data Mining Essentials

[Figure: a learning algorithm learns a model from the training set (induction); the model is then applied to the test set (deduction).]

Figure 5.2. Supervised Learning.

this set are tuples in the format (x, y), where x is a vector and y is the class
attribute, commonly a scalar. Supervised learning builds a model that maps
x to y. Roughly, our task is to find a mapping m(.) such that m(x) = y. We
are also given an unlabeled dataset or test dataset, in which instances are in
the form (x, ?) and y values are unknown. Given m(.) learned from training
data and x of an unlabeled instance, we can compute m(x), the result of
which is the predicted label for the unlabeled instance.
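
As a minimal sketch of the mapping m(.) described above, the following Python code learns m from a handful of (x, y) tuples and applies it to unlabeled instances. The learner used here (predict the label of the closest training vector) is only one illustrative choice, and the function name `learn_nearest_neighbor` and the toy data are assumptions made for this example, not part of the text.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def learn_nearest_neighbor(training_set):
    """Return a mapping m(.) built from labeled (x, y) tuples."""
    examples = list(training_set)

    def m(x):
        # Predict the label y of the training vector closest to x.
        _, label = min(examples, key=lambda pair: dist(pair[0], x))
        return label

    return m


# Hypothetical toy data: x is a 2-d feature vector, y is a scalar class label.
training_set = [
    ((1.0, 1.0), "a"),
    ((1.2, 0.8), "a"),
    ((5.0, 5.0), "b"),
    ((4.8, 5.3), "b"),
]
m = learn_nearest_neighbor(training_set)

# Unlabeled instances of the form (x, ?): only x is known.
unlabeled = [(0.9, 1.1), (5.1, 4.9)]
predictions = [m(x) for x in unlabeled]
print(predictions)
```

Each call m(x) returns the predicted label for one unlabeled instance.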
Consider the task of detecting spam emails. A set of emails is given where
users have manually identified spam versus non-spam (training data). Our
task is to use a set of features such as words in the email (x) to identify
the spam/non-spam status (y) of unlabeled emails (test data). In this case,
y ∈ {spam, non-spam}.
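
The spam example can be sketched in Python as follows. This is a deliberately simple word-voting scheme, assumed here for illustration only; it is not one of the classifiers the chapter introduces, and the emails, `tokenize`, `train`, and `predict` are hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Turn an email body into its word features (x)."""
    return text.lower().split()


def train(labeled_emails):
    """Count, per class, how many training emails contain each word."""
    counts = {"spam": Counter(), "non-spam": Counter()}
    for text, label in labeled_emails:
        counts[label].update(set(tokenize(text)))
    return counts


def predict(counts, text):
    """Each word votes for the class it was seen in more often; ties go to non-spam."""
    words = set(tokenize(text))
    spam_votes = sum(1 for w in words if counts["spam"][w] > counts["non-spam"][w])
    ham_votes = sum(1 for w in words if counts["non-spam"][w] > counts["spam"][w])
    return "spam" if spam_votes > ham_votes else "non-spam"


# Hypothetical training data: emails manually labeled by users.
training_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting agenda for monday", "non-spam"),
    ("lunch on monday with the team", "non-spam"),
]
model = train(training_emails)

# Test data: emails whose status (y) is unknown.
print(predict(model, "claim your free prize"))
print(predict(model, "agenda for the team meeting"))
```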
Supervised learning can be divided into classification and regression.
When the class attribute is discrete, it is called classification; when the
class attribute is continuous, it is regression. We introduce classification
methods such as decision tree learning, naive Bayes classifier, k-nearest
neighbor classifier, and classification with network information and
regression methods such as linear regression and logistic regression. We also
introduce how supervised learning algorithms are evaluated. Before we
delve into supervised learning techniques, we briefly discuss the systematic
process of a supervised learning algorithm.
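
To make the contrast with classification concrete, here is a minimal regression sketch in which the class attribute is continuous. It fits a line by ordinary least squares with one feature; the helper name `fit_line` and the data points are assumptions made for this example.

```python
def fit_line(points):
    """Ordinary least squares for y = slope * x + intercept on (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Covariance-like and variance-like sums (the common 1/n factor cancels).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The learned model maps a feature value to a continuous prediction.
    return lambda x: slope * x + intercept


# Hypothetical training data with a continuous class attribute y.
training_set = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0), (4.0, 9.0)]
model = fit_line(training_set)
prediction = model(5.0)
print(prediction)
```

Here the prediction is a real number rather than one of a fixed set of labels, which is what distinguishes regression from classification.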
This process is depicted in Figure 5.2. It starts with a training set (i.e.,
labeled data) where both features and labels (class attribute values) are
known. A supervised learning algorithm is run on the training set in a
process known as induction. In the induction process, the model is generated.
The model maps the feature values to the class attribute values. The model
is used on a test set in which the class attribute value is unknown to predict
these unknown class attribute values (deduction process).
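
The two phases can be sketched as two separate functions. The function names `induction` and `deduction` mirror the terms in the text, while the baseline learner (always predict the most frequent label) and the data are hypothetical stand-ins for any supervised learning algorithm.

```python
from collections import Counter


def induction(learning_algorithm, training_set):
    """Run a learning algorithm on labeled data and return the learned model."""
    return learning_algorithm(training_set)


def deduction(model, test_set):
    """Apply a learned model to instances whose class attribute is unknown."""
    return [model(x) for x in test_set]


def majority_class_learner(training_set):
    """Hypothetical baseline learner: always predict the most frequent label."""
    most_common_label = Counter(y for _, y in training_set).most_common(1)[0][0]
    return lambda x: most_common_label


# Training set: features and labels are both known.
training_set = [((0, 1), "yes"), ((1, 1), "yes"), ((1, 0), "no")]
# Test set: only the features are known.
test_set = [(0, 0), (1, 1)]

model = induction(majority_class_learner, training_set)
predicted = deduction(model, test_set)
print(predicted)
```

Keeping the two steps separate means any learner that returns a callable model can be swapped in without changing the deduction step.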