Social Media Mining: An Introduction




134 Data Mining Essentials

To evaluate supervised learning, a training-testing framework is used
in which the labeled dataset is partitioned into two parts, one for training
and the other for testing. Different approaches for evaluating supervised
learning, such as leave-one-out or k-fold cross validation, were discussed.
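
The partitioning behind k-fold cross validation can be sketched as follows. This is a minimal illustration, not a definitive implementation: the function names `k_fold_splits` and `leave_one_out_splits` are hypothetical, the folds are taken in the order the instances appear (in practice the instances are usually shuffled first), and leave-one-out is treated as the special case where k equals the number of instances.

```python
def k_fold_splits(n_instances, k):
    """Yield (train_indices, test_indices) pairs, one per fold.

    Hypothetical helper: instances are identified by their position
    0..n_instances-1 and are NOT shuffled here.
    """
    if not 2 <= k <= n_instances:
        raise ValueError("k must be between 2 and the number of instances")
    indices = list(range(n_instances))
    # Spread instances as evenly as possible: the first (n % k) folds
    # receive one extra instance.
    base, extra = divmod(n_instances, k)
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = indices[start:start + size]
        # Everything outside the current fold is used for training.
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def leave_one_out_splits(n_instances):
    """Leave-one-out as k-fold with k equal to the number of instances."""
    return k_fold_splits(n_instances, n_instances)
```

Each labeled instance appears in exactly one test fold, so averaging the per-fold test performance uses every instance for testing once and for training k - 1 times.
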
Any clustering algorithm requires the selection of a distance measure.
We discussed partitional clustering algorithms, k-means in particular,
as well as methods of evaluating clustering algorithms. To
evaluate clustering algorithms, one can use clustering quality measures such
as cohesiveness, which measures how close instances are inside clusters, or
separateness, which measures how separate different clusters are from one
another. The silhouette index combines cohesiveness and separateness into
one measure.
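
To make the combination of cohesiveness and separateness concrete, the sketch below computes a silhouette index in one common formulation, which may differ in detail from the definition used earlier in the chapter. The function name `silhouette_index`, the use of Euclidean distance, and the convention of scoring singleton clusters as 0 are assumptions made for illustration.

```python
import math


def silhouette_index(points, labels):
    """Mean silhouette value over all instances (Euclidean distance).

    Hypothetical helper: `points` is a list of coordinate tuples and
    `labels` gives the cluster assignment of each point.
    """
    clusters = {}
    for point, label in zip(points, labels):
        clusters.setdefault(label, []).append(point)
    if len(clusters) < 2:
        raise ValueError("the silhouette index needs at least two clusters")

    def mean_distance(point, members):
        return sum(math.dist(point, m) for m in members) / len(members)

    scores = []
    for point, label in zip(points, labels):
        own = clusters[label]
        if len(own) == 1:
            scores.append(0.0)  # assumed convention for singleton clusters
            continue
        # a: cohesiveness term, the average distance to the other members
        # of the point's own cluster (its distance to itself is 0).
        a = sum(math.dist(point, m) for m in own) / (len(own) - 1)
        # b: separateness term, the smallest average distance to the
        # members of any other cluster.
        b = min(
            mean_distance(point, members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)
```

Values lie between -1 and 1. For example, with the points (0, 0), (0, 1), (10, 0), (10, 1), the labels [0, 0, 1, 1] give a high positive value, whereas the labels [0, 1, 0, 1], which mix the two groups, give a negative one.
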

5.7 Bibliographic Notes
A general review of data mining algorithms can be found in the machine
learning and pattern recognition [Bishop, 2006; Duda, Hart, and Stork,
2012; Mitchell, 1997; Quinlan, 1986, 1993; Langley, 1995], data mining
[Friedman et al., 2009; Han et al., 2006; Witten et al., 2011; Tan et al.,
2005], and pattern recognition [Bishop, 1995; Richard et al., 2001]
literature.
Among preprocessing techniques, feature selection and feature extraction
have gained much attention due to their importance. General references
for feature selection and extraction can be found in [Liu and Motoda, 1998;
Dash and Liu, 1997, 2000; Guyon, 2006; Zhao and Liu, 2011; Liu and Yu,
2005]. Feature selection for social media data has also been discussed
in [Tang and Liu, 2012a,b, 2013]. Although not much
research is dedicated to sampling in social media, it plays an important
role in the experimental outcomes of social media research. Most experiments
are performed using sampled social media data, and it is important
for these samples to be representative of the site under
study. For instance, Morstatter et al. [2013] studied whether Twitter’s heavily
sampled Streaming API, a free service for social media data, accurately
portrays the true activity on Twitter. They show that the bias introduced by
the Streaming API is significant.
In addition to the data mining categories covered in this chapter, there
are other important categories in the area of data mining and machine
learning. In particular, an interesting category is semi-supervised learning.
In semi-supervised learning, the label is available for some instances,