Social Media Mining: An Introduction


CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


Data Mining Essentials

The tf values are as follows:

      social  media  mining  data  financial  market
d1    1       1      1       0     0          0
d2    1       1      0       1     0          0
d3    0       0      0       1     1          1

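The idf values are

\begin{align}
\mathit{idf}_{\text{social}}    &= \log_2(3/2) = 0.584, \tag{5.9}\\
\mathit{idf}_{\text{media}}     &= \log_2(3/2) = 0.584, \tag{5.10}\\
\mathit{idf}_{\text{mining}}    &= \log_2(3/1) = 1.584, \tag{5.11}\\
\mathit{idf}_{\text{data}}      &= \log_2(3/2) = 0.584, \tag{5.12}\\
\mathit{idf}_{\text{financial}} &= \log_2(3/1) = 1.584, \tag{5.13}\\
\mathit{idf}_{\text{market}}    &= \log_2(3/1) = 1.584. \tag{5.14}
\end{align}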

The TF-IDF values can be computed by multiplying the tf values by the
idf values:

      social  media  mining  data   financial  market
d1    0.584   0.584  1.584   0      0          0
d2    0.584   0.584  0       0.584  0          0
d3    0       0      0       0.584  1.584      1.584
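
The computation above can be reproduced with a short script. The following is a minimal sketch rather than a reference implementation: the word lists in docs are reconstructed from the tf table (the word order within each document is an assumption), and the names docs, vocab, and tf_idf are illustrative only. It uses the raw term count as tf and log base 2 of (number of documents / number of documents containing the word) as idf, matching the values shown.

```python
import math

# Documents reconstructed from the tf table; the word order inside
# each document is an assumption and does not affect the result.
docs = {
    "d1": ["social", "media", "mining"],
    "d2": ["social", "media", "data"],
    "d3": ["financial", "market", "data"],
}
vocab = ["social", "media", "mining", "data", "financial", "market"]


def tf_idf(docs, vocab):
    """Return {document name: list of TF-IDF weights, ordered as vocab}."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each word.
    df = {w: sum(w in words for words in docs.values()) for w in vocab}
    # idf = log2(N / df); every word in vocab occurs at least once here,
    # so df is never zero in this example.
    idf = {w: math.log2(n_docs / df[w]) for w in vocab}
    # tf is the raw count of the word in the document.
    return {
        name: [words.count(w) * idf[w] for w in vocab]
        for name, words in docs.items()
    }


weights = tf_idf(docs, vocab)
for name, row in weights.items():
    print(name, [round(x, 3) for x in row])
```

The script prints the nonzero weights rounded to three decimals (0.585 and 1.585), whereas the table lists them truncated to 0.584 and 1.584; the underlying values are the same.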

After vectorization, each document is represented as a vector, and common
data mining algorithms can be applied. Before that can occur, however, the
quality of the data needs to be verified.

5.1.1 Data Quality
When preparing data for use in data mining algorithms, the following four
data quality aspects need to be verified:


  1. Noise is the distortion of the data. This distortion needs to be removed
    or its adverse effect alleviated before running data mining algorithms
    because it may degrade the performance of the algorithms.
    Many filtering algorithms are effective in combating noise effects.

  2. Outliers are instances that are considerably different from other
    instances in the dataset. Consider an experiment that measures the
    average number of followers of users on Twitter. A celebrity with
    many followers can easily distort the average number of followers per
