CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23
110 Data Mining Essentials
The tf values are as follows:

      social  media  mining  data  financial  market
d1    1       1      1       0     0          0
d2    1       1      0       1     0          0
d3    0       0      0       1     1          1
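The idf values are

$$\mathit{idf}_{\text{social}} = \log_2(3/2) = 0.584 \tag{5.9}$$
$$\mathit{idf}_{\text{media}} = \log_2(3/2) = 0.584 \tag{5.10}$$
$$\mathit{idf}_{\text{mining}} = \log_2(3/1) = 1.584 \tag{5.11}$$
$$\mathit{idf}_{\text{data}} = \log_2(3/2) = 0.584 \tag{5.12}$$
$$\mathit{idf}_{\text{financial}} = \log_2(3/1) = 1.584 \tag{5.13}$$
$$\mathit{idf}_{\text{market}} = \log_2(3/1) = 1.584. \tag{5.14}$$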
The TF-IDF values can be computed by multiplying tf values with the
idf values:

      social  media  mining  data   financial  market
d1    0.584   0.584  1.584   0      0          0
d2    0.584   0.584  0       0.584  0          0
d3    0       0      0       0.584  1.584      1.584
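
The computation above can be reproduced with a few lines of Python. This is a minimal sketch, not a reference implementation: the names `terms`, `tf`, `idf_values`, and `tfidf_table` are illustrative, the counts are copied from the tf table, and the code assumes every term occurs in at least one document (otherwise the idf division would fail).

```python
import math

# Vocabulary and raw term counts, copied from the tf table in the text.
terms = ["social", "media", "mining", "data", "financial", "market"]
tf = {
    "d1": [1, 1, 1, 0, 0, 0],
    "d2": [1, 1, 0, 1, 0, 0],
    "d3": [0, 0, 0, 1, 1, 1],
}

def idf_values(counts):
    """Return log2(N / df_j) for each term j, where N is the number of
    documents and df_j is the number of documents containing term j.
    Assumes df_j >= 1 for every term (hypothetical helper)."""
    n_docs = len(counts)
    n_terms = len(next(iter(counts.values())))
    idf = []
    for j in range(n_terms):
        df = sum(1 for row in counts.values() if row[j] > 0)
        idf.append(math.log2(n_docs / df))
    return idf

def tfidf_table(counts):
    """Multiply each tf entry by the idf weight of its term."""
    idf = idf_values(counts)
    return {doc: [c * w for c, w in zip(row, idf)]
            for doc, row in counts.items()}

idf = idf_values(tf)
tfidf = tfidf_table(tf)

for term, weight in zip(terms, idf):
    print(f"idf[{term}] = {weight:.3f}")
for doc, row in tfidf.items():
    print(doc, [f"{v:.3f}" for v in row])
```

Because log2(3/2) is approximately 0.58496, the values printed with three rounded decimals end in 5 (0.585 and 1.585), whereas the tables in the text truncate them to 0.584 and 1.584.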
Once documents have been converted to vectors, common data mining
algorithms can be applied. Before that can occur, however, the
quality of the data needs to be verified.
5.1.1 Data Quality
When preparing data for use in data mining algorithms, the following four
data quality aspects need to be verified:
- Noise is the distortion of the data. This distortion needs to be removed
or its adverse effect alleviated before running data mining algorithms
because it may degrade the performance of the algorithms.
Many filtering algorithms are effective in combating noise effects.
- Outliers are instances that are considerably different from other
instances in the dataset. Consider an experiment that measures the
average number of followers of users on Twitter. A celebrity with
many followers can easily distort the average number of followers per