Social Media Mining: An Introduction

5.1 Data 109

not. We can also set it to the number of times word $j$ is observed in document $i$. A more generalized approach is to use the term frequency-inverse document frequency (TF-IDF) weighting scheme. In the TF-IDF scheme, $w_{j,i}$ is calculated as

$$w_{j,i} = tf_{j,i} \times idf_j, \qquad (5.2)$$

where $tf_{j,i}$ is the frequency of word $j$ in document $i$, and $idf_j$ is the inverse frequency of word $j$ across all documents,

$$idf_j = \log_2 \frac{|D|}{|\{\text{document} \in D \mid j \in \text{document}\}|}, \qquad (5.3)$$


which is the logarithm of the total number of documents divided by the
number of documents that contain word $j$. TF-IDF assigns higher weights
to words that are less frequent across documents and, at the same time, have
higher frequencies within the documents in which they are used. This guarantees that
words with high TF-IDF values can be used as representative examples of
the documents they belong to and also that stop words, such as "the," which
are common in all documents, are assigned smaller weights.

Example 5.1. Consider the words "apple" and "orange" that appear 10
and 20 times in document $d_1$. Let $|D| = 20$ and assume the word "apple"
only appears in document $d_1$ and the word "orange" appears in all 20
documents. Then, the TF-IDF values for "apple" and "orange" in document
$d_1$ are

$$tf\text{-}idf(\text{``apple''}, d_1) = 10 \times \log_2 \frac{20}{1} = 43.22, \qquad (5.4)$$

$$tf\text{-}idf(\text{``orange''}, d_1) = 20 \times \log_2 \frac{20}{20} = 0. \qquad (5.5)$$
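The arithmetic in Example 5.1 can be reproduced with a minimal sketch of Equation (5.2); the function name `tf_idf` and its parameters are illustrative, not from the text:

```python
import math

def tf_idf(tf, num_docs, docs_containing):
    """TF-IDF weight per Equations (5.2)-(5.3): tf * log2(|D| / df)."""
    return tf * math.log2(num_docs / docs_containing)

# "apple": 10 occurrences in d1, found in 1 of 20 documents
print(round(tf_idf(10, 20, 1), 2))   # 43.22, matching Equation (5.4)

# "orange": 20 occurrences in d1, found in all 20 documents
print(tf_idf(20, 20, 20))            # 0.0, matching Equation (5.5)
```

As the second call shows, a word that appears in every document gets weight zero regardless of how often it occurs, which is exactly how TF-IDF suppresses stop words.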


Example 5.2. Consider the following three documents:

$$d_1 = \text{``social media mining''} \qquad (5.6)$$
$$d_2 = \text{``social media data''} \qquad (5.7)$$
$$d_3 = \text{``financial market data''} \qquad (5.8)$$
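The three documents above can be turned into TF-IDF vectors with a short sketch, assuming simple whitespace tokenization; the variable names are illustrative:

```python
import math

docs = {
    "d1": "social media mining",
    "d2": "social media data",
    "d3": "financial market data",
}

# Tokenize by whitespace and collect the vocabulary
tokenized = {name: text.split() for name, text in docs.items()}
vocab = sorted({w for words in tokenized.values() for w in words})

# Document frequency: number of documents containing each word
df = {w: sum(w in words for words in tokenized.values()) for w in vocab}
N = len(docs)

# w_{j,i} = tf_{j,i} * log2(|D| / df_j), per Equations (5.2)-(5.3)
weights = {
    name: {w: words.count(w) * math.log2(N / df[w]) for w in vocab}
    for name, words in tokenized.items()
}

for name, vec in weights.items():
    print(name, {w: round(x, 3) for w, x in vec.items()})
```

Words shared by two of the three documents ("social," "media," "data") receive weight $\log_2(3/2) \approx 0.585$, while words unique to one document ("mining," "financial," "market") receive $\log_2 3 \approx 1.585$ in that document and zero elsewhere.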