6.3 Community Evaluation

For instance, in Figure 6.15, the majority in Community 1 is ×; therefore, we
assume majority label × for that community. The purity is then defined as
the fraction of instances that have labels equal to their community's majority
label. Formally,

\text{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_{j} |C_i \cap L_j|, \qquad (6.43)

where k is the number of communities, N is the total number of nodes,
L_j is the set of instances with label j in all communities, and C_i is the
set of members of community i. In the case of our example, purity is
(6 + 5 + 4)/20 = 0.75.
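As a quick illustration, the following is a minimal Python sketch of Equation 6.43. The function name `purity` and the label lists are our own; the data are hypothetical and chosen only so that the majority counts are 6, 5, and 4 out of 20 nodes, reproducing the worked example above rather than the exact contents of Figure 6.15.

```python
from collections import Counter

def purity(communities):
    """Compute purity (Equation 6.43). Each community is given as a list
    of the ground-truth labels of its member nodes."""
    n = sum(len(c) for c in communities)                      # total nodes N
    # For each community, count how many members carry its majority label,
    # then sum over communities and normalize by N.
    majority_total = sum(max(Counter(c).values()) for c in communities)
    return majority_total / n

# Hypothetical label assignments (20 nodes in 3 communities).
community_1 = ['x'] * 6 + ['o'] * 2   # majority label 'x': 6 members
community_2 = ['o'] * 5 + ['x'] * 2   # majority label 'o': 5 members
community_3 = ['t'] * 4 + ['x'] * 1   # majority label 't': 4 members

print(purity([community_1, community_2, community_3]))        # 0.75
```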

Normalized Mutual Information

Purity can be easily manipulated to generate high values; consider the case
where every node forms a singleton community (of size 1) or where we have very
large, pure communities (ground truth = majority label). In both cases,
purity attains high values even though it tells us little about the quality of
the clustering.
A more precise measure that addresses the problems associated with purity is
normalized mutual information (NMI), which originates in information
theory. Mutual information (MI) describes the amount of information
that two random variables share. In other words, MI measures how much
uncertainty about one variable is reduced by knowing the other. Consider the
case of two independent variables; in this case, the mutual information is
zero, because knowing one does not provide any information about the other.
The mutual information of two variables X and Y is denoted I(X, Y). We can
use mutual information to measure the information one clustering carries
regarding the ground truth. It can be calculated using Equation 6.44, where
L and H are the sets of labels and found communities, respectively; n_h and
n_l are the numbers of data points in community h and with label l,
respectively; n_{h,l} is the number of nodes in community h with label l;
and n is the total number of nodes.

\text{MI} = I(X, Y) = \sum_{h \in H} \sum_{l \in L} \frac{n_{h,l}}{n} \log \frac{n \cdot n_{h,l}}{n_h\, n_l} \qquad (6.44)
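The following Python sketch computes Equation 6.44 directly from two parallel lists: the found community of each node and its ground-truth label. The function name `mutual_information` and the toy data are illustrative assumptions, and the natural logarithm is used (the choice of base only rescales MI).

```python
import math
from collections import Counter

def mutual_information(communities, labels):
    """Mutual information I(X, Y) between a community assignment and
    ground-truth labels (Equation 6.44). communities[i] is the community
    of node i, and labels[i] is its label."""
    n = len(labels)
    n_h = Counter(communities)                   # n_h: nodes per community h
    n_l = Counter(labels)                        # n_l: nodes per label l
    n_hl = Counter(zip(communities, labels))     # n_{h,l}: nodes in h with label l
    mi = 0.0
    for (h, l), count in n_hl.items():
        mi += (count / n) * math.log(n * count / (n_h[h] * n_l[l]))
    return mi

# Hypothetical assignment for 6 nodes (illustration only, not Figure 6.15).
found = ['c1', 'c1', 'c1', 'c2', 'c2', 'c2']
truth = ['x',  'x',  'o',  'o',  'o',  'x']
print(mutual_information(found, truth))
```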


Unfortunately, mutual information is unbounded, whereas it is more convenient
for an evaluation measure to take values in the range [0, 1]. To address this
issue, we can normalize the mutual information.