Social Media Mining: An Introduction


CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


132 Data Mining Essentials

the average distance value between instances in different clusters. In a well-clustered dataset, the average distance between instances in the same cluster is small (cohesiveness), and the average distance between instances in different clusters is large (separateness). Let a(x) denote the average distance between instance x of cluster C and all other members of C:

\[
a(x) = \frac{1}{|C| - 1} \sum_{y \in C,\, y \neq x} \|x - y\|^2. \tag{5.60}
\]

Let G ≠ C denote the cluster that is closest to x in terms of the average distance between x and members of G. Let b(x) denote the average distance between instance x and instances in cluster G:

\[
b(x) = \min_{G \neq C} \frac{1}{|G|} \sum_{y \in G} \|x - y\|^2. \tag{5.61}
\]

Since we want the distance between instances in the same cluster to be smaller than the distance between instances in different clusters, we are interested in a(x) < b(x). The silhouette clustering index is formulated as

\[
s(x) = \frac{b(x) - a(x)}{\max(b(x), a(x))}, \tag{5.62}
\]

\[
\text{silhouette} = \frac{1}{n} \sum_{x} s(x). \tag{5.63}
\]

The silhouette index takes values in [−1, 1]. The best clustering happens when a(x) ≪ b(x) for all x; in this case, silhouette ≈ 1. Similarly, silhouette < 0 indicates that many instances are closer to other clusters than to their assigned cluster, which shows low-quality clustering.
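The definitions in Equations 5.60 to 5.63 can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `silhouette_values` is hypothetical, NumPy is assumed to be available, and the handling of single-member clusters (where Equation 5.60 would divide by zero) is an assumption that sets a(x) = 0. At least two clusters are required, since b(x) is a minimum over the other clusters.

```python
import numpy as np

def silhouette_values(X, labels):
    """Return per-instance a(x), b(x), s(x) using squared Euclidean distance.

    Hypothetical helper following Eqs. 5.60-5.62; requires >= 2 clusters.
    """
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]          # treat 1-D input as n instances with one feature
    labels = np.asarray(labels)

    # Pairwise squared distances ||x - y||^2 between all instances.
    diff = X[:, None, :] - X[None, :, :]
    d2 = (diff ** 2).sum(axis=-1)

    clusters = np.unique(labels)
    n = len(X)
    a = np.zeros(n)
    b = np.zeros(n)
    for i in range(n):
        own = labels == labels[i]
        n_own = own.sum()
        # Eq. 5.60: d2[i, i] is 0, so summing over the whole cluster and
        # dividing by |C| - 1 averages over the *other* members only.
        # Assumption: a(x) = 0 for a single-member cluster.
        a[i] = d2[i, own].sum() / (n_own - 1) if n_own > 1 else 0.0
        # Eq. 5.61: smallest average distance to any other cluster.
        b[i] = min(d2[i, labels == g].mean() for g in clusters if g != labels[i])

    # Eq. 5.62, guarding against max(a, b) == 0.
    denom = np.maximum(a, b)
    safe = np.where(denom > 0, denom, 1.0)
    s = np.where(denom > 0, (b - a) / safe, 0.0)
    return a, b, s

# Toy data (made up for illustration): two tight, well-separated clusters.
X = [[0, 0], [0, 1], [10, 0], [10, 1]]
labels = [0, 0, 1, 1]
a, b, s = silhouette_values(X, labels)
silhouette = s.mean()           # Eq. 5.63
```

On this toy data every instance has a(x) = 1 and b(x) = 100.5, so the silhouette index is close to 1, matching the case a(x) ≪ b(x) described above.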

Example 5.9. In Figure 5.8, the a(.), b(.), and s(.) values are

\[
a(x^1_1) = |-10 - (-5)|^2 = 25 \tag{5.64}
\]

\[
b(x^1_1) = \frac{1}{2}\left(|-10 - 5|^2 + |-10 - 10|^2\right) = 312.5 \tag{5.65}
\]

\[
s(x^1_1) = \frac{312.5 - 25}{312.5} = 0.92 \tag{5.66}
\]

\[
a(x^1_2) = |-5 - (-10)|^2 = 25 \tag{5.67}
\]

\[
b(x^1_2) = \frac{1}{2}\left(|-5 - 5|^2 + |-5 - 10|^2\right) = 162.5 \tag{5.68}
\]

\[
s(x^1_2) = \frac{162.5 - 25}{162.5} = 0.84 \tag{5.69}
\]
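The arithmetic in Equations 5.64 to 5.69 can be re-derived with a short script. The coordinates below are an assumption: Figure 5.8 is not reproduced here, so the points are inferred from the numbers inside the equations (first cluster at −10 and −5, second cluster at 5 and 10 on a line). The helper names are hypothetical.

```python
# Assumed 1-D coordinates, inferred from the arithmetic in Eqs. 5.64-5.69.
cluster_1 = [-10.0, -5.0]
cluster_2 = [5.0, 10.0]

def sq_dist(x, y):
    """Squared distance ||x - y||^2 for scalars."""
    return (x - y) ** 2

def a_value(i, own):
    """Eq. 5.60: average squared distance from own[i] to the other members."""
    x = own[i]
    total = sum(sq_dist(x, y) for j, y in enumerate(own) if j != i)
    return total / (len(own) - 1)

def b_value(x, other):
    """Eq. 5.61 with a single other cluster, so the minimum is trivial."""
    return sum(sq_dist(x, y) for y in other) / len(other)

def s_value(a, b):
    """Eq. 5.62."""
    return (b - a) / max(a, b)

# Collect (a, b, s) for each instance of the first cluster.
results = []
for i, x in enumerate(cluster_1):
    a = a_value(i, cluster_1)
    b = b_value(x, cluster_2)
    results.append((a, b, s_value(a, b)))

for x, (a, b, s) in zip(cluster_1, results):
    print(f"x={x}: a={a}, b={b}, s={s:.3f}")
```

Under these assumed coordinates the script reproduces a = 25 for both instances and b = 312.5 and 162.5. The second silhouette value evaluates to about 0.846, which the example reports as 0.84.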
