Social Media Mining: An Introduction

P1: Sqe Trim: 6.125in×9.25in Top: 0.5in Gutter: 0.75in
CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23

130 Data Mining Essentials

on the initialkcentroids. This problem can be mitigated by runningk-means multiple times and selecting the clustering assignment that is observed most often or is more desirable based on an objective function, such as the squared error. Sincek-means assumes that instances that belong to the same cluster are the ones that found the cluster’s centroid closer than other centroids in the dataset, one can safely assume that all the instances inside a cluster fall into a hyper-sphere, with the centroid being its center. The radius for this hyper-sphere is defined based on the farthest instance inside this cluster. If, when clusters that need to be extracted are nonspherical (globular),k- means has problems detecting them. This problem can be addressed by a preprocessing step in which a transformation is performed on the dataset to solve this issue.

5.5.2 Unsupervised Learning Evaluation When clusters are found, there is a need to evaluate how accurately the task has been performed. When ground truth is available, we have prior knowledge of which instances should belong to which cluster, as discussed in Chapter 6 in detail. However, evaluating clustering is a challenge because ground truth is often not available. When ground truth is unavailable, we incorporate techniques that analyze the clusters found and describe the quality of clusters found. In particular, we can use techniques that measure cohesivenessorseparatenessof clusters.

Cohesiveness

In evaluating clustering, we are interested in clusters that exhibitcohesive- ness. In cohesive clusters, instances inside the clusters are close to each other. In statistical terms, this is equivalent to having a small standard devi- ation (i.e., being close to the mean value). In clustering, this translates to being close to the centroid of the cluster. So cohesiveness is defined as the distance from instances to the centroid of their respective clusters,

cohesiveness=

∑k

i= 1

∑n(i)

j= 1

||xij−ci||

2 , (5.56)

which is the squared distance error (also known as SSE) discussed pre- viously. Small values of cohesiveness denote highly cohesive clusters in which all instances are close to the centroid of the cluster. Example 5.7. Figure 5.8 shows a dataset of four one-dimensional instances. The instances are clustered into two clusters. Instances in cluster

Social Media Mining: An Introduction

Get our desktop app

Company

Features

Documentation

Resources