on the initial k centroids. This problem can be mitigated by running k-means
multiple times and selecting the clustering assignment that is observed most
often or is more desirable based on an objective function, such as the squared
error. Since k-means assumes that the instances belonging to a cluster are
those that are closer to its centroid than to any other centroid in the dataset,
one can safely assume that all the instances inside a cluster fall within a
hypersphere whose center is the centroid. The radius of this hypersphere is
determined by the farthest instance inside the cluster.
When the clusters that need to be extracted are not spherical (globular), k-means
has problems detecting them. This problem can be addressed by a
preprocessing step in which a suitable transformation is applied to the dataset.
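As a sketch of the multiple-run mitigation described above (the use of scikit-learn's KMeans and the helper name best_of_k_means are illustrative assumptions, not prescribed by the text), one can run k-means with several random initializations and keep the assignment with the smallest squared error (SSE):

    import numpy as np
    from sklearn.cluster import KMeans

    def best_of_k_means(X, k, n_runs=10):
        """Run k-means n_runs times and keep the run with the lowest SSE."""
        best_sse, best_model = np.inf, None
        for seed in range(n_runs):
            # n_init=1 so each call is a single random initialization
            model = KMeans(n_clusters=k, n_init=1, random_state=seed).fit(X)
            if model.inertia_ < best_sse:  # inertia_ is the SSE of this run
                best_sse, best_model = model.inertia_, model
        return best_model

    # Toy usage: six two-dimensional instances, two clusters
    X = np.array([[1.0, 2.0], [1.5, 1.8], [1.0, 0.6],
                  [8.0, 8.0], [9.0, 11.0], [8.5, 9.0]])
    print(best_of_k_means(X, k=2).labels_)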
5.5.2 Unsupervised Learning Evaluation
When clusters are found, there is a need to evaluate how well the clustering
task has been performed. When ground truth is available, we have prior
knowledge of which instances should belong to which cluster, as discussed
in detail in Chapter 6. However, evaluating clustering is a challenge because
ground truth is often not available. When ground truth is unavailable,
we employ techniques that analyze the clusters found and describe their
quality. In particular, we can use techniques that measure the
cohesiveness or separateness of clusters.
Cohesiveness
In evaluating clustering, we are interested in clusters that exhibit cohesive-
ness. In cohesive clusters, instances inside the clusters are close to each
other. In statistical terms, this is equivalent to having a small standard devi-
ation (i.e., being close to the mean value). In clustering, this translates to
being close to the centroid of the cluster. So cohesiveness is defined in terms
of the squared distances from instances to the centroids of their respective clusters,
cohesiveness = \sum_{i=1}^{k} \sum_{j=1}^{n(i)} \| x^i_j - c_i \|^2,   (5.56)
which is the squared distance error (also known as SSE) discussed pre-
viously. Small values of cohesiveness denote highly cohesive clusters in
which all instances are close to the centroid of the cluster.
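For concreteness, the following minimal sketch computes Equation 5.56 given the instances, their cluster labels, and the centroids (the function name and the toy data are illustrative assumptions, not taken from the text):

    import numpy as np

    def cohesiveness(X, labels, centroids):
        """Sum of squared distances from each instance to its cluster centroid (Eq. 5.56)."""
        diffs = X - centroids[labels]   # x^i_j - c_i for every instance
        return float(np.sum(diffs ** 2))

    # Toy one-dimensional data (not the instances of Figure 5.8): two clusters
    X = np.array([[1.0], [2.0], [5.0], [6.0]])
    labels = np.array([0, 0, 1, 1])
    centroids = np.array([[1.5], [5.5]])
    print(cohesiveness(X, labels, centroids))  # 4 * 0.25 = 1.0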
Example 5.7. Figure 5.8 shows a dataset of four one-dimensional
instances. The instances are clustered into two clusters. Instances in cluster