5.5 Unsupervised Learning

Algorithm 5.2 k-Means Algorithm
Require: A Dataset of Real-Valued Attributes, k (number of Clusters)
1: return A Clustering of Data into k Clusters
2: Consider k random instances in the data space as the initial cluster centroids.
3: while centroids have not converged do
4:   Assign each instance to the cluster that has the closest cluster centroid.
5:   If all instances have been assigned, then recalculate the cluster centroids by averaging the instances inside each cluster.
6: end while
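The following Python sketch mirrors the steps of Algorithm 5.2. It is only an illustration of the procedure: the function name k_means, the NumPy-based distance computation, and the tolerance parameter tol used for the convergence check are choices made for this example rather than part of the pseudocode above.

import numpy as np

def k_means(X, k, max_iter=100, tol=1e-4, seed=None):
    """A minimal k-means sketch following Algorithm 5.2.

    X is an (n_instances, n_attributes) array of real-valued attributes.
    Returns the cluster label of each instance and the final centroids.
    """
    rng = np.random.default_rng(seed)
    # Step 2: pick k random instances as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]

    for _ in range(max_iter):
        # Step 4: assign each instance to the closest centroid (Euclidean distance).
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)

        # Step 5: recompute each centroid as the mean of the instances assigned to it
        # (an empty cluster keeps its previous centroid in this sketch).
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])

        # Step 3: stop once the centroids move less than a small threshold.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids

    return labels, centroids

# Example usage: two well-separated groups of points should yield two clusters.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5.0])
labels, centroids = k_means(X, k=2, seed=0)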

The algorithm starts with k initial centroids. In practice, these centroids are randomly chosen instances from the dataset. These initial instances form the initial set of k clusters. Then, we assign each instance to one of these clusters based on its distance to the centroid of each cluster. The calculation of distances from instances to centroids depends on the choice of distance measure; Euclidean distance is the most widely used. After assigning all instances to a cluster, the centroids are recomputed by taking the average (mean) of all instances inside each cluster (hence, the name k-means). This procedure is repeated using the newly computed centroids until convergence. The most common criterion for convergence is that the centroids no longer change, which is equivalent to the clustering assignments of the data instances stabilizing. In practice, the algorithm execution can be stopped when the Euclidean distance between the centroids in two consecutive iterations is bounded above by some small positive value. As an alternative, k-means implementations try to minimize an objective function. A well-known objective function in these implementations is the squared distance error:

\sum_{i=1}^{k} \sum_{j=1}^{n(i)} \| x_{ij} - c_i \|^2, \qquad (5.55)

where x_{ij} is the jth instance of cluster i, n(i) is the number of instances in cluster i, and c_i is the centroid of cluster i. The process stops when the difference between the objective function values of two consecutive iterations of the k-means algorithm is bounded by some small value.
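As a small illustration, the squared distance error of Equation 5.55 can be computed as follows. The function name squared_distance_error and the representation of instances, labels, and centroids as NumPy arrays are assumptions of this sketch, reusing the conventions of the k_means example above.

import numpy as np

def squared_distance_error(X, labels, centroids):
    # Equation 5.55: sum over clusters i, and over the instances x_ij assigned
    # to cluster i, of the squared Euclidean distance ||x_ij - c_i||^2.
    return sum(
        float(np.sum((X[labels == i] - c) ** 2))
        for i, c in enumerate(centroids)
    )

# Alternative stopping rule: end the iterations once this objective value
# changes by less than a small threshold between two consecutive iterations.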
Note that k-means is highly sensitive to the initial k centroids, and different clustering results can be obtained on a single dataset depending on the choice of these initial centroids.