Social Media Mining: An Introduction


CUUS2079-05 CUUS2079-Zafarani 978 1 107 01885 3 January 13, 2014 19:23


5.8 Exercises


  1. Describe how the least-squares solution can be determined for
    regression.
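
One way to make this exercise concrete, offered as a minimal sketch rather than a full solution: for a single feature, setting the derivatives of the summed squared error to zero yields closed-form expressions for the slope and intercept. The function name `least_squares_line` and the toy data are hypothetical and chosen only for illustration.

```python
def least_squares_line(xs, ys):
    """Fit y = w0 + w1 * x by minimizing the sum of squared errors."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Zeroing the derivative of the squared error w.r.t. w1 and w0 gives:
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    w1 = sxy / sxx               # slope
    w0 = mean_y - w1 * mean_x    # intercept
    return w0, w1


# Hypothetical toy data lying exactly on y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]
w0, w1 = least_squares_line(xs, ys)
print(w0, w1)
```

With several features, the same reasoning leads to the normal equations, which are usually solved with a linear algebra routine instead of by hand.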


Unsupervised Learning


  1. (a) Given k clusters and their respective cluster sizes s_1, s_2, ..., s_k,
    what is the probability that two data vectors drawn at random (with
    replacement) from the clustered dataset belong to the same cluster?
    (b) Now, assume you are given this probability (you do not have the s_i's
    and k) and the fact that the clusters are equally sized. Can you find k?
    This gives you an idea of how to predict the number of clusters in a
    dataset.

  2. Give an example of a dataset consisting of four data vectors where there
    exist two different optimal (minimum SSE) 2-means (k-means, k = 2)
    clusterings of the dataset.
    Calculate the optimal SSE value for your example.
    In general, what should datasets look like geometrically so that we
    have more than one optimal solution?
    What defines the number of optimal solutions?
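
Exercises 1 and 2 above can both be explored numerically. The sketch below is illustrative only. It assumes that drawing with replacement means cluster i is picked with probability s_i/n on each draw (n being the total number of vectors), and it brute-forces every two-cluster partition of a hypothetical four-point dataset, the corners of a unit square. All function names and data are assumptions made for this example.

```python
from fractions import Fraction
from itertools import combinations


def same_cluster_probability(sizes):
    """P(two draws with replacement share a cluster) = sum of (s_i / n)^2."""
    n = sum(sizes)
    return sum(Fraction(s, n) ** 2 for s in sizes)


def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    cx = sum(x for x, _ in cluster) / len(cluster)
    cy = sum(y for _, y in cluster) / len(cluster)
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in cluster)


def two_cluster_partitions(points):
    """Yield every split of distinct points into two non-empty clusters, once each."""
    first, rest = points[0], points[1:]
    for r in range(len(rest)):  # r = number of other points joining `first`
        for chosen in combinations(rest, r):
            a = [first, *chosen]
            b = [p for p in rest if p not in chosen]
            yield a, b


# Exercise 1(b): with equal sizes the probability collapses to 1/k, so k = 1/p.
p = same_cluster_probability([4, 4, 4, 4])
k = 1 / p

# Exercise 2: hypothetical dataset, the four corners of a unit square.
square = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
scored = [(sse(a) + sse(b), a, b) for a, b in two_cluster_partitions(square)]
best = min(score for score, _, _ in scored)
optimal = [(a, b) for score, a, b in scored if abs(score - best) < 1e-9]
print(p, k, best, len(optimal))
```

With these inputs, p is 1/4 and k is 4. The square has two optimal partitions (left/right and top/bottom), each with an SSE of 1.0, which hints at the role symmetry plays in the last two questions of exercise 2.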


  15. Perform two iterations of the k-means algorithm in order to obtain two
    clusters for the input instances given in Table 5.4. Assume that the first
    centroids are instances 1 and 3. Explain whether more iterations are
    needed to get the final clusters.

Table 5.4. Dataset
Instance      X      Y
1          12.0   15.0
2          12.0   33.0
3          18.0   15.0
4          18.0   27.0
5          24.0   21.0
6          36.0   42.0
  16. What is the usual shape of clusters generated by k-means? Give an
    example of a case where k-means has limitations in detecting the
    patterns formed by the instances.


  1. Describe a preprocessing strategy that can help detect nonspherical
    clusters using k-means.
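
One possible strategy, offered as an assumption and not necessarily the intended answer, is to map the data into a feature space where the clusters become compact before running k-means. The sketch below uses a hypothetical dataset of two concentric rings centered at the origin and replaces each point by its distance from the origin. The helper names `ring` and `two_means_1d` are made up for this example.

```python
import math


def ring(radius, n):
    """n evenly spaced points on a circle of the given radius around the origin."""
    return [
        (radius * math.cos(2 * math.pi * i / n),
         radius * math.sin(2 * math.pi * i / n))
        for i in range(n)
    ]


def two_means_1d(values, iterations=10):
    """Plain 2-means on scalars, seeded with the smallest and largest value."""
    c = [min(values), max(values)]
    labels = []
    for _ in range(iterations):
        labels = [0 if abs(v - c[0]) <= abs(v - c[1]) else 1 for v in values]
        for j in (0, 1):
            members = [v for v, lab in zip(values, labels) if lab == j]
            if members:  # keep the old centroid if a cluster is empty
                c[j] = sum(members) / len(members)
    return labels


# Two concentric rings: not separable by k-means in the original (x, y) space.
points = ring(1.0, 8) + ring(5.0, 8)

# Preprocessing step: replace each point by its radius (a polar-coordinate feature).
radii = [math.hypot(x, y) for x, y in points]
labels = two_means_1d(radii)
print(labels)
```

In the transformed space the inner and outer rings form two tight groups, so k-means separates them. The transformation has to match the geometry of the data; the radius feature only helps when the rings share a known center.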
