Lab 9: Clustering

Lab 9: Clustering
Due April 10 by midnight

Introduction

In this lab you will explore unsupervised learning, specifically k-means clustering and agglomerative hierarchical clustering. You have been given a file called clustering.py that defines a base class with some common methods that read in data and calculate distance. You will be modifying the starting point files: hierarchical.py and kmeans.py.

Some sample data files have been provided. The ClusteringModel class expects the data to be in the following format:

A .points file should contain one data point per line, where each value is separated by white space.
An optional .labels file should contain one label per line, which will be interpreted as a string. Each label should be unique.
If labels are provided, the lengths of the two files should be equal.

The data provided includes:

digit_hiddens: This data was generated from a neural network that was trained to recognize hand-written digits (though a smaller data set and a simpler network than those you devised in lab 7). After training was complete, the weights were frozen, and the data was passed through the network again. For each digit, the hidden layer's activation was saved. These 10-dimension points are saved in the digit_hiddens.points file. There are 26 different examples of each digit hand-written by person a-z. The digit_hiddens.labels file has labels a0-a9, b0-b9, ..., z0-z9.
hiddens_subset: This data is a subset of the digits data, just including samples from individuals a-c.
random: This data is generated from a two-class mixture-of-uniform-distributions model.
small: This data is provided as a simple testing set to help your verify that your code is working properly. The three-dimensional points and their associated labels are shown below.
```
0.9 0.8 0.1 a1
0.9 0.1 0.6 b1
0.1 0.1 0.1 c1
0.8 0.9 0.2 a2
0.8 0.0 0.5 b2
0.0 0.0 0.2 c2
0.9 0.9 0.3 a3
0.2 0.0 0.0 c3
0.5 0.5 0.5 d1
```
The a points are close to (1,1,0); the b points are close to (1,0,0.5); the c points are close to (0,0,0); and the d point is an outlier. A "good" clustering of these points would group the a's, the b's, and the c's separately.

Once you have implemented the following clustering methods, you will also pick your own data set to explore with them and analyze the results using the latex template provided with the starting point.

Understanding the base class

Execute the base class defined in clustering.py. Make sure you understand what class variables it creates. Note that it converts the points into numpy arrays to more efficiently calculate Euclidean distances.

K-Means

You will implement k-means clustering in the kmeans.py file. The random data set will be particularly useful for testing k-means.

Implementing k-means

The k-means algorithm partitions the data into k clusters, where each cluster is defined by a center point. The goal is to minimize the distance between each centroid and the data points that are closest to it. The algorithm begins with an initial partitioning and then iteratively improves it until no further progress can be made.

The following data structures will be helpful in implementing the algorithm:

self.centroids
A dictionary that maps cluster names 'c1'–'ck' to the current center points represented as numpy arrays.
self.members
A dictionary that maps cluster names 'c1'–'ck' to a list of the points (as numpy arrays) that have been assigned to that cluster.
self.labels
A dictionary that maps points (converted to tuples) to the cluster name 'c1'–'ck' where they are currently assigned.

The following pseudocode describes the algorithm:

init k centroids to be random points from the data set
init members and labels to be empty dictionaries
while points change clusters:
   init each cluster's member list to be empty
   # E step
   for each point in the data set 
      assign point to the closest centroid, updating members and labels
   # M step
   for each cluster
      update the centroid to the average of its assigned points

There is a rare edge case that also needs handling, where one of the clusters will end up with no points assigned to it, and therefore no centroid. If this happens, you should re-initialize that centroid to a random data point, and run a new E step to reassign the data points.

Testing k-means on the small data set

Below is a trace of one execution of k-means on the small.points data that uses the showClusters method to print information about each cluster and its members. In the first iteration, points c3, a2, and a3 were selected as the initial centroids.

ITERATION 1
Current error: 2.8477734056
--------------------
Center: c2 Length: 4
Center point: 0.200 0.000 0.000 c3
0.100 0.100 0.100 c1
0.800 0.000 0.500 b2
0.000 0.000 0.200 c2
0.200 0.000 0.000 c3
--------------------
Center: c1 Length: 3
Center point: 0.800 0.900 0.200 a2
0.900 0.800 0.100 a1
0.800 0.900 0.200 a2
0.500 0.500 0.500 d1
--------------------
Center: c0 Length: 2
Center point: 0.900 0.900 0.300 a3
0.900 0.100 0.600 b1
0.900 0.900 0.300 a3

In the next iteration, the error drops, and several of the points have shifted clusters (a3, b2, and d1).

ITERATION 2
Current error: 2.72339291187
--------------------
Center: c2 Length: 3
Center point: 0.275 0.025 0.200
0.100 0.100 0.100 c1
0.000 0.000 0.200 c2
0.200 0.000 0.000 c3
--------------------
Center: c1 Length: 3
Center point: 0.733 0.733 0.267
0.900 0.800 0.100 a1
0.800 0.900 0.200 a2
0.900 0.900 0.300 a3
--------------------
Center: c0 Length: 3
Center point: 0.900 0.500 0.450
0.900 0.100 0.600 b1
0.800 0.000 0.500 b2
0.500 0.500 0.500 d1

In the final iteration error drops again, but none of the points have moved clusters, so the program ends.

ITERATION 3
Current error: 1.46750698081
--------------------
Center: c2 Length: 3
Center point: 0.100 0.033 0.100
0.100 0.100 0.100 c1
0.000 0.000 0.200 c2
0.200 0.000 0.000 c3
--------------------
Center: c1 Length: 3
Center point: 0.867 0.867 0.200
0.900 0.800 0.100 a1
0.800 0.900 0.200 a2
0.900 0.900 0.300 a3
--------------------
Center: c0 Length: 3
Center point: 0.733 0.200 0.533
0.900 0.100 0.600 b1
0.800 0.000 0.500 b2
0.500 0.500 0.500 d1
--------------------------------------------------
Centers have stabilized after 3 iterations
Final error: 1.46750698081

Remember that each run of k-means will be different. The algorithm is very sensitive to the initial conditions. The final assignments found in this particular run are the most common, but other assignments will also occur.

Testing k-means on the random data set

With two dimensional data we can plot the movement of the centroids and the assignment of points to clusters. Using the plotClusters method we can watch how the clusters evolve over time. After each iteration a plot will appear. The plot shows the centroids as x's and the members as circles. Each cluster is color coded. You'll need to close the plot to continue to the next iteration.

Below is one run with k=2 using the random data that required 8 iterations to stabilize. Notice that the green cluster was initially quite small and over time grew downwards.

Try different different values for k on the random data. Recall that one of the issues with using k-means is that it is not always clear how to appropriately set k. Larger k will reduce error, but may create clusters that are over-fitted to the data.

Be sure to discuss how you chose k when you applied k-means to your own data set.

Hierarchical Clustering

You will perform agglomerative clustering, constructing the tree from the bottom up. Let's walk through the steps using the small data set described above.

Use a dictionary that maps a cluster's label to the cluster's average point (average link). The cluster's label will represent the tree structure that has been created so far. The leaves will simply be the point labels. Internal nodes will be tuples of the form:

(clNUM, leftBranch, rightBranch)

Begin by placing each point from the data set in its own cluster. Given that there are nine points in this data set you will have nine initial clusters.
Calculate the distances between all possible pairs of clusters. Find the pair with the minimum distance. Remove each of these from the dictionary and form a new cluster with this pair adding it to the dictionary. The two closest clusters are a2 and a3. A new cluster called cl0 is formed. The value associated with this cluster is the average of the points from the two original clusters. The current set of cluster names is now:
```
('cl0', 'a3', 'a2')
a1
b1
b2
c3
c2
c1
d1
```
Repeat this process with the current set of clusters. The next two closest are b1 and b2. A new cluster called cl1 is added.
```
('cl0', 'a3', 'a2')
('cl1', 'b1', 'b2')
a1
c3
c2
c1
d1
```

Continuing again, we will have:

('cl0', 'a3', 'a2')
('cl1', 'b1', 'b2')
a1
('cl2', 'c2', 'c1')
c3
d1

The next iteration will merge a leaf with an existing cluster:

('cl3', ('cl0', 'a3', 'a2'), 'a1')
('cl1', 'b1', 'b2')
('cl2', 'c2', 'c1')
c3
d1

Iterating again we see that all of the original points have been merged with a cluster, except for d, which was the outlier.
```
('cl4', ('cl2', 'c2', 'c1'), 'c3')
('cl3', ('cl0', 'a3', 'a2'), 'a1')
('cl1', 'b1', 'b2')
d1
```

This merging process continues until only a single cluster remains in the dictionary.

('cl7', ('cl4', ('cl2', 'c2', 'c1'), 'c3'), ('cl6', ('cl3', ('cl0',
'a3', 'a2'), 'a1'), ('cl5', ('cl1', 'b1', 'b2'), 'd1')))

This tuple reflects the tree structure, but it can be hard to see the structure in this format. Using the plotTree method will generate a picture like this:

Implement this bottom-up process now in the file hierarchical.py. Calculating distances is the bottleneck of this process, so be sure to store all calculations in the self.distances dictionary. Then you can check this first whenever a distance is needed. If it's there, use it, if not generate it and store it.

Once you can successfully create a hierarchical cluster for the small data, try the subset data next, and finally the full digits data. Drawing the tree for the full data will take some time. Based on the final tree, which digits seem to be easier (and harder) for the network to classify?

Optional Extension

In lecture, we will consider several alternatives for measuring the similarity of (or distance between) two points. You are encouraged to test some of these alternatives. As a first step, try replacing Euclidean distance with Manhattan distance. Remember, Manhattan distance is equivalent to the 1-norm, so we can add a ManhattanDist function to clusteringModels.py that will look very similar to the EuclideanDist function that has already been defined. To use the new function, you'll need to replace the following line from UnsupervisedLearning.__init__:

self.dist = self.EuclideanDist  -->  self.dist = self.ManhattanDist

You should also be sure that your kmeans and hierarchical implementations call self.dist instead of self.EuclideanDist You can also try out clustering with other distance or similarity measures.

Explore a data set of your choice

Find a data set that you are interested in exploring using unsupervised learning. A good place to look for inspiration is the UCI Machine Learning Repository.

Apply your implementations of both hierarchical clustering and k-means to your chosen data set. This may require you to do some pre-processing to put the data into the necessary format.

Write up your findings in the provided LaTeX template lab9.tex. To compile it do: pdflatex lab9.tex. To view the result do: gnome-open lab9.pdf.

Submitting your code

To submit your code, you need to use git to add, commit, and push the files that you modified. Be sure to include the data files you analyzed.

cd ~/cs63/labs/09
git add *.py *.points *.labels
git commit -m "final version"
git push

Lab 9: ClusteringDue April 10 by midnight