CS68: Lab 4 - Clustering

Overview

The objectives for the next two labs are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will obtain experience and an understanding for:

parsing large biological data sets
implementing a clustering algorithm (Lab 4)
implementing a classification algorithm (Lab 5)
experimental methodology/evaluation metrics for machine learning methods
analyzing gene expression data

This week's lab will concentrate on the use of clustering to analyze yeast genome expression. In particular, you will implement the k-means clustering algorithm.

You may work with one partner on this lab. Be sure to indicate who you worked with in your submission (i.e., in the README, program header, an using the 'p' option for handin68). As usual, obtain this week's starting files using update68. You will hand in an implementation of k-means clustering, as well as analysis of your results in the results.pdf file. Labs will be due Sunday, November 2.

To get started, run update68 to obtain your starting files. Below is the lab organization, with files to modify highlighted in blue:

clustering.py - a library for general clustering algorithms and evaluation functions. You should implement the modular components of your algorithm here.
run_kmeans.py - your main program for reading in input arguments and calling the clustering methods.
input - contains input files. class.csv is the example from lecture slides with 4 data points. The other files are for the yeast expression data set (explained below)
experiments.pdf - your response to the experiment questions, below.
README - the usual

Implementation: clustering library

You will implement your clustering algorithm in clustering.py. It is common to build a library of common algorithms that you can easily reference in a main program. For example, your clustering.py file should allow you to implement Gaussian Mixture Models without having to reimplement common methods (e.g., distance functions, Silhouette scores, AIC). Your clustering.py file should contain functions for:

the central k-means algorithm
distance functions
evaluation metrics

K-means Algorithm

You should implement the standard k-means clustering algorithm discussed in lecture. Additional requirements/design choices:

You should utilize Manhattan distance to measure the distance between instances
You should run k-means until convergence, or 50 iterations, whichever comes first.
You should initialize each cluster by picking a random point from the data set. Also, if a cluster ever has 0 members, you should randomly assign the centroid to a random instance in the data set. Be sure that you don't assign the same initial location to two different cluster centers

Evaluation Metrics

Add functions for calculating the sum of squared errors (SSE), AIC, and Silhouette values as well as any other metrics you choose. You will call these methods from your main function (see below).

Implementation: main program

You should implement the main program in run_kmeans.py. This file should:

Read in and parse input files
Call clustering methods implemented in clustering.py
Output results to standard output and file

This file should be minimal in implementation.

Input

A user will enter three arguments on the command-line:

k, the number of clusters to use (integer)
the name of the file containing the gene expression profiles
(optional)the name of a file containing annotations about each instance. If this is not provided, instances should indexed by their line number in the original file.

For example, for the in-class example, the command-line run would be:

$ python run_kmeans.py 2 class.csv

For the yeast data set, however, we want to pair gene names with the profiles:

$ python run_kmeans.py 2 sample-yeast.csv sample-yeast-names.txt

If there are any errors in usage or with the files, your program should output a message and exit.

The data file will contain one instance per line and will be in comma separated format. There is a sample taken from class in class.csv.

For your main test, refer to the following paper:

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS 1998 95 (25) 14863-14868

In particular, take a look at the publicly available data set and associated key for column headers. You have been provided two versions of this data in your lab directories. small-yeast.csv contains the expression values for 52 genes, one gene per line. You should use this for intermediate testing. full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each file contains a companion gene identification file that contains the gene name and also the annotated functional assignment of that gene to help validate clusters. These two files (full-yeast-names.txt, sample-yeast-names.txt) contain the names of each gene which will help you identify the genes in a cluster. The list of gene names is an example of the optional command-line argument. It is not used in the actual algorithm, but is used when printing out cluster members to better understand the results.

A detailed omitted in class is that complexity penalization methods, such as AIC, assume the data is normalized. For this lab, you should normalize values by doing feature-length normalization. That is, for each column in your data, the length of the vector should equal 1. Do do this in practice is not difficult. You can follow this pseudocode:

Given: Data matrix, D, with n rows (one for each gene) and m columns (for each
experiment/feature)
Do: Length normalize features

for j=1..m #for each feature
  sum = 0
  for i=1..n #for each gene
    sum += D[i][j]**2 //square each value
  length = sqrt(sum) #Calculates length of vector, same as euclidean distance
  for i=1..n
    D[i][j] = D[i][j]/length #normalize values

Output

Your program should output the following:

In a file titled kmeans.out, write the number of clusters on the first line. After that, you should have k sections. The first line of each section should contain the value of the cluster centroid (the mean value learned). After that, print a list of the gene names for all genes that belong to that cluster, each on its own line. When you transition to a new cluster leave a blank line. If no gene name file is provided (for example, the class example) simply output the index of the instance in the original input file.
On standard output, indicate the sum of squared errors (SSE) across all clusters using the formula from class:

$\large \dpi{100} SSE = \sum_{j=1}^{k}\sum_{x_i \in c_j} dist(x_i, \mu_j)^2$
On standard output, indicate the AIC measure of error of your clustering results. As a reminder, AIC is equal to the sum of squared errors plus a penalty term:

$\large \dpi{100} AIC = SSE + 2 k M$
On standard output, indicate the Silhouette Index of your cluster result. The Silhouette index for one data point is:

$\dpi{100} s(i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),b(x_i)\} }$

where b(x_i) is the average distance to members of the closest cluster (simply calculate the distance from x_i to all cluster means mu_j except the one x_i belongs to and select the smallest. Then find the average distance to all data members in this cluster) and a(x_i) is the average distance to the members of the same cluster (not including x_i). After calculating all s(i)s, the total Silhouette Index is:

$\large \dpi{100} S = \frac{1}{K} \sum_{j=1}^{K}\frac{1}{|c_j|}\sum_{x_i \in c_j}s(x_i)$

More details can be found here.

Sample Results

For the example from class:

$python run_kmeans.py 2 input/class.csv 

K-means with k=2

SSE:         0.07
AIC:         8.07
Silhouette:  0.86

The output file called kmeans.out should like this. For the small yeast example:

$ python run_kmeans.py 2 sample-yeast.csv sample-yeast-names.txt

K-means with k=2

SSE:      2629.30
AIC:      2945.30
Silhouette:  0.28

and the corresponding output file kmeans.out. For 5 clusters on the same data set:

$ python run_kmeans.py 5 sample-yeast.csv sample-yeast-names.txt

K-means with k=5

SSE:      1977.77
AIC:      2667.77
Silhouette: 0.33

and the output file kmeans.out.

NOTE: since k-means is locally optimal and there is randomness in the starting point, your results may vary. You should get similar results if you run your method a few times.

Experimental Analysis

For both the small and full yeast data set, test your program on a reasonable range of possible values for k. Then, in a separate file (experiments.pdf), provide the following:

For the small yeast data set, provide a table of results for each k value you tested. Each row should represent one value of k and each column represents the 3 metrics above (SSE, AIC, Silhouette Index).
In a paragraph, discuss what value(s) of k are would be best to choose and what measures lead you to that choice. Justify your choice using both the measures above as well as external validation (the names of each gene include a short description of their functional category). How useful are the validation metrics? Why do we need to normalize our input values to improve results?
Provide a paragraph description of how the Eisen et al. paper validated their results. That is, how did they justify their clustering algorithm as being informative?

Submitting your work

When you are finished with your lab, submit your code using handin68.

CS68 Lab 4: Gene Expression Analysis with Clustering

K-means Algorithm

Evaluation Metrics

Input

Output

Sample Results