CS68: Lab 4 - Clustering

Overview

The objectives for the next two labs are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will obtain experience and an understanding for:

parsing large biology data sets
implementing a clustering algorithm (Lab 4)
implementing a classification algorithm (Lab 5)
experimental methodology/evaluation metrics for machine learning methods
analyzing gene expression data

This week's lab will focus on the use of clustering to analyze yeast genome expression. In particular, you will implement the EM for Gaussian Mixtures clustering algorithm.

Getting Started

Find your git repo for this lab assignment with name format of Lab4-id1_id2 (id1 being replaced by the user id of you and your partner): CS68-S17

Clone your git repo with the lab 4 starting point files into your labs directory:

$ cd cs68/labs
$ git clone [the ssh url to your repo)

Then cd into your Lab4-id1_id2 subdirectory. You will have the following files (those in blue require your modification):

clustering.py - a library for general clustering algorithms and evaluation functions. You should implement the modular components of your algorithm here.
gene_cluster.py - your main program for reading in input arguments and calling the clustering methods.
input - contains input files. class.csv is the example from lecture slides with 4 data points. The other files are for the yeast expression data set (explained below)
experiments.pdf - your response to the experiment questions, below.
README - the usual

Implementation: clustering library

We will separate out the general algorithm (e.g., EM, k-means, AIC evaluation, etc.) from the application specific tasks (parsing file; interpreting results). You will implement your clustering algorithm in clustering.py. clustering.py file should contain functions for:

the EM algorithm for estimating a k-Gaussian Mixture Model
evaluation metrics
any other helper methods you deem fit

Details for the EM Algorithm

Recall that you will need to estimate the parameters of the model - a mean, covariance, and mixing probability for each cluster. To initialize each of these:

the mean, mu, of each cluster should be initialized for each cluster by picking a random point from the data set. Ensure that the same point is not picked for multiple clusters (HINT: you can use the shuffle function from the random library to randomly shuffle a list and then just select the first k values).
the mixture probability, alpha, of each cluster should be uniform (i.e., 1/K).
the covariance, sigma, of each cluster, will be shared across all clusters. Simply take the variance of all of your data e.g.,
```
sigma = numpy.diag(numpy.cov(data.T))
```
Here, data has the rows as instances and columns as features. First, data.T transposes the data, then numpy.cov() returns a 2D covariance matrix of the data, and finally we keep only the diagonal of the matrix numpy.diag(). The output is a 1D array representing the variance of each feature in the data set. This is the only time sigma needs to be calculated.

Tips and requirements:

The likelihood of a point being generated by a cluster will be evaluated using a multivariate Gaussian distribution, as was done in class. Rather than implement the function, you can use the scipy.stats implementation of a multivariate normal distribution:
```
from scipy.stats import multivariate_normal

prob_x = multivariate_normal.pdf(x,mu,sigma)
```
where x is your data point, mu is the cluster mean, and sigma is the covariance. This is equivalent to N(x|mu, sigma) from your lecture notes.
You should run EM until convergence, or 50 iterations, whichever comes first.
I strongly recommend that you use numpy to handle your lists. You will need to do a lot of basic arithmetic on lists (e.g., sum elements, add two lists, normalize all values, find the max/argmax) that are simple function calls in numpy. For example, my EM implementation went from 30 lines to over 100 when I switched to just regular lists/dictionaries.
The first step of your loop (the E-step) should calculate the cluster membership for each point and cluster (i.e., h from lecture).
The M-step will update the means (as done in lecture) and the alphas (you will need to determine how to do this).
To check for convergence, calculate the total difference in mean values from one iteration to the next. If it exceeds some very small value (e.g., 0.0001), go to the next iteration.
Your function should return the cluster means, as well as a list or dictionary that specifies the most likely cluster membership for each point (e.g., an index value 1 through K). You can calculate this by taking the argmax h value for each point across all k clusters.

Evaluation Metrics

Add functions for calculating the sum of squared errors (SSE), AIC, and (optionally) Silhouette values as well as any other metrics you choose. You will call these methods from your main function (see below).

Implementation: main program

You should implement the main program in gene_cluster.py. This file should:

Read in and parse input files
Call clustering methods implemented in clustering.py
Output results to standard output and file

Almost all of the code in this file will involved reading/parsing the file and outputting results.

Usage

A user will enter three arguments on the command-line:

k, the number of clusters to use (integer)
the name of the data set
whether to load name identifiers for the rows (e.g., genes) (0 or 1). If the user enters 0, instances should indexed by their line number in the original file (e.g. x1, x2, x3). If 1, you will load dataset.names.

For example, for the in-class example, the command-line run would be:

$ python gene_cluster.py 2 kmeans_class 0

Your program sets k=2, and opens input/kmeans_class.csv for the data. There is no index file, so the identifier for row 1 is x1, etc. For the yeast data set, we want to pair gene names with the profiles:

$ python gene_cluster.py 2 sample-yeast 1

This will load the gene profiles from input/sample-yeast.csv and labels (gene names/functions) from input/sample-yeast.names. If there are any errors in usage or with the file names, your program should output a message and exit.

Input

The data file will contain one instance per line and will be in comma separated format. The data from class is in kmeans_class.csv.

For your main test, refer to the following paper:

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS 1998 95 (25) 14863-14868

In particular, take a look at the publicly available data set and associated key for column headers. You have been provided two versions of this data in your lab directories. small-yeast.csv contains the expression values for 52 genes, one gene per line. You should use this for intermediate testing. full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each file contains a companion gene identification file that contains the gene name and also the annotated functional assignment of that gene to help validate clusters. These two files (full-yeast.names, sample-yeast.names) contain the names of each gene which will help you identify the genes in a cluster. The list of gene names is an example of the optional command-line argument. Line 1 is the name/function of the gene represented in line 1 of the gene profile data (.csv). It is not used in the actual algorithm, but is used when printing out results to better understand the results.

A detailed omitted in class is that complexity penalization methods, such as AIC, assume the data is normalized. For this lab, you must normalize values by doing feature-length normalization. That is, for each column in your data, the length of the vector should equal 1. You can follow this pseudocode:

Given: Data matrix, D, with n rows (one for each gene) and m columns (for each
experiment/feature)
Do: Length normalize features

for j=1..m #for each feature
  sum = 0
  for i=1..n #for each gene
    sum += D[i][j]**2 //square each value
  length = sqrt(sum) #Calculates length of vector, same as euclidean distance
  for i=1..n
    D[i][j] = D[i][j]/length #normalize values

Output

Your program should output the following:

In a file titled dataset.out (e.g., sample-yeast.out), write the number of clusters on the first line. After that, you should have k sections. The first line of each section should contain the value of the cluster centroid (the mean value learned). After that, print a list of the gene names for all genes that belong to that cluster, each on its own line. When you transition to a new cluster leave a blank line. If no gene name file is provided (for example, the class example) simply output the index of the instance in the original input file.
On standard output, indicate the sum of squared errors (SSE) across all clusters using the formula from class:

$\large \dpi{100} SSE = \sum_{j=1}^{k}\sum_{x_i \in c_j} dist(x_i, \mu_j)^2$
On standard output, indicate the AIC measure of error of your clustering results. As a reminder, AIC is equal to the sum of squared errors plus a penalty term:

$\large \dpi{100} AIC = SSE + 2 k M$
(Optional) On standard output, indicate the Silhouette Index of your cluster result. The Silhouette index for one data point is:

$\dpi{100} s(i) = \frac{b(x_i) - a(x_i)}{\max\{a(x_i),b(x_i)\} }$

where b(x_i) is the average distance to members of the closest cluster (simply calculate the distance from x_i to all cluster means mu_j except the one x_i belongs to and select the smallest. Then find the average distance to all data members in this cluster) and a(x_i) is the average distance to the members of the same cluster (not including x_i). After calculating all s(i)s, the total Silhouette Index is:

$\large \dpi{100} S = \frac{1}{K} \sum_{j=1}^{K}\frac{1}{|c_j|}\sum_{x_i \in c_j}s(x_i)$

More details can be found here.

Sample Results

For the example from class:

$ python gene_cluster.py 2 kmeans_class 0

Fitting Gaussian Mixture Model using EM with k=2

SSE:         0.09
AIC:         8.09
Silhouette:  0.86

The output file called kmeans_class.out should look like this. (Note: due to the small data set size, you may hit a corner case where both clusters converge to the same mean. You can just rerun and get a better result). To help with debugging, use these intermediate values for the E-step and M-step after each iteration. For the small yeast example:

$ python gene_cluster.py 2 sample-yeast 1

Fitting Gaussian Mixture Model using EM with k=2

SSE:      2629.30
AIC:      2945.30
Silhouette:  0.28

and the corresponding output file sample-yeast.out. For 5 clusters on the same data set:

$ python gene_cluster.py 5 sample-yeast 1

Fitting Gaussian Mixture Model using EM with k=5

SSE:      1852.38
AIC:      2642.38
Silhouette:  0.36

and the output file sample-yeast.out.

NOTE: since EM is locally optimal and there is randomness in the starting point, your results may vary. You should get similar results if you run your method a few times and pick the one that looks best.

Experimental Analysis

For the small yeast data set, test your program on a reasonable range of possible values for k. Then, in a separate file (experiments.pdf), provide the following:

For the small yeast data set, provide a table of results for each k value you tested. Each row should represent one value of k and each column represents the metrics above (e.g., SSE, AIC, Silhouette Index). Since the results can be noisy due to local maximum, be sure to run each setting a few times. Choose either to take the average or the best result.
In a paragraph, describe what values did you find to be best for k, and justify your choice using both internal metrics (e.g., SSE and AIC) and external -- do the clusters in your output file look like good clusters? You may want to focus on the column of gene functions, which describe the purpose of each gene. How might you choose to set the number of clusters using only the gene names file?
Provide an additional paragraph description of how the Eisen et al. paper validated their results. That is, how did they justify their clustering algorithm as being informative? I do not want you to provide the details of their findings (e.g., "Cluster A had genes X,Y,Z") but rather a description of the methods they used and why they choose to use these methods.