CS68 Lab 8: Machine Learning for Gene Expression Analysis

Due by 11:59pm, Saturday May 4, 2013
Overview

The objectives for this week's lab are to provide a sampling of common tasks that fall under traditional machine learning. Specifically, you will gain experience with and an understanding of:

Your lab will be broken down into two programs: clustering yeast genes based on their expression values over several experiments and classifying leukemia patients based on their expression values for a set of genes.


k-means clustering

For this portion of the lab, you will implement k-means clustering and test it on an expression data set from the yeast genome. So that your code can be reused in the future, create a library file clusterAlgos.py containing a general k-means implementation as well as functions to evaluate clusters based on various criteria. You should also design a distance function. Your main program should be in a file named yeastCluster.py.
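For the distance function, Euclidean distance is a reasonable default for expression profiles; a minimal sketch (the function name euclideanDistance is just a suggestion for your clusterAlgos.py library):

```python
import math

def euclideanDistance(a, b):
    """Euclidean distance between two equal-length lists of expression values."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

Keeping the distance behind a single function like this makes it easy to swap in another metric (e.g., a correlation-based distance) later without touching the clustering code.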

Input

A user will enter three arguments on the command-line:
  1. k, the number of clusters to use (integer)
  2. the name of the file containing the instances
  3. (optional) the name of a file containing annotations about each instance. If this is not provided, instances should be indexed by their line number in the original file.
The data file will contain one instance per line in comma-separated format. A sample taken from class is in class.csv.
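Reading the two file formats described above might look like the following sketch (loadInstances is an illustrative name, not a requirement):

```python
def loadInstances(dataFile, namesFile=None):
    """Read one comma-separated instance per line; label each instance by the
    corresponding line of namesFile if given, else by its line number."""
    instances = []
    with open(dataFile) as f:
        for line in f:
            line = line.strip()
            if line:
                instances.append([float(v) for v in line.split(",")])
    if namesFile is not None:
        with open(namesFile) as f:
            labels = [line.strip() for line in f]
    else:
        labels = [str(i) for i in range(len(instances))]
    return instances, labels
```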

For your main test, refer to the following paper:

Michael B. Eisen, Paul T. Spellman, Patrick O. Brown, and David Botstein. Cluster analysis and display of genome-wide expression patterns. PNAS 1998 95 (25) 14863-14868

In particular, take a look at the publicly available data set and the associated key for the column headers. I have provided a pre-processed sample of the data in your update directories: small-yeast.csv contains the expression values for 52 genes, one gene per line, while full-yeast.csv contains the expression values for all 2467 genes in the original data set. Each data file has a companion gene identification file (full-yeast-names.txt, sample-yeast-names.txt) listing each gene's name along with its annotated functional assignment, which will help you identify the genes in a cluster and validate your clusters. A gene names file is an example of the optional command-line argument: it is not used in the algorithm itself, but is used when printing out cluster members to better understand the results.

Algorithm

You should implement the standard k-means clustering algorithm discussed in lecture. Additional requirements/design choices:
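As a starting point, the two alternating steps of standard k-means can be sketched as below; the function names are illustrative, dist is whatever distance function you design, and the handling of empty clusters is left as one of your design choices:

```python
def assignClusters(data, centroids, dist):
    """Assign each instance the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: dist(point, centroids[j]))
            for point in data]

def updateCentroids(data, assignments, k):
    """Recompute each centroid as the mean of its assigned instances."""
    centroids = []
    for j in range(k):
        members = [data[i] for i, a in enumerate(assignments) if a == j]
        if members:  # an empty cluster is skipped here; you must pick a policy
            centroids.append([sum(col) / float(len(members))
                              for col in zip(*members)])
    return centroids
```

Iterating these two steps until the assignments stop changing gives the standard algorithm.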

Output

Your program should output the following:
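For instance, the sum of squared errors (SSE) reported in the sample results can be computed with a sketch like this (computeSSE and the parallel assignments list are illustrative assumptions):

```python
def computeSSE(data, centroids, assignments):
    """Sum of squared distances from each instance to its assigned centroid."""
    total = 0.0
    for point, a in zip(data, assignments):
        total += sum((x - c) ** 2 for x, c in zip(point, centroids[a]))
    return total
```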

Experimental methodology

For both the small and full yeast data sets, test your program on a reasonable range of possible values for k. Then, in a separate file (results.{doc, pdf}), provide the following:

Sample Results

For the example from class:
$ python yeastCluster.py 2 class.csv
Clustering using k-means with k=2

SSE: 7.33
AIC: 15.33
Silhouette: 0.85
The output file, called kmeans.out, should look like this. For the yeast example:
$ python yeastCluster.py 2 sample-yeast.csv sample-yeast-names.txt
Clustering using k-means with k=2

SSE: 37115.38
AIC: 37431.38
Silhouette: 0.36
and the corresponding output file kmeans.out.

NOTE: since k-means converges only to a local optimum and the starting point is random, your results may vary. Running your method a few times should produce similar results.


k-nearest neighbors

For this portion of the lab, you will implement the supervised learning algorithm k-nearest neighbors and employ appropriate experimental methodology to estimate the accuracy of your algorithm. As with the above problem, you should use good design skills by creating a library for classification algorithms in the file classifiers.py and putting your main program in the file leukemiaClassifier.py.

Input

A user will input two arguments to run your program in the following order:

  1. k, the number of neighbors to consider
  2. The file name of the training set.
I originally designed this lab to have you utilize a tuning set to pick k, but cut that out for brevity. If you choose to implement that as an extension, you can omit the first command-line argument.

Your data set comes from the paper referenced in class for using expression analysis to differentiate cancer patients:

T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531-537, 1999.

The data has been made available in your labs directory in the file titled leukemia.csv. Each line describes the expression values of one patient. Each column (except the last) is the expression value for one gene. Details can be found here. The last column is the class for the sample: -1 for acute lymphoblastic leukemia (ALL) and 1 for acute myeloid leukemia (AML).
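Separating the class column from the features when loading the file might look like this sketch (loadLabeledData is an illustrative name):

```python
def loadLabeledData(filename):
    """Read comma-separated lines; return (features, labels) where the last
    column of each line is the class: -1 for ALL, 1 for AML."""
    features, labels = [], []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if line:
                values = [float(v) for v in line.split(",")]
                features.append(values[:-1])
                labels.append(int(values[-1]))
    return features, labels
```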

Algorithm

You are to implement the standard k-nearest neighbors algorithm discussed in lecture. To guide your algorithm design, follow these tips/requirements:
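The core prediction step can be sketched as follows; knnClassify, its argument order, and the tie-breaking policy are illustrative choices, not requirements:

```python
def knnClassify(k, trainData, trainLabels, query, dist):
    """Predict the majority class among the k nearest training examples."""
    # Indices of the k training examples closest to the query point
    nearest = sorted(range(len(trainData)),
                     key=lambda i: dist(trainData[i], query))[:k]
    votes = sum(trainLabels[i] for i in nearest)
    # Labels are -1/1, so the sign of the vote sum gives the majority;
    # a tie (votes == 0) needs a policy -- here we arbitrarily predict 1
    return 1 if votes >= 0 else -1
```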

Experimental Methodology

To provide an accurate estimate of classifier error, we must validate the algorithm on a separate set of data from that used for training. As discussed in class, a typical methodology that accomplishes this while utilizing all of the data is n-fold cross-validation. You should use 5 folds for this lab. You do not need to worry about using a tuning set, as we will assume k is set by some external piece of information.

To perform proper cross-validation, each example should appear in the test set of exactly one fold, no more, no less. When not in the test set, it should be in the training set for that fold. In addition, the relative sizes of the training/test splits should be about the same for each fold (since 72 is not evenly divisible by 5, 3 of your test sets will have 14 data points and 2 will have 15). Some methodologies also balance the ratio of positive to negative examples in the test set, but we'll skip that for this lab.

In Python, you should utilize the random.shuffle function. The following sketch should provide guidance; it assumes the entire data set is in a two-dimensional list titled origData.

import random

def generateCVData(numFolds, origData):
  N = len(origData)
  indexList = list(range(N))
  random.shuffle(indexList)  # list of indices, all shuffled up

  shuffledData = [origData[i] for i in indexList]

  # Distribute the remainder so fold sizes differ by at most one
  # (e.g., 72 examples in 5 folds gives sizes 15, 15, 14, 14, 14)
  base, extra = divmod(N, numFolds)

  trainFolds, testFolds = [], []
  leftEnd = 0
  for k in range(numFolds):
    rightEnd = leftEnd + base + (1 if k < extra else 0)
    testFolds.append(shuffledData[leftEnd:rightEnd])
    # training set: items before the left end and after the right end
    trainFolds.append(shuffledData[:leftEnd] + shuffledData[rightEnd:])
    leftEnd = rightEnd

  return [trainFolds, testFolds]

Note that divmod sidesteps the integer/float issues in computing fold sizes, and building the training set by concatenating slices avoids Python's negative-indexing pitfalls. You will only need to run this function once, as it generates all of the folds for you. There are many ways to perform cross-validation; this is just one.

Output

You should maintain the cumulative counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the folds. That is, for five fold cross-validation, sum the number of true positives for each fold to obtain the final true positive count. The output of your program should include the following measures of error:

Here, positive refers to AML (class 1) and negative refers to ALL (class -1). N is the total number of test examples, which is simply the size of the input data set.
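With the cumulative counts in hand, the standard definitions of these measures can be sketched as follows (summarizeError is an illustrative name):

```python
def summarizeError(TP, TN, FP, FN):
    """Accuracy, sensitivity, and specificity from cumulative counts."""
    N = TP + TN + FP + FN
    accuracy = (TP + TN) / float(N)
    sensitivity = TP / float(TP + FN)  # fraction of AML (positive) cases caught
    specificity = TN / float(TN + FP)  # fraction of ALL (negative) cases caught
    return accuracy, sensitivity, specificity
```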

Sample Results

In your directory, you'll find a toy data set titled simple.csv. A sample run with k=2 and 5-fold cross-validation produces the following output:

$ python leukemiaClassifier.py 2 simple.csv
Classification using k-nearest neighbors with k=2 and 5 folds

Results:

Accuracy: 0.80
Sensitivity: 0.67
Specificity: 1.00
To help with debugging, take a look at the intermediate values produced by the algorithm. Note that randomization may produce a different ordering for your folds. However, since 5-fold cross-validation is the same as leave-one-out in this example, you should get the exact same results, just with the folds in a shuffled order.

A sample run with the leukemia data set would look similar to:

$ python leukemiaClassifier.py 5 leukemia.csv
Classification using k-nearest neighbors with k=5 and 5 folds

Results:

Accuracy: 0.93
Sensitivity: 0.86
Specificity: 0.98
Your results will differ as the folds are determined randomly.


Submitting your work

When you are finished with your lab, submit your code using handin68.