CS68 Lab 5: Gene Expression Analysis with Classification

Due by 11:59pm, Sunday November 9, 2014
Overview

This lab will continue to build on the themes from last week's lab.

This week's lab will concentrate on the use of classification algorithms to predict prognosis for leukemia patients using gene expression data. In particular, you will implement k-nearest neighbors, a simple yet effective classification algorithm. You will employ appropriate experimental cross-validation both for model selection and for estimating generalization error. You should practice good design by creating a library of classification algorithms in the file classifiers.py and putting your main program in the file leukemiaPredictor.py.

You may work with one partner on this lab. Be sure to indicate who you worked with in your submission (i.e., in the README, in the program header, and by using the 'p' option for handin68).

To get started, run update68 to obtain your starting files. Among them, you will modify classifiers.py and leukemiaPredictor.py.


Implementation: classifier library

As with the previous lab, you will create a library of classifiers that you can call interchangeably in any application. At a minimum, this file should implement general versions of the algorithms described in this lab: the k-nearest neighbors classifier and the generateFolds cross-validation routine.

Algorithm

You are to implement the standard k-nearest neighbors algorithm discussed in lecture: for each test example, compute its distance to every training example, find the k closest training examples, and predict the majority class among those neighbors.
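
To make the procedure concrete, here is a minimal sketch of one way to structure it in Python (the function names and the tie-breaking choice are my own, not requirements of the lab):

import math

def euclidean(a, b):
    # Euclidean distance between two feature vectors
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def knnClassify(k, trainData, testExample):
    # each row is [expr1, ..., exprM, classLabel]; drop the label before measuring distance
    distances = [(euclidean(row[:-1], testExample[:-1]), row[-1]) for row in trainData]
    distances.sort()  # closest neighbors first
    voteSum = sum(label for dist, label in distances[:k])
    # labels are -1 and 1, so the sign of the sum gives the majority vote;
    # a tie (possible for even k) falls to -1 here; any consistent rule works
    return 1 if voteSum > 0 else -1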


Implementation: main program

You should implement the main program in leukemiaPredictor.py. This program should read the command-line arguments, load the data set, generate the cross-validation folds, classify each test example with k-nearest neighbors, and report the aggregate error measures described below.

Input

A user will input two arguments to run your program in the following order:

  1. k, the number of neighbors to consider
  2. The file name of the training set.
An example run with k=3 on the simple example data set:
$ python leukemiaPredictor.py 3 input/simple.csv
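
For instance, the two arguments could be read with sys.argv (a minimal sketch, with no error checking):

import sys

k = int(sys.argv[1])    # number of neighbors, e.g. 3
filename = sys.argv[2]  # training set file, e.g. input/simple.csv
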
I originally designed this lab to have you utilize a tuning set to pick k, but cut that out for brevity. If you choose to implement that as an extension, you may omit the first command-line argument; please state that you did this at the top of your program.

The provided leukemia data set comes from the paper referenced in class for using expression analysis to differentiate cancer patients:

T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531-537, 1999.

The data is in the input directory in the file titled leukemia.csv. Each line describes the expression values of one patient. Each column (except the last) is the expression value for one gene. If you are curious, the descriptors for the columns are available in /home/soni/public/cs68/human_genes.txt. The last column is the class for the sample: -1 for acute lymphoblastic leukemia (ALL) and 1 for acute myeloid leukemia (AML).
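
As a sketch of one way to read this format into a two-dimensional list (the function name loadData is mine, and I assume there is no header line):

def loadData(filename):
    # each returned row is [expr1, ..., exprM, classLabel]
    data = []
    for line in open(filename):
        data.append([float(x) for x in line.strip().split(',')])
    return data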

Experimental Methodology

To obtain an accurate estimate of classifier error, we must validate the algorithm on data separate from the data used for training. As discussed in class, a typical methodology that accomplishes this while still utilizing all of the data is n-fold cross-validation. You should use 5 folds for this lab. You do not need to worry about using a tuning set, as we will assume k is set by some external piece of information.

To perform proper cross-validation, each example should appear in the test set of exactly one fold, no more, no less. When not in the test set, it should be in the training set for that fold. In addition, the relative sizes of the training/test splits should be about the same across folds (since 72 is not evenly divisible by 5, three of your test sets will have 14 data points and two will have 15). Some methodologies also balance the ratio of positive to negative examples in each test set, but we'll skip that for this lab.

In Python, you should utilize the random.shuffle function. The following code should provide guidance; I assume the entire data set is in a two-dimensional list origData.

import random

def generateFolds(numFolds, origData):
    N = len(origData)
    indexList = list(range(N))  # row indices 0 to N-1
    random.shuffle(indexList)   # shuffle the indices in place

    shuffledData = []
    for index in indexList:
        shuffledData.append(origData[index])  # rearrange the rows

    foldSize = float(N) / numFolds  # float division (works in Python 2 and 3)

    trainFolds = []
    testFolds = []
    for i in range(numFolds):
        leftEnd = int(i * foldSize)         # round down to an int
        rightEnd = int((i + 1) * foldSize)  # round down to an int
        testFolds.append(shuffledData[leftEnd:rightEnd])
        # everything before leftEnd plus everything from rightEnd on trains this fold
        trainFolds.append(shuffledData[:leftEnd] + shuffledData[rightEnd:])

    return [trainFolds, testFolds]
Note the handling of floats and ints in the fold-size and index-range formulas; slicing as above also avoids any trouble with Python's negative indexing for lists. You will only need to run this function once, as it generates all of the folds for you. There are many ways to perform cross-validation; this is just one. Be sure to implement this function in classifiers.py.
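
Putting the pieces together, a possible main loop might look like the following (using the hypothetical loadData and knnClassify functions sketched earlier):

k = 5
origData = loadData('input/leukemia.csv')
trainFolds, testFolds = generateFolds(5, origData)
for trainSet, testSet in zip(trainFolds, testFolds):
    for example in testSet:
        prediction = knnClassify(k, trainSet, example)
        # compare prediction to example[-1] and update TP, TN, FP, FN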

Output

You should maintain cumulative counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the folds. As a reminder, a true positive is a positive example correctly classified as positive, while a false positive is a negative example incorrectly predicted to be positive. To aggregate over five-fold cross-validation, simply sum the true positive counts of the individual folds to obtain the final true positive count (and likewise for the other three counts). The output of your program should include the following measures of error, where N is the total number of predictions (since every example appears in the test set of exactly one fold, N is simply the size of the input data set):

  Accuracy: (TP + TN) / N
  Sensitivity: TP / (TP + FN)
  Specificity: TN / (TN + FP)

Here, positive refers to AML (class 1) and negative refers to ALL (class -1).
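
Given the aggregated counts, the three reported measures could be computed as follows (a sketch; the function name is mine):

def reportResults(TP, TN, FP, FN):
    N = TP + TN + FP + FN  # total number of predictions
    print("Accuracy: %.2f" % ((TP + TN) / float(N)))
    print("Sensitivity: %.2f" % (TP / float(TP + FN)))
    print("Specificity: %.2f" % (TN / float(TN + FP)))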

Sample Results

In the input directory, you'll find a toy data set titled simple.csv. A sample run with k=2 and 5-fold cross-validation produces the following output:

$ python leukemiaPredictor.py 2 input/simple.csv
Classification using k-nearest neighbors with k=2 and 5 folds

Results:

Accuracy: 0.80
Sensitivity: 0.67
Specificity: 1.00
To help with debugging, take a look at the intermediate values produced by the algorithm. Note that randomization may produce a different ordering for your folds. However, since 5-fold cross-validation is the same as leave-one-out on this data set, you should get exactly the same results, just with the folds in a shuffled order.

A sample run with the leukemia data set would look similar to:

$ python leukemiaPredictor.py 5 input/leukemia.csv
Classification using k-nearest neighbors with k=5 and 5 folds

Results:

Accuracy: 0.93
Sensitivity: 0.86
Specificity: 0.98
Your results will differ as the folds are determined randomly, but the error rates should be similar.


Experimental Evaluation

In your README file, briefly answer the following questions:
  1. Play around with a few values of k (e.g., 1 to 10). What values of k seem to perform best?

  2. Why is this not a legitimate method for picking the best k to deploy in your published model? What will happen to generalization error estimates?

  3. In WEKA, load the leukemia data in /home/soni/public/cs68/leukemiaPatients.arff. Run the classifier J48, a variation on the decision tree algorithm from class. Use 5-fold cross-validation to measure error. How does the performance compare to your k-nearest neighbors algorithm? Do you have an idea why this discrepancy exists? (HINT: all features are continuous values; how does this influence each algorithm?)

  4. J48 performs pruning, cutting off some branches after training has completed. Why is this a good idea?

  5. Look at the output tree. Which genes seem to be most influential?

Submitting your work

When you are finished with your lab, submit your code using handin68.