CS68 Lab 5: Gene Expression Analysis with Classification

Due by 11:59pm, Sunday April 2, 2017
Overview

This lab will continue to build on the themes from last week's lab:

This week's lab will have you use classification algorithms to predict prognosis for leukemia patients using gene expression data. In particular, you will implement k-nearest neighbors, a simple yet effective classification algorithm. You will employ appropriate cross-validation methodology both for model selection and for estimating generalization error. Use good design practice by creating a library of classification algorithms in the file classifiers.py and putting your main program in the file leukemiaPredictor.py. Note that this lab places a greater emphasis on methodology and analysis than on implementation (k-nearest neighbors is fairly straightforward).

Getting Started

Find your git repo for this lab assignment in CS68-S17; its name has the format Lab5-id1_id2, where id1 and id2 are replaced by the user ids of you and your partner.

Clone your git repo with the lab 5 starting point files into your labs directory:

$ cd cs68/labs
$ git clone [the ssh url to your repo]
Then cd into your Lab5-id1_id2 subdirectory. You will have the following files (those in blue require your modification):


Implementation: classifier library

As with the previous lab, you will create a library of classifiers that you can call interchangeably in any application. This file, at a minimum, should implement general algorithms including:

Algorithm

You are to implement the standard k-nearest neighbors algorithm discussed in lecture. To guide your algorithm design, follow these tips/requirements:


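To make the expected interface concrete, here is a minimal sketch of one possible k-nearest neighbors classifier. It assumes Euclidean distance over the feature columns, a simple majority vote among the k closest training rows, and ties broken toward the positive class; the function and parameter names are illustrative, not required.

import math

def euclideanDistance(x, y):
    """Distance between two feature vectors (class label excluded)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def knnClassify(k, trainData, testExample):
    """Predict -1 or 1 for testExample using its k nearest training rows.

    Each row is assumed to be [feature_1, ..., feature_m, label]."""
    # Pair each training row's distance to the test example with its label.
    distances = [(euclideanDistance(row[:-1], testExample[:-1]), row[-1])
                 for row in trainData]
    distances.sort(key=lambda pair: pair[0])
    neighbors = distances[:k]

    # Majority vote over the neighbor labels; ties go to the positive class here.
    vote = sum(label for _, label in neighbors)
    return 1 if vote >= 0 else -1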
Implementation: main program

You should implement the main program in leukemiaPredictor.py. This program should

Input

A user will input two arguments to run your program in the following order:

  1. k, the number of neighbors to consider
  2. The file name of the training set.
An example run with k=3 on the simple example data set:
$ python leukemiaPredictor.py 3 input/simple.csv
I originally designed this lab to have you utilize a tuning set to pick k, but cut that out for brevity. If you choose to implement that as an extension, you can omit the first command-line argument. Please state that you did this at the top of your program.
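One way to read these arguments is sketched below using sys.argv; the function name and error handling are illustrative only, and you would adjust the argument count if you implement the tuning-set extension and drop k.

import sys

def parseArgs():
    """Read k and the training-file name from the command line."""
    if len(sys.argv) != 3:
        print("Usage: python leukemiaPredictor.py <k> <training file>")
        sys.exit(1)
    k = int(sys.argv[1])
    filename = sys.argv[2]
    return k, filename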

The provided leukemia data set comes from the paper referenced in class for using expression analysis to differentiate cancer patients:

T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S. Lander. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science, 286(5439):531-537, 1999.

The data is in your lab's input directory in the file titled leukemia.csv. Each line describes the expression values of one patient. Each column (except the last) is the expression value for one gene. Details can be found here. If you are curious, the descriptors for the columns are also available in /home/soni/public/cs68/human_genes.txt. The last column is the class for the sample: -1 for acute lymphoblastic leukemia (ALL) and 1 for acute myeloid leukemia (AML).
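A minimal loader sketch follows, assuming comma-separated values with the class label in the last column as described above (the function name is illustrative):

def loadData(filename):
    """Return a 2-D list: each row is [expr_1, ..., expr_m, label]."""
    data = []
    with open(filename) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            values = line.split(",")
            row = [float(v) for v in values[:-1]]   # expression values
            row.append(int(float(values[-1])))      # class: -1 (ALL) or 1 (AML)
            data.append(row)
    return data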

Experimental Methodology

To provide an accurate estimate of classifier error, we must validate the algorithm on a separate set of data from that used for training. As discussed in class, a typical methodology that accomplishes this while still utilizing all of the data is n-fold cross-validation. You should use 5 folds for this lab. You do not need to worry about using a tuning set, as we will assume k is set by some external piece of information.

To perform proper cross-validation, each example should appear in the test set in exactly one fold, no more, no less. When not in the test set, it should be in the training set for that fold. In addition, the relative size of each training/test split should be about the same for each fold (since 72 is not evenly divisible by 5, 3 of your test sets will have 14 data points and 2 will have 15). Some methodologies also balance the ratio of positive to negative examples in the test set, but we'll skip that for this lab.

In Python, you should utilize the random.shuffle function. The following sketch should provide guidance. The argument origData is a 2-dimensional list where the first dimension refers to the data points and the second dimension to the features.

import random

def generateFolds(numFolds, origData):
    """Split origData into numFolds train/test splits for cross-validation."""
    N = len(origData)
    indexList = list(range(N))      # list of numbers from 0 to N-1
    random.shuffle(indexList)       # list of integers all shuffled up

    # Rearrange the rows according to the shuffled indices.
    shuffledData = [origData[i] for i in indexList]

    foldSize = float(N) / numFolds  # use float division

    trainFolds = []
    testFolds = []
    for i in range(numFolds):
        leftEnd = int(i * foldSize)          # round down to an int
        rightEnd = int((i + 1) * foldSize)   # round down to an int
        testFolds.append(shuffledData[leftEnd:rightEnd])
        # Everything before leftEnd and after rightEnd goes into training.
        trainFolds.append(shuffledData[:leftEnd] + shuffledData[rightEnd:])

    return [trainFolds, testFolds]
Note that you will have to deal with floats and ints for the fold size and index-range formulas. Also, be careful of Python's negative indexing for lists. You will only need to run this function once, as it generates all of the folds for you. There are many ways to perform cross-validation; this is just one. Be sure to implement this function in classifiers.py.
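Once the folds are generated, the cross-validation driver in your main program might look roughly like the following. This is a sketch only; it assumes the illustrative knnClassify helper from the Algorithm section and accumulates the TP/TN/FP/FN counts described in the Output section below.

def runCrossValidation(k, data, numFolds=5):
    """Accumulate confusion counts over all folds."""
    trainFolds, testFolds = generateFolds(numFolds, data)
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}

    for trainData, testData in zip(trainFolds, testFolds):
        for example in testData:
            prediction = knnClassify(k, trainData, example)
            actual = example[-1]
            if actual == 1:
                counts["TP" if prediction == 1 else "FN"] += 1
            else:
                counts["TN" if prediction == -1 else "FP"] += 1

    return counts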

Output

You should maintain the cumulative counts of true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) across the folds. As a reminder, a true positive is an example that is correctly classified as positive, while a false positive is a negative example incorrectly predicted to be positive. To aggregate for five-fold cross-validation, simply sum the number of true positives across the folds to obtain the final true positive count (and likewise for the other counts). The output of your program should include the following measures, where N is the number of total predictions:

  1. Accuracy: (TP + TN) / N
  2. Sensitivity: TP / (TP + FN)
  3. Specificity: TN / (TN + FP)

Here, positive refers to AML (class 1) and negative refers to ALL (class -1). N is the total number of test examples, which is simply the size of the input data set. Sensitivity and specificity are common in clinical research as they can often be more informative than accuracy alone; they place more focus on analyzing the positive examples, which is important when studying rare diseases. They are related to precision and recall, which are the preferred metrics in the natural language processing community.
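From the accumulated counts, the three measures above can be computed directly. The sketch below assumes the counts dictionary used earlier; guard against division by zero as you see fit.

def reportResults(counts):
    """Print accuracy, sensitivity, and specificity from confusion counts."""
    TP, TN = counts["TP"], counts["TN"]
    FP, FN = counts["FP"], counts["FN"]
    N = TP + TN + FP + FN

    print("Accuracy: %.2f" % ((TP + TN) / float(N)))
    print("Sensitivity: %.2f" % (TP / float(TP + FN)))
    print("Specificity: %.2f" % (TN / float(TN + FP)))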

Sample Results

In your directory, you'll find a toy data set titled simple.csv. A sample run with k=2 and 5-fold cross-validation produces the following output:

$ python leukemiaPredictor.py 2 simple.csv
Classification using k-nearest neighbors with k=2 and 5 folds

Results:

Accuracy: 0.80
Sensitivity: 0.67
Specificity: 1.00
To help with debugging, take a look at the intermediate values produced by the algorithm. Note that randomization may produce a different ordering for your folds. However, since 5-fold cross-validation is the same as leave-one-out on this example, you should obtain exactly the same results, just with the folds in a different order.

A sample run with the leukemia data set would look similar to:

$ python leukemiaPredictor.py 5 leukemia.csv
Classification using k-nearest neighbors with k=5 and 5 folds

Results:

Accuracy: 0.93
Sensitivity: 0.86
Specificity: 0.98
Your results will differ as the folds are determined randomly, but the error rates should be similar.


Experimental Evaluation
In a file named experiments.pdf, briefly address the following questions:

Selecting k

  1. Try out a few values of k (e.g., 1 to 10) and provide a table of results.

  2. What values of k seem to perform best?

  3. Why is this not a legitimate strategy for selecting the best k to deploy in your published model (how does this affect generalization error)?

Decision Trees

In WEKA, load the leukemia data (/home/soni/public/cs68/leukemiaPatients.arff) using the Explorer module. Under the Classify tab, hit Choose and then pick J48 under trees (this is an implementation of C4.5). Under Test options, select 5-fold cross-validation.
  1. Compare J48 under two settings: with pruning (default settings) and without pruning (change unpruned to True and minNumObj to 1). What happens to the size of the tree? Which is more accurate?

  2. How does J48 performance compare to your k-nearest neighbor algorithm? Briefly explain why you think one algorithm did better than the other (HINT: consider the data types of the features)?

  3. Look at the output tree in the output log. Which genes are deemed to be most influential in the pruned version? You can parse the result buffer or right-click the Result list item and select Visualize Tree.

  4. Why might a biologist prefer decision trees over k-nearest neighbors even if the accuracy is lower for the decision tree?

  5. Let us try a third category of algorithm - ensemble classifiers. Hit Choose, then meta, and finally AdaBoostM1. Run using the default settings with 5-fold cross-validation. What is the accuracy of the model?

  6. Look at the learned models in the log. How is this different from J48 in terms of the type of models learned? Why might this account for the improvement in accuracy?

  7. There are many validation measures other than accuracy. Pick one of F-Measure, ROC Area, or Confusion Matrix. Describe what the measure is and why it might be more informative than plain accuracy (TIP: each of these has a good Wikipedia entry to start with).