Lab 6: Classification
Due March 21 by midnight

Starting point code

You are encouraged to work with a partner on this lab. Please read the following instructions carefully.

  1. First you need to run the setup63-Lisa script to create a git repository for the lab.

    If you want to work alone, run:

    setup63-Lisa labs/06 none

    If you want to work with a partner, then one of you needs to run the following command while the other waits until it finishes:

    setup63-Lisa labs/06 partnerUsername

    Once the script finishes, the other partner should run it on their account.

  2. For the next step, only one partner should copy over the starting point code:
    cd ~/cs63/labs/06
    cp -r /home/bryce/public/cs63/labs/06/* .
    

  3. Whether you are working alone or with a partner, you should now add, commit, and push these files as shown below:
    git add *
    git commit -m "lab6 start"
    git push
    

  4. If you are working with a partner, your partner can now pull the changes in:
    cd ~/cs63/labs/06
    git pull
    
You are now ready to start the lab.

Introduction

In this lab you will be implementing two classification algorithms: naive Bayes and k nearest neighbors.

You will also be testing a third algorithm: a support vector machine, which has been provided for you.

The only Python file you will need to modify is ClassificationModels.py. You have been provided with three data sets for testing the algorithms:

You should create small subsets of these data sets to use for incremental testing and debugging, and only run experiments on the full data sets when you are confident that you have implemented the algorithms correctly. I recommend starting with the house_votes data set as it has both the smallest number of training examples and the smallest input dimension.

Naive Bayes

The naive Bayes classifier uses the training data to estimate the following probabilities: the prior P(label = l) for each possible label, and the conditionals P(xi = vi | label = l) for each feature value and label.

The training data is combined with equiv_samples virtual examples drawn from a uniform prior, according to the m-estimate in equation 6.22 on p. 179 of Mitchell. An example is provided here.
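For concreteness, that smoothed estimate could be computed as in the sketch below. This is only an illustration; the function name and argument layout are made up for this example and are not part of the provided code.

    def m_estimate(count, total, num_values, equiv_samples):
        # Smoothed estimate of P(x_i = v | label = l): count is the number
        # of training examples with this label whose feature i has value v,
        # total is the number of training examples with this label, and the
        # uniform prior assigns probability 1/num_values to each value.
        prior = 1.0 / num_values
        return (count + equiv_samples * prior) / (total + equiv_samples)

For example, with 3 matching examples out of 10, two possible feature values, and equiv_samples = 1, this gives (3 + 0.5) / 11, roughly 0.318, instead of the raw estimate 0.3.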

To classify a test point, the probability P(label = l | x1 = v1, x2 = v2, ... ) is computed from these using Bayes rule. Technically Bayes rule also requires P(x1 = v1, x2 = v2, ... ), but because this same term appears in the denominator for all labels, it can be ignored. The class returned for a test point is the label with the highest likelihood according to the naive Bayes estimate (from equation 6.20 on p. 177 of Mitchell):

P(label = l | x = v) ∝ P(label = l) * P(x1 = v1 | label = l) * P(x2 = v2 | label = l) * ...
This estimate is naive in that all of the xi terms are treated as independent (and therefore multiplied together). Your task is to implement two methods: train() and predict().
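For intuition, the sketch below shows one way predict() might apply this rule; the dictionary layout (prior and conditional) is an assumption made for this illustration, not the interface in ClassificationModels.py. It sums log-probabilities, which is equivalent to multiplying the probabilities above but avoids floating-point underflow.

    import math

    def naive_bayes_predict(point, labels, prior, conditional):
        # prior[l] estimates P(label = l); conditional[(i, v, l)] estimates
        # P(x_i = v | label = l). Returns the label with the highest
        # naive Bayes score for the given test point.
        best_label, best_score = None, float("-inf")
        for l in labels:
            score = math.log(prior[l])
            for i, v in enumerate(point):
                score += math.log(conditional[(i, v, l)])
            if score > best_score:
                best_label, best_score = l, score
        return best_label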

I strongly recommend decomposing both of these functions into some helper functions that will be easier to debug. Some possible ideas for helper functions:

K Nearest Neighbors

The k nearest neighbors algorithm classifies test points according to a plurality vote over the closest training points. Like naive Bayes, you will implement train() and predict() functions, but in this case, the train() function will do very little work, shifting most of the computation time to predict(). The predict() function again needs to classify a new test point, and does so by identifying the k training points closest by Euclidean distance to the test point. In the case of a tie for the k-th nearest point, all tied points should be included in the neighbor set.

The most common label across the k closest training points is returned as the predicted output. There are various possible methods for breaking ties, but the one we will use is to iteratively remove the most distant of the k points until the tie is broken. This is guaranteed to eventually produce a unique label, since at worst the neighborhood shrinks to a single point.

Again, I strongly recommend breaking down the predict() function into smaller tasks. One useful helper function might be neighbors(), which would return the set of >=k points closest to the test input.
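As an illustration only (not the required decomposition, and not the lab's API), the whole prediction described above might be sketched as follows; the argument names and data layout are assumptions made for this example.

    import math
    from collections import Counter

    def knn_predict(test_point, train_points, train_labels, k):
        # Sort training points by Euclidean distance to the test point.
        dists = sorted(((math.dist(test_point, p), label)
                        for p, label in zip(train_points, train_labels)),
                       key=lambda pair: pair[0])
        # Keep the k nearest, plus any points tied with the k-th nearest.
        cutoff = dists[k - 1][0]
        neighbors = [pair for pair in dists if pair[0] <= cutoff]
        # Plurality vote; on a tie, drop the farthest remaining neighbor.
        while True:
            counts = Counter(label for _, label in neighbors)
            ranked = counts.most_common()
            if len(ranked) == 1 or ranked[0][1] > ranked[1][1]:
                return ranked[0][0]
            neighbors.pop()

Note that math.dist requires Python 3.8 or later; computing the distance by hand works just as well.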

Cross Validation Testing

For both of the algorithms you implemented and for the support vector machine that has been provided, there is a free parameter that must be selected. One good way to do this is n-fold cross validation, which you will implement in the Classifier parent class from which the various classifier implementations inherit. The idea behind cross validation is to try n distinct training set / test set splits. For each split, 1/n of the data is used as the test set and the remainder is used for training. For each parameter value under consideration, we can train and test n times, and select the parameter value with the lowest average error. You will use cross validation to select among the following parameters for your experiments:

The optimal values for these parameters may vary across data sets. Because cross validation requires repeated training and testing of the model, it may be rather slow. If it's taking too long, first try reducing the number of folds (if you are using more than 3). Then you can try running on a random subset of the data, though not as small a subset as you use for debugging.

The cross validation task has already been broken down slightly with the helper function test(), which should return a success rate on the portion of the data indicated as the test set (according to the member variable test_start). You are, of course, welcome to decompose the task further.
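The overall loop might look roughly like the sketch below. It assumes a simple train()/predict() interface, contiguous folds, and a fresh model per fold; it is meant to convey the structure of n-fold cross validation, not the exact Classifier methods or member variables you will use.

    def cross_validation_error(make_model, data, labels, n_folds):
        # make_model() builds a fresh classifier with fixed parameter values.
        # Each fold uses one contiguous 1/n_folds slice of the data as the
        # test set and the rest as the training set.
        fold_size = len(data) // n_folds
        errors = []
        for fold in range(n_folds):
            start, end = fold * fold_size, (fold + 1) * fold_size
            train_X = data[:start] + data[end:]
            train_y = labels[:start] + labels[end:]
            model = make_model()
            model.train(train_X, train_y)
            wrong = sum(model.predict(x) != y
                        for x, y in zip(data[start:end], labels[start:end]))
            errors.append(wrong / (end - start))
        return sum(errors) / n_folds

Running this for each candidate parameter value and keeping the value with the lowest average error carries out the selection described above.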

Experimental Comparison of Algorithms

Once you have chosen parameters for each algorithm, you should compare the different algorithms on each of the classification tasks. In the file experiments.txt, report the error rate of each classifier on each data set, as well as the parameters used. In addition, try to identify which data points are hard for each of the algorithms: is the classifier making a systematic mistake, or do the errors seem to be the result of random noise?

Submitting your code

Before submitting, ensure that you have added your name(s) to the top-level comments. To submit your code, you need to use git to add, commit, and push the files that you modified.
cd ~/cs63/labs/06
git add *.py
git commit -m "final version"
git push