Lab 6: Classification

Lab 6: Classification
Due March 21 by midnight

Starting point code

You are encouraged to work with a partner on this lab. Please read the following instructions carefully.

First you need to run setup63 to create a git repository for the lab.
If you want to work alone do:
```
setup63-Lisa labs/06 none
```
If you want to work with a partner, then one of you needs to run the following command while the other one waits until it finishes:
```
setup63-Lisa labs/06 partnerUsername
```
Once the script finishes, the other partner should run it on their account.
For the next step, only one partner should copy over the starting point code:
```
cd ~/cs63/labs/06
cp -r /home/bryce/public/cs63/labs/06/* .
```
Whether you are working alone or with a partner, you should now add, commit, and push these files as shown below:
```
git add *
git commit -m "lab6 start"
git push
```
If you are working with a partner, your partner can now pull the changes in:
```
cd ~/cs63/labs/06
git pull
```

You are now ready to start the lab.

Introduction

In this lab you will be implementing two classification algorithms:

naive Bayes
k nearest neighbors

You will also be testing a third algorithm:

support vector machines

The only Python file you will need to modify is ClassificationModels.py. You have been provided with three data sets for testing the algorithms:

optdigits: classification of handwritten digits
spambase: detecting spam emails. Note that there are two versions of this data set, with continuous and boolean inputs respectively.
house_votes: determining a congressperson's party by voting record

You should create small subsets of these data sets to use for incremental testing and debugging, and only run experiments on the full data sets when you are confident that you have implemented the algorithms correctly. I recommend starting with the house_votes data set as it has both the smallest number of training examples and the smallest input dimension.

Naive Bayes

The Naive Bayes classifier uses the training data to estimate the following probabilities:

P(label = l) for each possible l
P(xi = v | label = l) for each label l, input dimension xi, and value v for that dimension.

The training data is combined with equiv_samples from a uniform prior according to equation 6.22 on p. 179 of Mitchell. An example is provided here.

To classify a test point, the probability P(label = l | x1 = v1, x2 = v2, ... ) is computed from these using Bayes rule. Technically Bayes rule also requires P(x1 = v1, x2 = v2, ... ), but because this same term appears in the denominator for all labels, it can be ignored. The class returned for a test point is the label with the highest likelyhood according to the naive Bayes estimate (from equation 6.20 on p. 177 of Mitchell):

P(label = l | x = v) = P(label = l) * P(x1 = v1 | label = l) * P(x2 = v2 | label = l) * ...

This estimate is naive in that all of the xi terms are treated as independent (and therefore multiplied together). Your task is to implement two methods:

train() combines the data set with a uniform prior to estimate and store all of the required probabilities.
predict() uses the stored probability estimates to return a label for a test point.

I strongly recommend decomposing both of these functions into some helper functions that will be easier to debug. Some possible ideas for helper functions:

posterior() combines an empirical frequency with a prior according to equation 6.22 to return a posterior distribution.
label_prob() computes the probability estimate for a specific label conditional on a specific test point.

K Nearest Neighbors

The k nearest neighbors algorithm classifies test points according to a plurality vote over the closest training points. Like naive Bayes, you will implement train() and predict() functions, but in this case, the train() function will do very little work, shifting most of the computation time to predict(). The predict() function again needs to classify a new test point, and does so by identifying the k training points closest by Euclidian distance to the test point. In the case of a tie for the k-th nearest point, all tied points should be included in the neighbor set.

The most-common label across the k closest training points is returned as the predicted output. There are various possible methods for breaking ties, but the one we will use is to iteratively remove the most distant of the k points until the tie is broken. This is guaranteed to eventually return a unique label when the neighborhood gets down to a single point.

Again, I strongly recommend breaking down the predict() function into smaller tasks. One useful helper function might be neighbors(), which would return the set of >=k points closest to the test input.

Cross Validation Testing

For both of the algorithms you implemented and for the the support vector machine that has been provided, there is a free parameter that must be selected. One good way to do this is by n-fold cross validation, which you will implement in the Classifier parent class from which the various classifier implementations inherit. The idea behind cross validation is to try n distinct training set / test set splits. For each split, 1/n of the data is used as the test set and the remainder is used for training. For each parameter value under consideration, we can train and test n times, and select the parameter value with the lowest average error. You will do use cross validation to select among the following parameters for your experiments:

naive Bayes: equiv_samples, which determines how much weight is given to the uniform prior (relative to the training data). Try using values of 1, 10, and 100.
k nearest neighbors: k, the number of nearby points over which to perform the plurality vote. Try using values of 1, 3, and 5.
support vector machine: the kernel that is used to transform the input allowing for non-linear decision boundaries. Try using "linear", "poly", "rbf", and "sigmoid".

The optimal value for these parameters may vary across data sets. Because cross validation requires repeated training and testing of the model, it may be rather slow. If it's taking too long, first, try reducing the number of folds (if using >3). Then you can try running on a random subset of the data, though not as small a subset as you use for debugging.

The cross validation task has already been broken down slightly with the helper function test(), which should return a success rate on the portion of the data indicated as the test set (according to the member variable test_start). You are, of course, welcome to decompose the task further.

Experimental Comparison of Algorithms

Once you have chosen parameters for each algorithm, you should compare the different algorithms on each of the classification tasks. In the file experiments.txt, report the error rate of each classifier on each data set, as well as the parameters used. In addition, try to identify which data points are hard for each of the algorithms: is the classifier making a systematic mistake, or do the errors seem to be the result of random noise?

Submitting your code

Before submitting, ensure that you have added your name(s) to the top-level comments. To submit your code, you need to use git to add, commit, and push the files that you modified.

cd ~/cs63/labs/06
git add *.py
git commit -m "final version"
git push