Lab 8: Classification
Due April 3 by midnight

Introduction

In this lab you will be implementing two classification algorithms:

- k nearest neighbors
- naive Bayes

You will also be testing a third algorithm:

- a support vector machine, which is provided for you

The only Python file you will need to modify is ClassificationModels.py. You will be provided with three data sets for testing the algorithms:

- house_votes: predicting party affiliation (democrat/republican) from congressional voting records
- spambase: classifying email as spam or not spam
- a handwritten digit recognition data set, with class labels 0-9

For each data set, there are three files:

- The .summary file is a description of the data set that may be interesting to look over, but it will not be used by our classifiers.
- The .data file contains the actual data set to be learned. Each line of the data file is a comma-separated list of numbers representing an input/output pair. All but the last value on a line constitute an input vector. The last value on each line is the class label: +1/-1 for democrat/republican, 1/0 for spam/not spam, or 0–9 for handwritten digits.
- The .debug file contains a small random subset of the .data file, to be used for incremental implementation and testing.

The spambase data set additionally comes in two versions, one with boolean features (a word does/does not appear) and one with continuous features (the frequency with which a word appears).
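
For concreteness, reading one of these files might look something like the sketch below. The name load_data is just a placeholder (the starter code may already provide something equivalent), and the sketch assumes every line is purely numeric with the class label last.

def load_data(filename):
    """Return a list of (input_vector, label) pairs from a .data file."""
    examples = []
    with open(filename) as f:
        for line in f:
            if not line.strip():
                continue  # skip any blank lines
            values = [float(v) for v in line.strip().split(",")]
            examples.append((values[:-1], values[-1]))
    return examples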

These files are located in my public directory. You can grab them with the following command:

cp ~bryce/public/cs63/lab08/* .
I recommend starting your incremental testing with the house_votes data set as it has both the smallest number of training examples and the smallest input dimension.

Classification Algorithms

For each classification algorithm, we require the following three methods, which should look familiar from lab 7:

- train(), which fits the model to a set of training examples
- predict(), which returns a predicted label for a single test point
- test(), which returns the success rate on a set of test examples

The test() method will have the same implementation for all three algorithms, and is therefore implemented in the parent class, Classifier. You are encouraged to decompose these tasks into smaller units and write helper functions that will be easier to debug.
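
Since the success rate is just the fraction of test points classified correctly, the shared test() method amounts to something like the sketch below. The exact signatures in the starter code may differ; this assumes test() receives parallel lists of inputs and labels and that predict() handles one point at a time.

class Classifier:
    """Parent class for the classifiers; test() is shared by all of them."""

    def test(self, x_test, y_test):
        # Count the test points whose predicted label matches the true label.
        correct = sum(1 for x, y in zip(x_test, y_test) if self.predict(x) == y)
        return correct / len(y_test)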

K Nearest Neighbors

The k nearest neighbors algorithm classifies test points according to a plurality vote over the closest training points. The train() function for KNN is quite simple. Its job is simply to normalize the input vectors and store them in a data structure (in our case, a list). The purpose of normalization is to ensure that no single dimension swamps the Euclidean distance calculation performed in the predict() function. Each feature (dimension) of the training inputs needs to be linearly re-scaled so that its values range from 0 to 1. The same normalization will be applied to test points later, so the normalization constants must be stored.
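
As a rough illustration, train() and a normalize() helper might look like the sketch below (written as methods of the KNN class). The attribute names self.mins, self.ranges, and self.examples are placeholders, not names taken from the starter code.

def train(self, x_train, y_train):
    # Record each feature's minimum and range so test points can be rescaled later.
    dims = len(x_train[0])
    self.mins = [min(x[d] for x in x_train) for d in range(dims)]
    maxs = [max(x[d] for x in x_train) for d in range(dims)]
    # Guard against constant features so normalize() never divides by zero.
    self.ranges = [mx - mn if mx > mn else 1.0 for mn, mx in zip(self.mins, maxs)]
    # Store normalized copies of the training inputs along with their labels.
    self.examples = [(self.normalize(x), y) for x, y in zip(x_train, y_train)]

def normalize(self, x):
    # Linearly rescale each feature to the 0-1 range seen in the training data.
    return [(xi - mn) / rng for xi, mn, rng in zip(x, self.mins, self.ranges)]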

The predict() function needs to classify a new test point, and does so by identifying the k training points closest by Euclidean distance to the (normalized) test point. In the case of a tie for the k-th nearest point, all tied points should be included in the neighbor set.

The most-common label across the k closest training points is returned as the predicted output. There are various possible methods for breaking ties, but the one we will use is to iteratively remove the most distant of the k points until the tie is broken. This is guaranteed to eventually return a unique label when the neighborhood gets down to a single point.

As mentioned above, you are encouraged to break down these functions, especially predict(). Useful helper functions might include normalize() to convert a test point to the same scale as the training set and neighbors() to find the set of >=k points closest to the test input.
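
Continuing the sketch above, one possible (but by no means required) way to organize these helpers is shown below. It uses math.dist and collections.Counter, and assumes the (normalized input, label) pairs are stored in self.examples and the chosen number of neighbors in self.k.

import math
from collections import Counter

def neighbors(self, point):
    # All training points at least as close as the k-th nearest neighbor,
    # so that ties for k-th place are included.
    dists = sorted((math.dist(point, x), y) for x, y in self.examples)
    cutoff = dists[self.k - 1][0]
    return [(d, y) for d, y in dists if d <= cutoff]

def predict(self, x_test):
    nearest = self.neighbors(self.normalize(x_test))
    while True:
        votes = Counter(y for _, y in nearest)
        (top, top_count), *rest = votes.most_common()
        # Return the plurality label unless it is tied with another label.
        if not rest or rest[0][1] < top_count:
            return top
        # Tie: drop the most distant remaining neighbor and vote again.
        nearest = nearest[:-1]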

Naive Bayes

The Naive Bayes classifier uses the training data to estimate the following probabilities:

- P(label = l): the prior probability of each class label
- P(x_i = v_i | label = l): the probability of each feature value, conditioned on each label

The training data is combined with equiv_samples worth of pseudo-observations from a uniform prior to avoid overfitting; one common form of this estimate is sketched below.
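
This kind of smoothing is sometimes called an m-estimate. Treat the sketch below as an illustration of the idea rather than a specification, since the exact formula expected by the starter code may differ.

def smoothed_conditional(value_and_label_count, label_count, num_values, equiv_samples):
    """Estimate P(x_i = v | label) by blending observed counts with a uniform prior.

    value_and_label_count: training examples with this label where feature i equals v
    label_count:           training examples with this label
    num_values:            number of distinct values feature i can take
    equiv_samples:         weight of the uniform prior, in "virtual" examples
    """
    prior = 1.0 / num_values
    return (value_and_label_count + equiv_samples * prior) / (label_count + equiv_samples)

With equiv_samples = 0 this reduces to the raw frequency; as equiv_samples grows, the estimate is pulled toward the uniform value 1/num_values, which keeps rarely (or never) observed feature values from getting probability zero.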

To classify a test point, the naive probability of each label P(label = l | x1 = v1, x2 = v2, ... ) is computed from these using Bayes rule, and the highest-probability label is returned. Technically Bayes rule also requires a normalization factor, P(x1 = v1, x2 = v2, ... ), but because this same term appears in the denominator for all labels, it can be ignored. The naive Bayes estimate pretends that all features are independent, and therefore estimates label probabilities as the product of each specific feature's probability:

P(label = l | x_1 = v_1, x_2 = v_2, ...) ∝ P(label = l) * P(x_1 = v_1 | label = l) * P(x_2 = v_2 | label = l) * ...

Some possible ideas for helper functions (a rough sketch follows this list):

- a function that estimates the prior probability of each label from the training data
- a function that estimates the conditional probability of a feature value given a label
- a function that computes the (unnormalized) score of a single label for a test point
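
For example, if train() has already filled in the estimated probabilities, predict() could be organized roughly as follows. The structures self.priors (label to prior probability) and self.conditionals (label to per-feature dictionaries of value to probability) are hypothetical, not names from the starter code, and the sketch assumes discrete feature values.

import math

def predict(self, x_test):
    best_label, best_score = None, float("-inf")
    for label, prior in self.priors.items():
        # Sum log probabilities rather than multiplying raw probabilities,
        # which avoids numerical underflow when there are many features.
        score = math.log(prior)
        for i, value in enumerate(x_test):
            score += math.log(self.conditionals[label][i][value])
        if score > best_score:
            best_label, best_score = label, score
    return best_label

Working in log space is optional; multiplying the probabilities directly, as in the formula above, is fine for the smaller data sets.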

Testing and Cross Validation

In addition to incrementally testing and debugging your code, you should write a main function that verifies the correctness of your implementations. One particularly useful check for naive Bayes would be to ensure that the probabilities for each label sum to 1, both unconditionally and conditional on each independent input value.
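
One version of this kind of sanity check, written against the hypothetical self.priors and self.conditionals structures from the sketch above, is to verify that every distribution the model stores is properly normalized:

def check_probabilities(model, tolerance=1e-6):
    # The prior distribution over labels should sum to 1.
    assert abs(sum(model.priors.values()) - 1.0) < tolerance
    # For each label and each feature, the conditional distribution over that
    # feature's values should also sum to 1.
    for label, per_feature in model.conditionals.items():
        for i, value_probs in enumerate(per_feature):
            assert abs(sum(value_probs.values()) - 1.0) < tolerance, (label, i)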

Further, for both of the algorithms you implemented and for the support vector machine that has been provided, there is a free parameter that must be selected. One good way to do this is by n-fold cross validation, which you will implement in the Classifier parent class from which the various classifier implementations inherit. The idea behind cross validation is to try n distinct training set / test set splits. For each split, 1/n of the data is used as the test set and the remainder is used for training. For each parameter value under consideration, we can train and test n times, and select the parameter value with the lowest average error. You will use cross validation to select values for the following parameters in your experiments:

- k, the number of neighbors used by k nearest neighbors
- equiv_samples, the weight given to the uniform prior by naive Bayes
- the free parameter of the provided support vector machine

The optimal values for these parameters may vary across data sets. Because cross validation requires repeated training and testing of the model, it may be rather slow. If it is taking too long, first try reducing the number of folds (if you are using more than 3). Then you can try running on a random subset of the data, though not as small a subset as you use for debugging.

The cross validation task has already been broken down slightly with the helper function test(), which should return a success rate on a given test set. You are, of course, welcome to decompose the task further.
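
A rough outline of one way the cross validation loop in the Classifier class could be structured is shown below. It assumes train() and test() take parallel lists of inputs and labels, as in the earlier sketches; the real method may need a different signature or name to match the starter code.

import random

def cross_validate(self, x_data, y_data, n_folds):
    """Return the average success rate over n_folds train/test splits."""
    indices = list(range(len(y_data)))
    random.shuffle(indices)
    # Deal the shuffled indices into n_folds roughly equal groups.
    folds = [indices[i::n_folds] for i in range(n_folds)]
    rates = []
    for fold in folds:
        held_out = set(fold)
        x_train = [x_data[i] for i in indices if i not in held_out]
        y_train = [y_data[i] for i in indices if i not in held_out]
        x_test = [x_data[i] for i in fold]
        y_test = [y_data[i] for i in fold]
        self.train(x_train, y_train)
        rates.append(self.test(x_test, y_test))
    return sum(rates) / n_folds

To choose a parameter, call this once per candidate value (on a freshly constructed classifier) and keep the value with the highest average success rate, i.e. the lowest average error.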

Experimental Comparison of Algorithms

Once you have chosen parameters for each algorithm, you should compare the different algorithms on each of the classification tasks. In the file experiments.txt, report the error rate of each classifier on each data set, as well as the parameters used. In addition, try to identify which data points are hard for each of the algorithms: is the classifier making a systematic mistake, or do the errors seem to be the result of random noise?

Submitting your code

Before submitting, ensure that you have added your name(s) to the top-level comments. To submit your code, you need to use git to add, commit, and push the files that you modified.
cd /my/lab/directory/
git add ClassificationModels.py experiments.txt
git commit -m "final version"
git push