You will also be testing a third algorithm:
The only Python file you will need to modify is ClassificationModels.py. You will be provided with three data sets for testing the algorithms:
For each data set, there are three files. The .summary file is a description of the data set that may be interesting to look over, but will not be used by our classifiers. The .data file contains the actual data set to be learned. Each line of the data file is a comma-separated list of numbers representing an input/output pair. All but the last value on a line constitute an input vector. The last value on each line is the class label: +1/-1 for democrat/republican, 1/0 for spam/not spam, or 0–9 for handwritten digits. The .debug file contains a small random subset of the .data file, to be used for incremental implementation and testing. The spambase data set additionally has two versions, one with boolean features (a word does/does not appear) and one with continuous features (the frequency with which a word appears).
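As a concrete illustration of this file format, here is a minimal sketch of a loader; the function name `load_data` is hypothetical and not part of the provided code:

```python
# Hypothetical loader for a .data file: each line is a comma-separated
# list of numbers, with the final value being the class label.
def load_data(filename):
    """Return a list of (input_vector, label) pairs."""
    examples = []
    with open(filename) as f:
        for line in f:
            values = [float(v) for v in line.strip().split(",")]
            examples.append((values[:-1], values[-1]))
    return examples
```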
These files are located in my public directory. You can grab them with the following command:
cp ~bryce/public/cs63/lab08/* .

I recommend starting your incremental testing with the house_votes data set, as it has both the smallest number of training examples and the smallest input dimension.
The test() method will have the same implementation for all three algorithms, and is therefore implemented in the parent class, Classifier. You are encouraged to decompose these tasks into smaller units and write helper functions that will be easier to debug.
The predict() function needs to classify a new test point, and does so by identifying the k training points closest by Euclidean distance to the (normalized) test point. In the case of a tie for the k-th nearest point, all tied points should be included in the neighbor set.
The most-common label across the k closest training points is returned as the predicted output. There are various possible methods for breaking ties, but the one we will use is to iteratively remove the most distant of the k points until the tie is broken. This is guaranteed to eventually return a unique label when the neighborhood gets down to a single point.
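The tie-breaking rule above could be sketched as follows; the function name `majority_label` and the (distance, label) representation are hypothetical, not part of the provided code:

```python
from collections import Counter

# Sketch of the tie-breaking rule: vote among the neighbors, and on a
# tie, drop the most distant point and vote again.
# `neighbors` is a list of (distance, label) pairs for the >= k nearest points.
def majority_label(neighbors):
    # Sort nearest-first so the most distant point is at the end.
    neighbors = sorted(neighbors)
    while neighbors:
        counts = Counter(label for _, label in neighbors)
        top = counts.most_common()
        # Unique winner: no other label ties the highest count.
        if len(top) == 1 or top[0][1] > top[1][1]:
            return top[0][0]
        neighbors.pop()  # remove the most distant point and re-vote
    return None
```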
As mentioned above, you are encouraged to break down these functions, especially predict(). Useful helper functions might include normalize() to convert a test point to the same scale as the training set and neighbors() to find the set of >=k points closest to the test input.
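For normalize(), one common choice (assumed here, not mandated by the lab) is min-max scaling using the training set's per-feature ranges:

```python
# One common normalization: rescale each feature to [0, 1] using the
# training set's per-feature min and max. Names here are hypothetical.
def normalize(point, train_inputs):
    scaled = []
    for j, v in enumerate(point):
        column = [x[j] for x in train_inputs]
        lo, hi = min(column), max(column)
        # Constant features carry no information; map them to 0.
        scaled.append((v - lo) / (hi - lo) if hi > lo else 0.0)
    return scaled
```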
The training data is combined with equiv_samples from a uniform prior to avoid overfitting. An example is provided here.
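One way this smoothing is commonly implemented (a sketch under assumptions, with hypothetical names, not the provided code) is to mix observed counts with equiv_samples pseudo-examples drawn from a uniform prior:

```python
# Smoothed probability estimate: `count` occurrences of a feature value
# out of `total` examples with a given label, mixed with `equiv_samples`
# pseudo-examples spread uniformly over `num_values` possible values.
def smoothed_probability(count, total, num_values, equiv_samples):
    prior = 1.0 / num_values
    return (count + equiv_samples * prior) / (total + equiv_samples)
```

With equiv_samples = 0 this reduces to the raw frequency; larger values pull the estimate toward the uniform prior, which avoids zero probabilities for unseen feature values.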
To classify a test point, the naive probability of each label P(label = l | x1 = v1, x2 = v2, ... ) is computed from these using Bayes rule, and the highest-probability label is returned. Technically Bayes rule also requires a normalization factor, P(x1 = v1, x2 = v2, ... ), but because this same term appears in the denominator for all labels, it can be ignored. The naive Bayes estimate pretends that all features are independent, and therefore estimates label probabilities as the product of each specific feature's probability:
P(label = l | x = v) ∝ P(label = l) * P(x_1 = v_1 | label = l) * P(x_2 = v_2 | label = l) * ...

Some possible ideas for helper functions:
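The decision rule above can be sketched as follows; the names `predict_naive_bayes`, `label_prob`, and `feature_prob` are hypothetical stand-ins for whatever probability tables you build:

```python
# Sketch of the naive Bayes decision rule: for each label, multiply the
# label's prior by each feature's conditional probability, and return
# the label with the highest (unnormalized) score.
# label_prob[l] is P(label = l); feature_prob[l][i][v] is P(x_i = v | label = l).
def predict_naive_bayes(point, label_prob, feature_prob):
    best_label, best_score = None, -1.0
    for l, prior in label_prob.items():
        score = prior
        for i, v in enumerate(point):
            score *= feature_prob[l][i][v]
        if score > best_score:
            best_label, best_score = l, score
    return best_label
```

Note that for data sets with many features, these products can underflow; taking sums of log-probabilities instead is a standard remedy.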
Further, for both of the algorithms you implemented and for the support vector machine that has been provided, there is a free parameter that must be selected. One good way to do this is by n-fold cross validation, which you will implement in the Classifier parent class from which the various classifier implementations inherit. The idea behind cross validation is to try n distinct training set / test set splits. For each split, 1/n of the data is used as the test set and the remainder is used for training. For each parameter value under consideration, we can train and test n times, and select the parameter value with the lowest average error. You will use cross validation to select among the following parameters for your experiments:
The optimal value for these parameters may vary across data sets. Because cross validation requires repeated training and testing of the model, it may be rather slow. If it's taking too long, first try reducing the number of folds (if you are using more than 3). Then you can try running on a random subset of the data, though not as small a subset as you use for debugging.
The cross validation task has already been broken down slightly with the helper function test(), which should return a success rate on a given test set. You are, of course, welcome to decompose the task further.
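The overall loop might look something like this sketch; it assumes a classifier object with train(data) and test(data) methods, where test() returns a success rate as described above, but the exact interface in ClassificationModels.py may differ:

```python
# Sketch of n-fold cross validation: split the data into n folds, hold
# each fold out as the test set in turn, and average the success rates.
def cross_validate(classifier, data, n):
    fold_size = len(data) // n
    total = 0.0
    for i in range(n):
        test_set = data[i * fold_size:(i + 1) * fold_size]
        train_set = data[:i * fold_size] + data[(i + 1) * fold_size:]
        classifier.train(train_set)
        total += classifier.test(test_set)
    return total / n  # average success rate across the n folds
```

In practice you would shuffle the data before splitting, since many data files group examples by class.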
cd /my/lab/directory/
git add ClassificationModels.py experiments.txt
git commit -m "final version"
git push