CS66 Project 2: Probabilistic Models

Due by 11:59 p.m., Monday, October 30, 2017
Overview

The goals of this week's lab:

You may work with one lab partner for this lab. You may discuss high-level ideas with any other group, but examining or sharing code is a violation of the department academic integrity policy. Note that there is a checkpoint requirement for this lab; I will check that you have completed the Naive Bayes portion by lab on Thursday, October 26.


Getting Started

Projects in this course will be distributed and submitted through Swarthmore GitHub Enterprise. If you have never used the Swarthmore GitHub before, you should follow our GitHub Setup Guide. If you have taken CS31 or CS35 recently, you will be familiar with the process. Find your git repo for this lab assignment, with a name of the format Project2-id1_id2 (id1 and id2 being the user IDs of you and your partner), in the CPSC66-F17 organization.

Clone your git repo with the starting point files into your labs directory:

$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your Project2-id1_id2 subdirectory. You will have the following files (those in blue require your modification):

Common Implementation Details

Usage

Your programs should take in 1 or 2 command-line arguments, as in the previous lab. The first argument for both programs will be the data set. For logistic regression, you will also take in the learning rate. For example, to run naive Bayes on the tennis example:

python run_NB.py tennis
If the user specifies an incorrect number of arguments, or invalid values for the parameters, print a usage statement and exit the program. You should error check all parameters: ensure that the input files exist and that the learning rate is greater than 0. E.g.,
$ python run_LR.py
Error, incorrect number of arguments
Usage: python run_LR.py dataset [eta]
  dataset - base name of input arff files in the input/ folder
  eta - learning rate during gradient descent
Note that the dataset name (e.g., tennis) determines the names of the input files (the train and test arff files in the input/ folder). You must ensure that the two input files actually exist. Please follow this format exactly; these are easy points for your grade and they significantly reduce the number of headaches I will face during grading.
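For reference, a minimal sketch of this argument checking is below. The file-name pattern (input/<dataset>_train.arff and input/<dataset>_test.arff) and the default eta are assumptions for illustration; match whatever your starter files actually use.

import os
import sys

def parse_args():
    # Validate command-line arguments for run_LR.py; print usage and exit on error.
    if len(sys.argv) < 2 or len(sys.argv) > 3:
        print("Error, incorrect number of arguments")
        print("Usage: python run_LR.py dataset [eta]")
        print("  dataset - base name of input arff files in the input/ folder")
        print("  eta - learning rate during gradient descent")
        sys.exit(1)

    dataset = sys.argv[1]
    # Assumed file-name pattern -- check your starter code for the exact names.
    train_file = os.path.join("input", dataset + "_train.arff")
    test_file = os.path.join("input", dataset + "_test.arff")
    for fname in (train_file, test_file):
        if not os.path.exists(fname):
            print("Error, file does not exist: %s" % fname)
            sys.exit(1)

    eta = 0.1  # hypothetical default; your program may instead require eta explicitly
    if len(sys.argv) == 3:
        try:
            eta = float(sys.argv[2])
        except ValueError:
            print("Error, eta must be a number")
            sys.exit(1)
        if eta <= 0:
            print("Error, eta must be greater than 0")
            sys.exit(1)
    return train_file, test_file, eta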

Program Inputs

To simplify preprocessing, you may assume the following:

For Both Classifiers

Naive Bayes Implementation
You will implement a multinomial (multi-valued features) Naive Bayes classifier as discussed in class. Your model will be capable of making multi-class or binary predictions.

Model

All probabilities should be modeled using log probabilities to avoid underflow. Note that when taking the log of a formula, all multiplications become additions and all divisions become subtractions. That is, our Naive Bayes model for a class label y_j:

P(Y = y_j | X) proportional to P(Y = y_j) * prod_i P(X_i = x_i | Y = y_j)

becomes:

log P(Y = y_j | X) proportional to log P(Y = y_j) + sum_i log P(X_i = x_i | Y = y_j)

Note that we estimate the probabilities using LaPlace counts (thetas), so our final model for making predictions is:

y^ = argmax over y_j of [ log theta_{y_j} + sum_i log theta_{X_i = x_i | y_j} ]
Training

You will need to represent model parameters (thetas) for each class probability P(Y = y_j) and each class conditional probability P(X_i = v | Y = y_j). This will require you to count the frequency of each event in your training data set. Use LaPlace estimates to avoid 0 counts. If you think about the formula carefully, you will note that we can implement our training phase using only integers to keep track of counts (and no division!). Each theta is just a normalized frequency: we count how many examples have X_i = v and label y_j, then divide by the total number of examples with label y_j (plus the LaPlace correction):

theta_{X_i = v | y_j} = ( N(X_i = v, Y = y_j) + 1 ) / ( N(Y = y_j) + |X_i| )

where N(.) counts matching training examples and |X_i| is the number of possible values of feature X_i (the class priors theta_{y_j} are estimated the same way from the label counts). Taking the log, each theta becomes the difference of two log counts:

log theta_{X_i = v | y_j} = log( N(X_i = v, Y = y_j) + 1 ) - log( N(Y = y_j) + |X_i| )
The important point is that it is completely possible to train your Naive Bayes with only integers (the counts, N). You don't need to calculate log scores until prediction time. This will make debugging (and training) much simpler.
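As an illustration of this counting phase (not a required design), plain dictionaries are enough; the function and variable names here are hypothetical:

from collections import defaultdict

def count_events(examples, labels):
    # Tally the integer counts needed for the LaPlace estimates.
    # examples: list of feature-value lists; labels: list of class labels.
    # feature_counts[(i, v, y)] = number of examples with X_i = v and label y.
    label_counts = defaultdict(int)      # N(Y = y_j)
    feature_counts = defaultdict(int)    # N(X_i = v, Y = y_j)
    for x, y in zip(examples, labels):
        label_counts[y] += 1
        for i, v in enumerate(x):
            feature_counts[(i, v, y)] += 1
    return label_counts, feature_counts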

Inference/Prediction

For this phase, simply calculate the probability of each class (using the model equations above) and choose the maximum probability (recall that the maximum probability is also the maximum log odds):

y^ = argmax over y_j of [ log theta_{y_j} + sum_i log theta_{X_i = x_i | y_j} ]
To convert your log odds to a probability value, exponentiate the value using math.exp(score) where score is your log odds value.
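A rough sketch of this step, assuming the count dictionaries from the training sketch above plus num_values[i] (the number of declared values for feature X_i) and n_total (the number of training examples), all of which are assumed names:

import math

def predict(x, label_counts, feature_counts, num_values, n_total):
    # Return (best_label, best_log_score) for one test example x.
    num_labels = len(label_counts)
    best_label, best_score = None, float("-inf")
    for y, n_y in label_counts.items():
        # log theta for the class prior, with LaPlace smoothing
        score = math.log(n_y + 1) - math.log(n_total + num_labels)
        for i, v in enumerate(x):
            # log theta for P(X_i = v | Y = y), with LaPlace smoothing
            n_iv = feature_counts.get((i, v, y), 0)
            score += math.log(n_iv + 1) - math.log(n_y + num_values[i])
        if score > best_score:
            best_label, best_score = y, score
    return best_label, best_score

math.exp(best_score) then gives the unnormalized probability; dividing it by the sum of the exponentiated scores over all labels gives the normalized probability shown in the debugging output.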

Example Output

Note that I have included Debugging Output below to help diagnose potential problems. It includes frequency counts as well as predictions. The predictions include the predicted label (y^), the raw score of that label, and the normalized probability (the raw score divided by the sum of the scores for all possible labels). You do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
Logistic Regression Implementation
You will implement the logistic regression task discussed in class, with the extension to multi-valued features and multi-class prediction.

Model

In all situations, you should use binary variables for Y and X: "negative" should be represented as a 0 and "positive" as a 1. In the case of binary labels, the probability of a positive prediction is:

P(Y = 1 | x) = 1 / (1 + exp(-wTx))
If the user-provided task is binary, you should treat the second entry in the arff file as the positive class (e.g., {No, Yes} would mean Yes is the positive label). During prediction, return the positive class (e.g., Yes) if the probability from the above equation is greater than 0.5; otherwise return the negative class (e.g., No).

For multinomial regression, you will build a 1-vs-all classifier for each of the K classes. For example, in the zoo dataset, where the possible labels are 1,2,3,...,7, you will build a classifier where 1 is positive and all other classes are negative; a separate classifier where 2 is positive and all other classes are negative (i.e., 0); and so on, for 7 classifiers in total.

In addition, you will need to process the input files to handle non-binary feature inputs. If a feature is binary, use the same rule as above, where the first possible value is treated as a 0 and the second value as a 1. For multinomial features, we will need a weight for each possible value of a feature, X_i = v. You should binarize these features by removing the original feature and adding a new column for each feature/value pair. For example, if you have a feature color={blue,red,green}, you will add binary features x_color=blue, x_color=red, x_color=green (each with its own weight). For an example where color=green, the corresponding feature values for that example would be x_color=blue=0, x_color=green=1, x_color=red=0.

To model an intercept term (if data is not centered), we will include a bias term. In practice, this is equivalent to adding a feature that is always "on". For each instance in training and testing, append a 1 to the feature vector. Ensure that your logistic regression model has a weight for this feature that it can now learn in the same way it learns all other weights.
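One possible way to binarize an example and append the bias term is sketched below; feature_values (the declared values for each feature, in arff order) is an assumed input from your parser:

def binarize(example, feature_values):
    # Convert one example (a list of raw feature values) into a 0/1 vector.
    # Binary features contribute one column (0 for the first declared value,
    # 1 for the second); multi-valued features contribute one column per value.
    row = []
    for i, v in enumerate(example):
        values = feature_values[i]
        if len(values) == 2:
            row.append(1 if v == values[1] else 0)
        else:
            row.extend(1 if v == value else 0 for value in values)
    row.append(1)  # bias term, always "on"
    return row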

Training

To learn the weights, you will apply gradient ascent as discussed in class until the weights do not change in value for a given iteration.
while not converged:
    initialize gradient to 0 for each feature: g = [0,...,0]
    for each training example xi:
        p = P(Y=1|xi)        # predicted probability of positive under the current model
        error = yi - p       # error in the prediction
        for each feature j:
            g[j] += error*xi[j]
    w += eta*g               # update the weights once, as the last step of the iteration
The hyperparameter eta (learning rate) should be sent as a parameter to the constructor and used in training. You should only update weights once per iteration (i.e., updating the weights is the last step of the iteration). A few notes for the above:
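For reference, a minimal runnable version of the loop above is sketched here. It assumes X is a list of binarized feature vectors (bias included), y is a list of 0/1 labels, and convergence is declared when no weight changes by more than a small tolerance; the tolerance is an assumption, so use whatever convergence test is specified.

import math

def train(X, y, eta, tol=1e-6):
    # Batch gradient ascent for logistic regression; returns the learned weights.
    n_features = len(X[0])
    w = [0.0] * n_features
    converged = False
    while not converged:
        g = [0.0] * n_features                     # gradient, one entry per feature
        for xi, yi in zip(X, y):
            z = sum(wj * xj for wj, xj in zip(w, xi))
            p = 1.0 / (1.0 + math.exp(-z))         # P(Y=1|xi) under the current model
            error = yi - p
            for j in range(n_features):
                g[j] += error * xi[j]
        new_w = [wj + eta * gj for wj, gj in zip(w, g)]
        converged = all(abs(nw - ow) <= tol for nw, ow in zip(new_w, w))
        w = new_w
    return w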

Prediction

For binary prediction, apply the equation:

P(Y = 1 | x) = 1 / (1 + exp(-wTx))

and output the more likely binary outcome. Use the Python math.exp function to make the calculation in the denominator. Note that wTx is the dot product of the weight vector w and example x.

For multi-class prediction, you will build K 1-vs-all models. For a test example, you will run it through each of the K models. The predicted label is that of the model with the highest P(Y=1|x_test) for the given test example.
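A short sketch of this step, assuming models maps each class label to its learned weight vector (the names are hypothetical):

import math

def sigmoid_prob(w, x):
    # P(Y=1|x) = 1 / (1 + exp(-wTx))
    z = sum(wj * xj for wj, xj in zip(w, x))
    return 1.0 / (1.0 + math.exp(-z))

def predict_multiclass(models, x):
    # Return the label whose 1-vs-all model gives the highest P(Y=1|x).
    return max(models, key=lambda label: sigmoid_prob(models[label], x))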

Example Output

I have included Debugging Output below to help diagnose potential problems. It includes the change in gradients (delta) per round, the learned weights for the first two iterations as well as the final learned weights, and the L1-norm of the weight vector at each iteration. You do not need to calculate all of these values; they are for reference. You should not have this level of verbosity in your final program.
Tips and Hints
Extensions (Optional)

These extensions are optional, for those interested in going beyond the requirements. Doing the extensions does not make up for missed requirements above, and they are a lot of work relative to the amount of gain, but they can be rewarding and can aid understanding. Be sure your extensions are a net plus on top of the requirements above: I should still be able to match the expected behavior on the basic version of the lab.

Continuous Features

Extend both models to allow continuous features. For Naive Bayes, a common option is to estimate a Gaussian distribution for each continuous feature. That is, P(X_i | Y = y_j) will require an estimate of the mean and standard deviation. Both can be estimated in a similar fashion to the counts above: over all of the examples where Y = y_j, find the mean and standard deviation of X_i. For Logistic Regression, you do not need to make many modifications; you only need to allow non-binary features into the model. That is, in gradient descent, multiplying by X_i will no longer be limited to 0 and 1. I would recommend standardizing or normalizing continuous features. Use the titanic data set from HW1, or any data set from the UCI repository that contains continuous features.
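If you attempt this, one common form for the per-feature term is the Gaussian log-likelihood below (an assumed parameterization; you may prefer a different variance estimator):

import math

def gaussian_log_likelihood(x, mean, std):
    # log N(x; mean, std), used in place of log theta for a continuous feature.
    var = std ** 2
    return -0.5 * math.log(2 * math.pi * var) - ((x - mean) ** 2) / (2 * var)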

ROC Curves

This only applies to binary classification tasks. Use the probabilistic predictions of both models to generate ROC Curves.
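One way to trace out the curve from probabilistic predictions is sketched below; scores is an assumed list of P(Y=1|x) values paired with the true 0/1 labels, and both classes are assumed to appear in the test set:

def roc_points(scores, labels):
    # Return (false positive rate, true positive rate) pairs by sweeping a
    # threshold from high to low over the predicted probabilities.
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), reverse=True)
    tp = fp = 0
    points = [(0.0, 0.0)]
    for score, label in ranked:
        if label == 1:
            tp += 1
        else:
            fp += 1
        points.append((fp / float(neg), tp / float(pos)))
    return points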
Submitting your work
For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded, and only files in the main directory will be graded. Please double check all requirements; common errors include:

Also, be sure to meet the style requirements:

Program Style Requirements