CS66 Project 3: SVMs and Ensembles

Due by 11:59 p.m., Wednesday, November 15, 2017
Overview

The goals of this week's lab:

You may work with one lab partner for this lab. You may discuss high-level ideas with any other group, but examining or sharing code (or results/analysis) is a violation of the department's academic integrity policy.

Getting Started

Projects in this course will be distributed and submitted through Swarthmore GitHub Enterprise. Find your git repo for this lab assignment in the CPSC66-F17 organization; it is named in the format Project3-id1_id2 (with id1 and id2 replaced by your user ID and your partner's).

Clone your git repo with the starting point files into your labs directory:

$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your Project3-id1_id2 subdirectory. You will have the following files (those in blue require your modification):

Datasets

Rather than parsing and processing data sets yourself, you will use scikit-learn's pre-defined data sets. Details can be found here. At a minimum, your experiments will require the MNIST and 20 Newsgroups datasets. Both are multi-class tasks (10 and 20 classes, respectively). Note that both of these are large and take time to run, so I recommend developing with the Wisconsin Breast Cancer dataset:

import sys
from sklearn.datasets import load_breast_cancer

if sys.argv[1] == "cancer":
    data = load_breast_cancer()

X = data['data']
y = data['target']
print(X.shape)
print(y.shape)
which outputs 569 examples with 30 features each:
(569, 30)
(569,)
The MNIST dataset is very large and takes a lot of time to run, so you can randomly subselect 1000 examples; you should also normalize the pixel values between 0 and 1 (instead of 0 and 255):
from sklearn.datasets import fetch_mldata
from sklearn import utils

data = fetch_mldata('MNIST Original', data_home="~soni/public/cs66/sklearn-data/")
X = data['data']
y = data['target']
X, y = utils.shuffle(X, y)  # shuffle the rows
X = X[:1000]                # only keep 1000 training examples
y = y[:1000]
X = X/255.0                 # normalize the feature values to [0, 1]
The 20 Newsgroups dataset in vector form (i.e., bag of words) is obtained using:
data = fetch_20newsgroups_vectorized(subset='all', data_home="~soni/public/cs66/sklearn-data/")
No normalization is required; I suggest randomly sampling 1000 examples for this dataset as well. The data object also contains headers and target information, which you should examine (e.g., in a Jupyter notebook) to understand the data. For your analysis, it may be helpful to know the number of features, their types, and what classes are being predicted.
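For example, here is one quick way to inspect the newsgroups data object; the fields accessed below are the standard scikit-learn Bunch attributes, so verify them in your own session:

from sklearn.datasets import fetch_20newsgroups_vectorized

data = fetch_20newsgroups_vectorized(subset='all',
                                     data_home="~soni/public/cs66/sklearn-data/")
print(data['data'].shape)     # (number of examples, number of bag-of-words features)
print(data['target'].shape)   # one class label per example
print(data['target_names'])   # the 20 newsgroup class names being predicted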

Coding requirements

The coding portion is flexible; the goal is to be able to run the experiments below. However, you should keep these requirements in mind:

Experiment 1: SVM vs RandomForest Generalization Error

Using run_pipeline.py, you will run both SVMs and Random Forests and compare which does better in terms of estimated generalization error.

Coding Details

Your program should read in the dataset (MNIST or 20 Newsgroups, at a minimum) specified on the command line, as discussed above, e.g.:
$ python run_pipeline.py mnist
$ python run_pipeline.py news
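For reference, here is a minimal sketch of how the dataset selection in run_pipeline.py might look; the loadData helper and its structure are illustrative, not part of the starter code:

import sys

from sklearn import utils
from sklearn.datasets import fetch_mldata, fetch_20newsgroups_vectorized

def loadData(name):
    """Load the dataset named on the command line, shuffle, and keep 1000 examples."""
    data_home = "~soni/public/cs66/sklearn-data/"
    if name == "mnist":
        data = fetch_mldata('MNIST Original', data_home=data_home)
        X, y = data['data'] / 255.0, data['target']   # normalize pixels to [0, 1]
    elif name == "news":
        data = fetch_20newsgroups_vectorized(subset='all', data_home=data_home)
        X, y = data['data'], data['target']           # already normalized
    else:
        raise ValueError("unknown dataset: %s" % name)
    X, y = utils.shuffle(X, y)   # shuffle before subsampling
    return X[:1000], y[:1000]

if __name__ == "__main__":
    X, y = loadData(sys.argv[1])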
You should specify your parameters and classifier and call runTuneTest (see the above example), which follows this sequence of steps (a minimal sketch appears after the list):
  1. Divide the data into training/test splits using StratifiedKFold. Follow this example to create a for-loop for each fold. Set the parameters to shuffle the data and use 5 folds (cv). Set random_state to a fixed integer value (e.g., 42) so the folds are consistent for both algorithms.
  2. For each fold, tune the hyperparameters using GridSearchCV, which is a wrapper for your base learning algorithms; it automates the search over multiple hyperparameters. Use the default value of 3-fold tuning.
  3. After creating a GridSearchCV classifier, fit it using your training data.
  4. Get the test-set accuracy by calling the GridSearchCV score method on the fold's test data.
  5. Return a list of accuracy scores for each fold.
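Putting these steps together, here is a minimal sketch of runTuneTest. It assumes the newer sklearn.model_selection API (see the version note below; on 0.17.1 the imports and the StratifiedKFold constructor differ slightly), and the exact output formatting is up to you:

from sklearn.model_selection import StratifiedKFold, GridSearchCV

def runTuneTest(learner, params, X, y):
    """Run 5-fold CV; tune hyperparameters within each fold, then score on that fold's test set."""
    accuracies = []
    skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
    for fold, (train_index, test_index) in enumerate(skf.split(X, y), 1):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # GridSearchCV wraps the base learner and searches over the grid,
        # using its default 3-fold cross-validation on the training data.
        clf = GridSearchCV(learner, params)
        clf.fit(X_train, y_train)
        print("Fold %d:" % fold)
        print("Best parameters:", clf.best_params_)
        print("Tuning Set Score: %.3f" % clf.best_score_)  # best mean score on the tuning folds

        # Test-set accuracy for this fold, using the tuned model.
        accuracies.append(clf.score(X_test, y_test))
    return accuracies

For instance, the KNN example below could be produced with a call like runTuneTest(KNeighborsClassifier(), {'n_neighbors': [1, 5], 'weights': ['uniform', 'distance']}, X, y), where that grid is only illustrative; substitute the SVM and Random Forest classifiers and the hyperparameter grids specified for this assignment.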
In main(), you should print the paired test accuracies for all 5 folds for both classifiers. Your classifiers/hyperparameters are defined as follows:

Code incrementally, and be sure to examine the results of your tuning (what were the best hyperparameter settings? what were the scores across each parameter?) to ensure your pipeline is correct. Since the analysis below depends on your results, I cannot provide sample output for this task. However, this is what is generated if I change my classifier to K-Nearest Neighbors using the parameters listed in the previous section (you can try to replicate this using a random_state of 42):
$ python run_pipeline.py mnist
RUNNING 5-Fold CV on KNN
------------------------
Fold 1:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.827

Fold 2:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.817

Fold 3:
('Best parameters:', "{'n_neighbors': 1, 'weights': 'uniform'}")
Tuning Set Score: 0.837

Fold 4:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.846

Fold 5:
('Best parameters:', "{'n_neighbors': 5, 'weights': 'distance'}")
Tuning Set Score: 0.840

Fold, Test Accuracy
0, 0.892
1, 0.916
2, 0.861
3, 0.812
4, 0.862
Note that StratifiedKFold changed between version 0.17.1 (the one on our systems) and the current version, so you'll get different results if you are developing on your own computer. They should be in the same ballpark in terms of accuracy.

Analysis

In part 1 of your writeup, you will analyze your results. At a minimum, your submission should include the following type of analysis:

Your analysis should be written as if it were the results section of a scientific report/paper.

Experiment 2: Learning Curves

Using generateCurves.py, you will generate the data for learning curves for the above two classifiers. Since we are not interested in generalization accuracy here, we will generate the curves using one round of train/tune splits.

Coding Requirements

Here is the result if I run KNeighborsClassifier with all odd k values from 1 to 21:
$ python generateCurves.py mnist
Neighbors, Train Accuracy, Test Accuracy
1, 1.000, 0.837
3, 0.930, 0.833
5, 0.910, 0.838
7, 0.896, 0.834
9, 0.880, 0.817
11, 0.866, 0.823
13, 0.853, 0.819
15, 0.842, 0.811
17, 0.833, 0.809
19, 0.824, 0.800
21, 0.816, 0.795
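For reference, here is a minimal sketch of how such a table could be produced for KNN, assuming a single stratified 80/20 train/tune split (the split proportion and the function name are illustrative, not requirements):

from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

def knnCurve(X, y):
    """Print train/tune accuracy for each odd k from 1 to 21 on a single split."""
    X_train, X_tune, y_train, y_tune = train_test_split(
        X, y, test_size=0.2, stratify=y, random_state=42)

    print("Neighbors, Train Accuracy, Test Accuracy")
    for k in range(1, 22, 2):
        clf = KNeighborsClassifier(n_neighbors=k)
        clf.fit(X_train, y_train)
        print("%d, %.3f, %.3f" % (k,
                                  clf.score(X_train, y_train),
                                  clf.score(X_tune, y_tune)))

For your actual curves, substitute the two classifiers from Experiment 1 and sweep whichever hyperparameter the coding requirements specify.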

Analysis

Analyze your results for Experiment 2. At a minimum, you should have:

Submitting your work

For the programming portion, be sure to commit your work often to prevent lost data. Only your final pushed solution will be graded, and only files in the main directory will be graded. Please double-check all requirements; common errors include:

Program Style Requirements