CS66 Project 1: Decision Trees

Due by 11:59 p.m., Tuesday, October 3, 2017 (extended from Monday, October 2, 2017).
Overview

The goals of this week's lab:

You may work with one lab partner for this lab. You may discuss high level ideas with any other group, but examining or sharing code is a violation of the department academic integrity policy.


Getting Started

Projects in this course will be distributed and submitted through Swarthmore GitHub Enterprise. If you have never used the Swarthmore GitHub before, you should follow our GitHub Setup Guide. If you have taken CS31 or CS35 recently, you will be familiar with the process. Find your git repo for this lab assignment in the CPSC66-F17 organization; it is named with the format Project1-id1_id2, where id1 and id2 are replaced by your user id and your partner's.

Clone your git repo with the starting point files into your labs directory:

$ cd cs66/labs
$ git clone [the ssh url to your repo]
Then cd into your Project1-id1_id2 subdirectory. You will have the following files (those in blue require your modification):

Part 1: Implement a Decision Tree

You will implement an ID3-like decision tree in this part of the lab. The main features of your implementation include:

Usage

Your program should take in 1 or 2 command-line arguments in this specific order:

  1. The name of the data set
  2. A parameter setting for the maximum depth of the learned tree (optional)
For example:
python run_dtree.py toy 3
If the user specifies an incorrect number of arguments or invalid values for the parameters, print a usage statement and exit the program. You should error check all parameters to ensure that the files exist and that given parameters are integers greater than 0. E.g.,
$ python run_dtree.py
Error, incorrect number of arguments
Usage: python run_dtree.py dataset [depth]
  dataset - base name of input arff files in the input/ folder
  depth (optional) - maximum depth of tree, set as a positive integer value
Note that the dataset name (e.g., toy) determines 4 file names. You must ensure that the two input files actually exist. Please follow this format exactly; these are easy points for your grade, and they significantly reduce the number of headaches I will face during grading.
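As a sketch of the kind of validation expected (not required code), the following assumes the train and test files are named input/<dataset>_train.arff and input/<dataset>_test.arff; adjust the suffixes to match whatever naming the starter files actually use:

# Sketch of command-line validation. The _train.arff/_test.arff
# suffixes are assumptions, not part of the official requirements.
import os
import sys

def parse_args(argv):
    """Validate arguments and return (train_file, test_file, max_depth)."""
    if len(argv) not in (2, 3):
        print("Error, incorrect number of arguments")
        print("Usage: python run_dtree.py dataset [depth]")
        sys.exit(1)

    dataset = argv[1]
    train_file = os.path.join("input", dataset + "_train.arff")
    test_file = os.path.join("input", dataset + "_test.arff")
    for fname in (train_file, test_file):
        if not os.path.exists(fname):
            print("Error: file %s does not exist" % fname)
            sys.exit(1)

    max_depth = None
    if len(argv) == 3:
        try:
            max_depth = int(argv[2])
        except ValueError:
            max_depth = -1
        if max_depth <= 0:
            print("Error: depth must be a positive integer")
            sys.exit(1)

    return train_file, test_file, max_depth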

Program Style Requirements

Program Inputs

Program Main Behavior

Your program should process inputs, parse and index training and test examples, and induce a decision tree. There are very few restrictions on how you do this, but be sure your tree fits the behavior from class:

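As a rough sketch of one possible structure (not the required interface), an ID3-style learner computes entropy and information gain and recurses on the best feature. The sketch below assumes each example is stored as a (features, label) pair where features is a dict mapping feature name to value; adapt it to however you choose to represent examples.

# Minimal ID3-style sketch. The (features, label) representation is an
# assumption for illustration, not a requirement of the lab.
import math
from collections import Counter

def entropy(examples):
    """Entropy of the label distribution over a list of examples."""
    counts = Counter(label for _, label in examples)
    total = float(len(examples))
    return -sum((c / total) * math.log(c / total, 2) for c in counts.values())

def information_gain(examples, feature):
    """Entropy reduction from splitting the examples on feature."""
    remainder = 0.0
    values = set(ex[0][feature] for ex in examples)
    for v in values:
        subset = [ex for ex in examples if ex[0][feature] == v]
        remainder += (len(subset) / float(len(examples))) * entropy(subset)
    return entropy(examples) - remainder

def id3(examples, features, depth, max_depth):
    labels = [label for _, label in examples]
    # Base cases: pure node, no features left, or depth limit reached.
    if len(set(labels)) == 1 or not features or \
            (max_depth is not None and depth >= max_depth):
        return Counter(labels).most_common(1)[0][0]  # plurality-label leaf
    best = max(features, key=lambda f: information_gain(examples, f))
    tree = {"feature": best, "children": {}}
    for v in set(ex[0][best] for ex in examples):
        subset = [ex for ex in examples if ex[0][best] == v]
        remaining = [f for f in features if f != best]
        tree["children"][v] = id3(subset, remaining, depth + 1, max_depth)
    return tree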
Program output

When run, your program should output the following:

Example runs

Heart disease data set:

Diabetes data set (discrete features also converted to binary, no depth limit):


Part 2: Analysis

Once your lab is complete, generate two learning curve graphs (i.e., line graphs) with an accompanying analysis of the graphs. On the x-axis, show the depth of a tree. On the y-axis, show the accuracy of the tree on both the training set and test set. Run experiments at various depth levels including, at a minimum, depths of 1, 2, 4, 7, and 10. You should have one graph for the diabetes data set and one graph for the heart disease data set.

Describe what you can conclude from each graph. What happens to the training set accuracy and test set accuracy as the depth varies? Connect these results to class discussion. Do they agree or disagree with our discussion? Why or why not? Place your graphs and analysis in a PDF file called analysis.pdf. Be sure that your graphs are of scientific quality - including captions, axis labels, distinguishable curves, and legible font sizes.
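One way to generate each graph is a small matplotlib script. In the sketch below, run_experiment is a hypothetical helper (not provided in the starter code) that trains a tree at the given depth and returns (training accuracy, test accuracy):

# Sketch for producing one learning-curve graph with matplotlib.
import matplotlib.pyplot as plt

def plot_learning_curve(dataset, depths=(1, 2, 4, 7, 10)):
    train_accs, test_accs = [], []
    for d in depths:
        # run_experiment is a hypothetical helper you would write.
        train_acc, test_acc = run_experiment(dataset, d)
        train_accs.append(train_acc)
        test_accs.append(test_acc)
    plt.plot(depths, train_accs, marker="o", label="training accuracy")
    plt.plot(depths, test_accs, marker="s", label="test accuracy")
    plt.xlabel("Maximum tree depth")
    plt.ylabel("Accuracy")
    plt.title("Decision tree learning curve: %s" % dataset)
    plt.legend()
    plt.savefig("%s_depth_curve.pdf" % dataset)
    plt.close()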


Extensions (Optional)

These extensions are optional if you are interested in going beyond the requirements. Meeting the minimum requirements typically earns an A on labs - 9.5 out of 10. I reserve perfect scores for labs that are exceptional and go beyond the minimum. Doing the extensions does not make up for missed requirements above, and they are a lot of work for the relative amount of gain. But they can be rewarding and can aid understanding. Be sure your extensions are a net plus on the requirements above - I should be able to match expected behavior on the basic version of the lab above.

Multi-class prediction

Both data sets I have provided have two class labels. Decision trees easily extend to multi-class prediction. Find a relevant data set (e.g., at Kaggle or the UC Irvine Machine Learning Repository) and evaluate your algorithm with multiple nominal values allowed for labels.

Min Leaf Size

Another approach to preventing overfitting is setting a minimum number of examples in a leaf. If a split results in children below the threshold, the recursion is stopped at the current node and a leaf is created with the plurality label. This is often more flexible than maximum depth - it allows variable depth branches while still trying to prevent overly detailed hypotheses that fit only a few examples. Add minimum leaf size as an optional command-line argument.
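A sketch of the stopping check, using the same hypothetical (features, label) example representation assumed in the ID3 sketch above:

# Sketch of a minimum-leaf-size check used inside the recursive split.
def split_is_too_small(examples, feature, min_leaf_size):
    """Return True if splitting on feature would create a child with
    fewer than min_leaf_size examples."""
    values = set(ex[0][feature] for ex in examples)
    for v in values:
        child = [ex for ex in examples if ex[0][feature] == v]
        if len(child) < min_leaf_size:
            return True
    return False

# In the recursion: if split_is_too_small(examples, best, min_leaf_size),
# stop and return the plurality label instead of recursing.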

Learning curves for training set size

Machine learning algorithms are often sensitive to the amount of training data - some algorithms thrive when there are large amounts of data but struggle when data is sparse (e.g., deep neural networks), while others plateau in performance even if more data is available. Evaluate your decision tree on the given training data by randomly subsampling the data to create smaller training sets, e.g., using 10% of the training data. Choose at least 5 training set sizes and plot them on the x-axis with the y-axis describing the accuracy. You'll probably want to run each training set size 3-5 times and average the results. Describe what you see when you compare training and test accuracy as the training set grows.
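A sketch of the subsampling step; evaluate_tree is a hypothetical helper that trains on the subsample and returns test accuracy:

# Sketch of random subsampling for training-set-size learning curves.
import random

def subsample(train_examples, fraction):
    """Return a random subset containing the given fraction of examples."""
    k = max(1, int(fraction * len(train_examples)))
    return random.sample(train_examples, k)

# Example experiment loop, averaging over several random trials:
# for fraction in (0.1, 0.25, 0.5, 0.75, 1.0):
#     accs = [evaluate_tree(subsample(train_examples, fraction), test_examples)
#             for _ in range(5)]
#     print(fraction, sum(accs) / len(accs))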
Submitting your work
For the programming portion, be sure to commit your work often to avoid losing work. Only your final pushed solution will be graded. Only files in the main directory will be graded. Please double check all requirements; common errors include:

Acknowledgements

Both data sets were made available by the UC Irvine Machine Learning Repository. The heart disease data set is the result of a Cleveland Clinic study of coronary heart disease. The patients were identified as having or not having heart disease, and the resulting feature set is a subset of the original 76 factors that were studied. The diabetes data set is the result of a study of females at least 21 years old of Pima Indian heritage to understand risk factors of diabetes. Both data sets were converted to ARFF format by Mark Craven at the University of Wisconsin.