CS63 Fall 2005
Lab 5: Decision Trees
Due: Friday, October 7 by 11am


INTRODUCTION

For this lab you will implement the ID3 decision tree learning algorithm. You'll first experiment with a simple data set for deciding whether or not to play tennis based on the weather conditions. Once you are confident that your program works correctly, you will modify it to handle missing attribute values, both in the tree-building phase and in the testing phase. Then you'll experiment with a larger data set of 1984 congressional voting records from the U.S. House of Representatives, classifying each representative as a Democrat or a Republican.

You may work with a partner on this assignment.


LAB INSTRUCTIONS

  1. Save a copy of the following data files in your own directory. Notice that each data set is stored as a list of dictionaries: each example is a dictionary whose keys are the attributes. Each example is also given a unique identifier (the 'id' attribute); this attribute should be ignored when constructing a decision tree, but it is helpful for debugging. A sketch of this format appears after these instructions.
  2. Implement the ID3 algorithm. Create two classes to accomplish this: a DecisionTreeNode and a DecisionTree. The node class should maintain the following instance variables: the list of examples associated with the node, the node's depth in the tree, the best attribute for splitting at that node, the attribute and value associated with the branch leading to it, a list of child branches, and, for a leaf, the node's classification. A class skeleton is sketched after these instructions.
  3. Test your program on the tennis data. Your program should produce the decision tree shown below:
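
For reference, ID3 applied to Mitchell's standard 14-example tennis data yields the tree below; if the provided data file differs, your tree may differ as well:

    outlook = sunny:
        humidity = high: no
        humidity = normal: yes
    outlook = overcast: yes
    outlook = rain:
        wind = strong: no
        wind = weak: yes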

  4. Add methods that randomly choose a subset of the examples as a training set and use the remaining examples as a testing set; you should be able to specify the percentage of examples that go into the training set. You'll also need a method that classifies a novel example given a learned tree, and a method that classifies the entire testing set and reports the percent correct (see the sketch after these instructions). Test these features before going on.
  5. Once you are certain that your program works correctly, modify it to handle missing attribute values. In the construction phase, update the infoGain method so that if an attribute has missing values, you randomly replace them with values drawn from the distribution of that attribute's values in the current set of examples. In the classification phase, update your tree-traversal methods so that if a novel example is missing an attribute value, you again randomly fill it in based on the distribution of values stored at that particular node in the tree. A sketch of the replacement step appears after these instructions.
  6. Measure the impact of training set size on the accuracy of the learned tree on the testing set. Use a consistent minimum number of examples per branch, set to 5. Consider training set sizes of 30%, 40%, 50%, 60%, 70%, and 80%. Because random splits introduce high variance, repeat the experiment five times for each training set size and report the average accuracy for each size.
  7. Measure the impact of the minimum number of examples allowed per branch on the accuracy of the learned tree on the testing set. Use the best training set size found in the previous set of experiments. Consider minimums of 1, 5, 10, 15, and 20 examples per branch. Again, repeat the experiment five times for each variation and report the average accuracy. An experiment driver covering steps 6 and 7 is sketched below.
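
For concreteness, here is the kind of structure step 1 describes. This is only a sketch: the attribute names and values below are guesses based on the standard tennis data, so check the actual data files for the real keys.

    # A hypothetical entry from the tennis data set. The real attribute
    # names and values come from the provided data files.
    tennisData = [
        {'id': 'd1', 'outlook': 'sunny', 'temperature': 'hot',
         'humidity': 'high', 'wind': 'weak', 'play': 'no'},
        # ... one dictionary per example ...
    ]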
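
The skeleton below is one way to organize the two classes from step 2. It is a minimal sketch, not a finished implementation: the method and variable names (target, minExamples, buildTree, and so on) are suggestions, and it assumes discrete attribute values with no missing data. Remember to exclude the 'id' attribute (and the class-label key) from the candidate attributes.

    import math

    class DecisionTreeNode:
        """One node in the decision tree."""
        def __init__(self, examples, level):
            self.examples = examples    # examples that reached this node
            self.level = level          # depth of the node in the tree
            self.bestAttribute = None   # attribute used to split here
            self.attribute = None       # attribute tested by the parent
            self.value = None           # value of that attribute on this branch
            self.branches = []          # child DecisionTreeNodes
            self.classification = None  # class label, if this is a leaf

    class DecisionTree:
        """ID3 learner over examples stored as dictionaries."""
        def __init__(self, target, minExamples=1):
            self.target = target            # key holding the class label
            self.minExamples = minExamples  # minimum examples per branch
            self.root = None

        def entropy(self, examples):
            # Entropy of the class-label distribution in examples.
            counts = {}
            for e in examples:
                counts[e[self.target]] = counts.get(e[self.target], 0) + 1
            total = float(len(examples))
            return -sum((c / total) * math.log(c / total, 2)
                        for c in counts.values())

        def infoGain(self, examples, attribute):
            # Expected reduction in entropy from splitting on attribute.
            remainder = 0.0
            for v in set(e[attribute] for e in examples):
                subset = [e for e in examples if e[attribute] == v]
                remainder += (len(subset) / float(len(examples))
                              * self.entropy(subset))
            return self.entropy(examples) - remainder

        def buildTree(self, examples, attributes, level=0):
            node = DecisionTreeNode(examples, level)
            labels = [e[self.target] for e in examples]
            if (len(set(labels)) == 1 or not attributes
                    or len(examples) < self.minExamples):
                # Make a leaf labeled with the majority class.
                node.classification = max(set(labels), key=labels.count)
                return node
            node.bestAttribute = max(attributes,
                                     key=lambda a: self.infoGain(examples, a))
            rest = [a for a in attributes if a != node.bestAttribute]
            for v in set(e[node.bestAttribute] for e in examples):
                subset = [e for e in examples if e[node.bestAttribute] == v]
                child = self.buildTree(subset, rest, level + 1)
                child.attribute, child.value = node.bestAttribute, v
                node.branches.append(child)
            return node

Building a tree for the tennis data would then look something like tree = DecisionTree('play') followed by tree.root = tree.buildTree(data, [a for a in data[0] if a not in ('id', 'play')]).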
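
For step 4, the helpers might look like the sketch below. Again, the names are suggestions, and these are written as standalone functions but could just as well be methods on DecisionTree.

    import random

    def splitData(examples, trainFraction):
        # Randomly partition examples into training and testing sets.
        shuffled = examples[:]
        random.shuffle(shuffled)
        cut = int(len(shuffled) * trainFraction)
        return shuffled[:cut], shuffled[cut:]

    def classify(node, example):
        # Walk from the root down to a leaf, following the branch whose
        # value matches the example at each split.
        while node.classification is None:
            for child in node.branches:
                if example[node.bestAttribute] == child.value:
                    node = child
                    break
            else:
                return None  # value never seen in training; handle as you see fit
        return node.classification

    def accuracy(tree, testSet, target):
        # Fraction of the testing set classified correctly.
        correct = sum(1 for e in testSet
                      if classify(tree.root, e) == e[target])
        return correct / float(len(testSet))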
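
For step 5, one way to do the distribution-based replacement is to sample uniformly from the list of observed values, which automatically reflects their frequencies. The '?' marker below is an assumption; check how the voting data actually encodes missing values. The same idea applies at classification time, sampling instead from the values stored at the node being traversed.

    import random

    def fillMissing(examples, attribute, missing='?'):
        # Replace missing values for attribute by sampling from the
        # distribution of observed values in the current examples.
        observed = [e[attribute] for e in examples if e[attribute] != missing]
        for e in examples:
            if e[attribute] == missing:
                e[attribute] = random.choice(observed)
        return examples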
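
Finally, steps 6 and 7 can share one driver along these lines, built on the sketches above. The data variable and target key ('party' here) are placeholders for whatever the voting data actually uses.

    def averageAccuracy(data, target, trainFraction, minExamples, trials=5):
        # Average test-set accuracy over several random train/test splits.
        total = 0.0
        for i in range(trials):
            train, test = splitData(data, trainFraction)
            tree = DecisionTree(target, minExamples)
            attributes = [a for a in train[0] if a not in ('id', target)]
            tree.root = tree.buildTree(train, attributes)
            total += accuracy(tree, test, target)
        return total / trials

    # Step 6: vary the training fraction with the minimum fixed at 5.
    # data: the list of example dictionaries loaded from the voting data file.
    for frac in [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]:
        print("%d%% train: %.3f" % (frac * 100,
                                    averageAccuracy(data, 'party', frac, 5)))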

HAND IN

Use cs63handin to turn in your decision tree implementation. If you worked as a team, be sure to include both partners' names at the top of the file.