Week 2 - Lab Exercise

This exercise is inspired by Aurelien Geron's Sample Code

Goals

  • Become familiar with Python libraries that are useful for developing machine learning solutions
    • numpy - a powerful library for scientific computing, particularly for handling N-dimensional arrays and performing linear algebra operations. Most of your data will be formatted using numpy.
    • matplotlib - adds MATLAB-like capabilities to Python, including visualization/plotting of data and images. Useful for inspecting data sets and visualizing results.
    • sklearn - a very popular machine learning toolkit for Python with implementations of almost all common machine learning algorithms and extensions
  • Implement decision trees in scikit-learn
  • Visualize the decision surface and performance of learned models

Jupyter Notebook

We will use a Jupyter notebook to progressively implement this exercise and view the code running, all within your browser window. If this does not work, you can instead use the Python interpreter:

python -i

and type (or copy-paste) the code below line by line. You can also download the standalone Python file and analyze the code line by line (this is not recommended).

From Prof. Meeden's Jupyter explanation:

A Jupyter notebook is a document that can contain both executable code and explanatory text. It is a great tool for documenting and illustrating scientific results. In fact, this lab writeup is a notebook that was saved as an html file. A notebook is made up of a sequence of cells. Each cell can be of a different type. The most common types are Code and Markdown. We will be using Code cells that are running Python 3, though there are many other possibilities. Markdown cells allow you to format rich text. In your terminal window, where you are already in your cs66 lab directory, type the following to start a notebook:

jupyter notebook

A web browser window will appear with the Jupyter home screen. On the right-hand side, go to the New drop-down menu and select Python3. A blank notebook will appear with a single cell. Let's try writing and executing a simple hello program. One of the main differences between Python 2 and Python 3 is that print is a function in Python 3. When you are ready to execute the cell, press Shift-Enter.

In [ ]:
def hello(name):
    print("Hi", name)
    
hello("Chris")
hello("Samantha")

You should see the output of the code after you execute it:

Hi Chris
Hi Samantha

To name your notebook, double click on the default name, Untitled, at the top. Let's call it FirstNotebook. To save a notebook, click on the disk symbol on the left-hand side of the tool bar. You can also use the File menu and choose Save and Checkpoint. Explore the other menu options in the notebook. Figure out how to insert and delete cells, which are common commands you'll need to know.

To exit a notebook, save it first, then from the File menu choose Close and Halt. In the terminal window where you started the notebook, you'll also need to type CTRL+C to shut down the kernel. Do an ls to list the contents of your directory. You'll see that you now have a file called FirstNotebook.ipynb.

Setting up Lab Exercise

If you haven't already done so, follow the instructions above to start your Jupyter Notebook server. Create a new Python 3 notebook and name it as you see fit, e.g., DTreeTutorial.

Enter the following code into your first cell, paying close attention to the comments to understand what each line is doing. This will set up compatibility and the needed libraries. When you are done entering it in your notebook, hit Shift-Enter to execute.

In [ ]:
# To support both python 2 and python 3
from __future__ import division, print_function, unicode_literals

# Common imports
import numpy as np
import os

# to make this notebook's output stable across runs
np.random.seed(42)

# To plot pretty figures
# If using the Python interpreter, omit the following line.  It only applies to the Jupyter environment
%matplotlib inline 
import matplotlib
import matplotlib.pyplot as plt
plt.rcParams['axes.labelsize'] = 14
plt.rcParams['xtick.labelsize'] = 12
plt.rcParams['ytick.labelsize'] = 12

def save_fig(fig_id, tight_layout=True):
    print("Saving figure", fig_id)
    if tight_layout:
        plt.tight_layout()
    plt.savefig(fig_id + ".png", format='png', dpi=300)

Understanding

  1. What does plt.rcParams refer to? Read here. What other options does it give us? (A short sketch of some other options appears after this list.)
  2. How do we save plot images to disk? What are the semantics of this function call?
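
For question 1, here is a small sketch of a couple of other rcParams settings you might experiment with (these keys are standard matplotlib settings; the full list is in the matplotlib documentation). The last two lines also illustrate what save_fig above wraps: tight_layout() trims excess whitespace and savefig() writes the current figure to disk.

In [ ]:
plt.rcParams['figure.figsize'] = (8, 4)   # default width/height (in inches) for new figures
plt.rcParams['font.size'] = 12            # default font size for text elements

plt.plot([0, 1, 2], [0, 1, 4])            # a throwaway plot, just so there is something to save
plt.tight_layout()                        # trim extra whitespace around the axes
plt.savefig("rcparams_demo.png", format='png', dpi=300)   # the same call that save_fig() makes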

Training A Decision Tree

Let's get to it. We'll use the pre-loaded Iris data set that we referred to in class and build a Decision Tree Classifier.

In [ ]:
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]
y = iris.target

tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

Understanding

  1. Read details about the Iris data set here and here. What features are in the data set? What is the target (y) variable and what possible values does it take on?
  2. Which features are we keeping for our training set (X)? How would you change the code to choose different features? When you finish this tutorial, come back and modify this code to see how it influences the results.
  3. Read an explanation of Decision Trees and then look at the documentation for the function to see what parameters exist. What tips do they provide for practical use?
  4. What version of decision trees is implemented by scikit-learn? How does it relate to those discussed in class?
  5. Note that information gain is not the default selection criterion. How do we change that (experiment with this later)? How does the default compare to information gain? Here is a post with some useful information (don't be too fixated on this choice - it doesn't usually make a difference).
  6. max_depth is set to 2. How does this relate to our discussion about stopping criteria and overfitting? How do we remove the max_depth limit? Can you find other parameters to prevent overfitting? Again, I recommend coming back to this later and playing around with different options in the code; a sketch of some variations to try appears after this list.
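
Not required for the lab, but here is a sketch of the kinds of variations questions 5 and 6 point at (it assumes the iris, X, and y objects defined above): switching the split criterion to entropy (information gain), and bounding the leaf size instead of the depth to limit overfitting.

In [ ]:
print(iris.feature_names)   # all four measured features
print(iris.target_names)    # the three species labels

# use entropy (information gain) instead of the default Gini impurity,
# and bound the leaf size rather than the depth to limit overfitting
tree_entropy = DecisionTreeClassifier(criterion="entropy",
                                      min_samples_leaf=5,
                                      random_state=42)
tree_entropy.fit(X, y)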

Visualize the tree

Now, visualize your tree. Execute the following code:

In [ ]:
from sklearn.tree import export_graphviz

export_graphviz(
        tree_clf,
        out_file="iris_tree.dot",
        feature_names=iris.feature_names[2:],
        class_names=iris.target_names,
        rounded=True,
        filled=True
    )

To view your tree, open a command line and find the directory with iris_tree.dot. Convert the dot file to a PDF or PNG:

$ dot -Tpdf iris_tree.dot -o iris_tree.pdf
$ xpdf iris_tree.pdf
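
If the dot or xpdf commands are not available on your machine, one possible workaround (assuming the graphviz Python package is installed) is to render the .dot file directly inside the notebook:

In [ ]:
import graphviz

with open("iris_tree.dot") as f:
    dot_source = f.read()
graphviz.Source(dot_source)   # displays the tree inline when run in a notebook cell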

What features are in your decision tree? How many nodes? What are the class distributions for the leaves?
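
If you want to double-check your reading of the diagram, the fitted tree exposes a few attributes you can print (a quick sketch, not required for the lab):

In [ ]:
print(tree_clf.tree_.node_count)   # total number of nodes (internal + leaves)
print(tree_clf.tree_.max_depth)    # depth the tree actually reached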

Visualize the decision boundary

In [ ]:
from matplotlib.colors import ListedColormap

def plot_decision_boundary(clf, X, y, axes=[0, 7.5, 0, 3], iris=True, legend=False, plot_training=True):
    x1s = np.linspace(axes[0], axes[1], 100)
    x2s = np.linspace(axes[2], axes[3], 100)
    x1, x2 = np.meshgrid(x1s, x2s)
    X_new = np.c_[x1.ravel(), x2.ravel()]
    y_pred = clf.predict(X_new).reshape(x1.shape)
    custom_cmap = ListedColormap(['#fafab0','#9898ff','#a0faa0'])
    plt.contourf(x1, x2, y_pred, alpha=0.3, cmap=custom_cmap, linewidth=10)
    if not iris:
        custom_cmap2 = ListedColormap(['#7d7d58','#4c4c7f','#507d50'])
        plt.contour(x1, x2, y_pred, cmap=custom_cmap2, alpha=0.8)
    if plot_training:
        plt.plot(X[:, 0][y==0], X[:, 1][y==0], "yo", label="Iris-Setosa")
        plt.plot(X[:, 0][y==1], X[:, 1][y==1], "bs", label="Iris-Versicolor")
        plt.plot(X[:, 0][y==2], X[:, 1][y==2], "g^", label="Iris-Virginica")
        plt.axis(axes)
    if iris:
        plt.xlabel("Petal length", fontsize=14)
        plt.ylabel("Petal width", fontsize=14)
    else:
        plt.xlabel(r"$x_1$", fontsize=18)
        plt.ylabel(r"$x_2$", fontsize=18, rotation=0)
    if legend:
        plt.legend(loc="lower right", fontsize=14)

plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf, X, y)
plt.plot([2.45, 2.45], [0, 3], "k-", linewidth=2)
plt.plot([2.45, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.plot([4.95, 4.95], [0, 1.75], "k:", linewidth=2)
plt.plot([4.85, 4.85], [1.75, 3], "k:", linewidth=2)
plt.text(1.40, 1.0, "Depth=0", fontsize=15)
plt.text(3.2, 1.80, "Depth=1", fontsize=13)
plt.text(4.05, 0.5, "(Depth=2)", fontsize=11)

save_fig("decision_tree_decision_boundaries_plot")
plt.show()

Understanding

  1. Compare your decision tree to the decision space and note any correspondence.
  2. You can return later and alter your tree model (e.g., different depth, different selection criterion). Regenerate your figure and compare. You may want to rename the figure so that it does not get overwritten each time.

Predicting classes and class probabilities

Once you have a model (tree), you'll want to use it to make predictions. The following two lines of code showcase two different predictions on a test example x=[5,1.5]:

In [ ]:
tree_clf.predict_proba([[5, 1.5]])
tree_clf.predict([[5, 1.5]])
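
Note that a notebook cell only echoes the value of its last expression, so as written you will only see the output of predict. To see both results, wrap each call in print (or run them in separate cells):

In [ ]:
print(tree_clf.predict_proba([[5, 1.5]]))   # probability for each of the three classes
print(tree_clf.predict([[5, 1.5]]))         # the single most likely class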

Understanding

  1. What is the difference between these two options?
  2. How are probabilities determined?
  3. Trace this example through the tree diagram you generated earlier; do you get the same results as the code above?

Sensitivity to training set details

Decision trees tend to be sensitive to small changes in the data set. We'll run an experiment to understand this. First, let's remove the widest Iris-Versicolor flower from the training set (petal length of 4.8cm and petal width of 1.8cm):

In [ ]:
X[(X[:, 1]==X[:, 1][y==1].max()) & (y==1)] # view the example that is the widest Iris-Versicolor flower
In [ ]:
not_widest_versicolor = (X[:, 1]!=1.8) | (y==2) # boolean mask selecting examples that are not the widest Iris-Versicolor
X_tweaked = X[not_widest_versicolor] #create a training set with the widest petals removed
y_tweaked = y[not_widest_versicolor]

tree_clf_tweaked = DecisionTreeClassifier(max_depth=2, random_state=40) #Retrain
tree_clf_tweaked.fit(X_tweaked, y_tweaked)

Now let us plot the new result:

In [ ]:
plt.figure(figsize=(8, 4))
plot_decision_boundary(tree_clf_tweaked, X_tweaked, y_tweaked, legend=False)
plt.plot([0, 7.5], [0.8, 0.8], "k-", linewidth=2)
plt.plot([0, 7.5], [1.75, 1.75], "k--", linewidth=2)
plt.text(1.0, 0.9, "Depth=0", fontsize=15)
plt.text(1.0, 1.80, "Depth=1", fontsize=13)

save_fig("decision_tree_instability_plot")
plt.show()

Understanding

How does this differ from the previous decision boundary? Is there a way to mitigate this effect?
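
If you are stuck on the mitigation question, one direction worth exploring (a sketch, not part of the required exercise) is to average the predictions of many trees trained on random subsets of the data, i.e. a random forest:

In [ ]:
from sklearn.ensemble import RandomForestClassifier

# each tree sees a bootstrap sample of the data, so no single example
# can swing the overall decision boundary as much
forest_clf = RandomForestClassifier(n_estimators=100, random_state=42)
forest_clf.fit(X, y)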