Lab 10: Machine learning project
Due May 8 by midnight


Machine Learning Contests

In the past decade, a large amount of work in machine learning has been motivated by various contests and challenges. One of the best known was the Netflix prize (official site, Wikipedia), which offered $1M to the team that could improve the accuracy of the site's recommendation system by 10%. The Netflix prize was claimed in 2009; since then machine learning contests have become commonplace.

For this project you will take on a machine learning challenge of your choice from Kaggle. Some of these contests are currently active, with prizes available. However, you are welcome to work on contest problems that have expired or (with approval) on another machine learning problem of your choosing. On Kaggle, you can see old contests under all competitions.

Many of these challenges involve large data sets that could quickly blow through your disk quota. To avoid this, you can save them to /scratch (instructions), which is unlimited but isn't backed up. Also, take a look at the department's suggestions for long-running jobs.

In order to download data, you will need to sign up for a free account. Kaggle also has a discussion forum, which may have useful suggestions, especially if you are working on an active contest.

Tools

For this project you should not use the code that you have written in previous labs to implement machine learning algorithms. Instead, you should make as much use as you can of the machine learning libraries scikit-learn and keras. We have seen both of these libraries before, in labs 7 and 8. Keras is an excellent resource for neural networks; scikit-learn should be your go-to library for all other machine learning algorithms.

Scikit-Learn

Scikit-learn is a collection of Python libraries that implement a large number of machine learning algorithms. We previously used the sklearn implementation of support vector machines in lab 8. Scikit-learn has a huge collection of classification, clustering, and regression techniques. Here is the documentation for sklearn implementations of many of the algorithms we have studied:

Scikit-learn also has modules for preprocessing data and for evaluating models with cross validation. Feel free to poke around in the scikit-learn documentation for other tools and algorithms.
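
As a rough illustration of how the preprocessing and cross validation modules fit together, here is a minimal sketch. The synthetic data from make_classification, the choice of logistic regression, and the 5-fold setting are all assumptions made for the example; substitute your own contest data and model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a contest data set (assumption: replace with your own X, y).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# Keeping the scaler inside the pipeline means it is re-fit on each training fold,
# so nothing from the held-out fold leaks into the preprocessing step.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# 5-fold cross validation: returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy:", scores.mean())
```

The same pattern should work with any scikit-learn estimator in place of LogisticRegression.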

Keras

Keras provides an easy-to-use Python interface to the TensorFlow and Theano deep learning libraries. In lab 7, we used keras to train neural networks on the MNIST handwritten digit data set. Scikit-learn has a rudimentary neural network implementation, but if you plan on using neural networks in your project, you should use Keras.

Ensemble Methods

The Netflix prize was open for nearly three years, during which time several research teams improved prediction scores using a variety of machine learning methods. The prize was ultimately won by a meta-group of research teams whose meta-algorithm ran each team's classifier in parallel and made recommendations based on their joint results. This outcome demonstrates the power of ensemble learning.

Ensemble learning is based on the idea that combining the output of many weak classifiers can make a strong classifier that outperforms all of its component parts. Ensemble learning is feasible as long as all of the component classifiers are useful (they perform better than random guessing) and not entirely redundant (they sometimes give different answers).
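
To see why combining weak classifiers can help, the following sketch simulates a majority vote on a two-class problem. It assumes something real classifiers never fully satisfy: that each component is correct independently with the same probability. The function name majority_vote_accuracy and the 60% accuracy figure are hypothetical, chosen only for illustration.

```python
import random

def majority_vote_accuracy(n_classifiers, p_correct, n_trials=5000, seed=0):
    """Estimate how often a strict majority of independent classifiers is right."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(n_trials):
        # Each classifier is right with probability p_correct, independently.
        correct_votes = sum(rng.random() < p_correct for _ in range(n_classifiers))
        if correct_votes > n_classifiers / 2:
            wins += 1
    return wins / n_trials

# Odd ensemble sizes avoid ties in the vote.
for n in (1, 11, 101):
    print(n, "classifiers:", majority_vote_accuracy(n, 0.6))
```

Under these assumptions the vote becomes more accurate as the ensemble grows. In practice the component classifiers make correlated errors, so the gain is smaller, which is why the components should not be redundant.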

Two common methods for ensemble learning are boosting and bagging. Boosting works by training many weak classifiers (such as shallow decision trees) in sequence, with each new classifier concentrating on the training points that earlier ones misclassified, and then taking a weighted vote or weighted average over their outputs in order to classify a new point. Bagging works by training many highly specific classifiers (such as deep decision trees) on random subsets of the data, and again taking a vote or average over their output labels.

You are expected to use at least one ensemble method in your project. The ensemble method may or may not be the best learning algorithm for your task, but your writeup should report the results of testing the ensemble against its component algorithms. Several variations on boosting and bagging are implemented by scikit-learn. Documentation can be found in the ensemble methods section.
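
One possible way to set up the ensemble-versus-component comparison is sketched below, using scikit-learn's AdaBoostClassifier and BaggingClassifier. The synthetic data set, the tree depths, the number of estimators, and the 5-fold cross validation are assumptions for illustration, not requirements of the lab.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for a contest data set (assumption: replace with your own X, y).
X, y = make_classification(n_samples=500, n_features=20, n_informative=5,
                           random_state=0)

# Component classifiers: a weak one for boosting, a highly specific one for bagging.
stump = DecisionTreeClassifier(max_depth=1, random_state=0)
deep_tree = DecisionTreeClassifier(max_depth=None, random_state=0)

# The base estimator is passed positionally because its keyword name
# differs between scikit-learn versions.
models = {
    "stump alone": stump,
    "boosted stumps": AdaBoostClassifier(stump, n_estimators=50, random_state=0),
    "deep tree alone": deep_tree,
    "bagged deep trees": BaggingClassifier(deep_tree, n_estimators=50, random_state=0),
}

# Mean 5-fold cross validation accuracy for each model.
results = {name: cross_val_score(model, X, y, cv=5).mean()
           for name, model in models.items()}
for name, acc in results.items():
    print(f"{name}: {acc:.3f}")
```

Whether the ensembles actually beat their components depends on the data, so a table like this one is the kind of evidence the writeup should report.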

Submitting

Before the deadline, you need to submit the following things through git:

In addition, you must turn in a hard copy of the writeup pdf outside my office.

In the LaTeX file, project.tex, you will describe your project. This file already contains a basic structure that you should follow. Feel free to change the section headings, or to add additional sections. Recall that you use pdflatex to convert the LaTeX into a pdf file. Here is a template for your paper.

As your project develops and you create more files, be sure to use git to add, commit, and push them. Run git status to check that all of the necessary files are being tracked in your git repo. Don't forget to update the README so that I can test your code!