Project Option 1: Kaggle Contests


Machine Learning Contests

In recent years, a large amount of work in machine learning has been motivated by various contests and challenges. One of the earliest and best known was the Netflix prize (official site, Wikipedia), which offered $1M to the team that could improve the site's recommendation system by 10%. The Netflix prize was claimed in 2009; since then machine learning contests have become commonplace.

For this project you will take on a machine learning challenge of your choice from Kaggle. Some of these contests are currently active, with prizes available. On Kaggle, you can see all past and current contests here.

Many of these challenges involve large data sets that could quickly blow through your disk quota. To avoid this, you can save them to /scratch (instructions), which is unlimited, but isn't backed up. Also, take a look at the department's suggestions for long running jobs.

In order to download data, you will need to sign up for a free account. Kaggle also has a discussion forum, which may have useful suggestions, especially if you are working on an active contest.

Tools

For this project you should not use the code that you have written in previous labs to implement machine learning algorithms. Instead, you should make as much use as you can of the machine learning libraries scikit-learn and Keras, both of which we have seen before, in labs 7–9. Keras is an excellent resource for neural networks; scikit-learn should be your go-to library for all other machine learning algorithms.

Scikit-Learn

Scikit-learn is a collection of Python libraries that implement a large number of machine learning algorithms. We previously used the sklearn implementations of linear regression and KNN in lab 7. Scikit-learn has a huge collection of classification, clustering, and regression techniques, and its documentation describes sklearn implementations of many of the algorithms we have studied.

Scikit-learn also has modules for preprocessing data and for evaluating models with cross validation. Feel free to poke around in the scikit-learn documentation for other tools and algorithms.
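To make this concrete, here is a minimal sketch of a typical scikit-learn workflow that chains preprocessing and a classifier and scores the result with cross validation. The file name train.csv and the column name label are placeholders for whatever your chosen contest actually provides.

    # Minimal sklearn workflow: load data, scale features, fit KNN,
    # and estimate accuracy with 5-fold cross validation.
    # "train.csv" and the "label" column are placeholders.
    import pandas as pd
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.model_selection import cross_val_score

    data = pd.read_csv("train.csv")
    X = data.drop(columns=["label"])
    y = data["label"]

    # Putting the scaler and classifier in one pipeline keeps the scaler
    # from being fit on the held-out fold during cross validation.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
    scores = cross_val_score(model, X, y, cv=5)
    print("mean accuracy: %.3f" % scores.mean())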

Keras

Keras provides an easy-to-use Python interface to the TensorFlow and Theano deep learning libraries. Scikit-learn has a rudimentary neural network implementation, but if you plan on using neural networks in your project, you should use Keras.
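As a rough sketch (not a recipe for any particular contest), a small feed-forward Keras network for a hypothetical ten-class classification problem might look like the following. The input dimension, layer sizes, and training settings are placeholders, and the random arrays stand in for real contest data.

    # A minimal Keras sketch: a small feed-forward network for a
    # hypothetical 10-class problem. All sizes below are placeholders.
    import numpy as np
    from keras.models import Sequential
    from keras.layers import Dense

    X_train = np.random.rand(1000, 20)              # stand-in features
    y_train = np.random.randint(0, 10, size=1000)   # stand-in integer labels

    model = Sequential()
    model.add(Dense(64, activation="relu", input_dim=20))
    model.add(Dense(10, activation="softmax"))
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    model.fit(X_train, y_train, epochs=10, batch_size=32)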

Ensemble Methods

The Netflix prize was open for nearly three years, during which time several research teams improved prediction scores using a variety of machine learning methods. The prize was ultimately won by a meta-group of research teams whose meta-algorithm ran each team's classifier in parallel and made recommendations based on their joint results. This outcome demonstrates the power of ensemble learning.

Ensemble learning is based on the idea that combining the output of many weak classifiers can make a strong classifier that outperforms all of its component parts. Ensemble learning is feasible as long as all of the component classifiers are useful (they perform better than random guessing) and not entirely redundant (they sometimes give different answers).

Two common methods for ensemble learning are boosting and bagging. Boosting works by training many weak classifiers (such as shallow decision trees) in sequence, reweighting the training data after each round so that later classifiers concentrate on the examples earlier ones misclassified; a new point is then classified by a weighted vote over their outputs. Bagging works by training many highly specific classifiers (such as deep decision trees) on random bootstrap samples of the data, and again taking a plurality vote or average over their output labels.
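As a rough illustration of the contrast, both strategies can be built from decision trees in scikit-learn. The synthetic data below is only a stand-in for a real contest's training set, and note that recent scikit-learn releases rename the base_estimator keyword to estimator.

    # Contrasting boosting (shallow trees) with bagging (deep trees).
    # Synthetic data stands in for real contest data; newer sklearn
    # versions use "estimator" instead of "base_estimator".
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
    from sklearn.tree import DecisionTreeClassifier

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    # Boosting: 100 depth-1 "stumps" trained in sequence, each round
    # reweighting the data to focus on earlier mistakes.
    boost = AdaBoostClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=1),
        n_estimators=100).fit(X, y)

    # Bagging: 100 unpruned trees, each fit to a bootstrap sample.
    bag = BaggingClassifier(
        base_estimator=DecisionTreeClassifier(max_depth=None),
        n_estimators=100).fit(X, y)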

You are expected to use at least one ensemble method in your project. The ensemble method may or may not be the best learning algorithm for your task, but your writeup should report the results of testing the ensemble against other algorithms. Several variations on boosting and bagging are implemented by scikit-learn. Documentation can be found in the ensemble methods section.
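As a sketch of what that comparison might look like, the following scores an ensemble method against a non-ensemble baseline with 5-fold cross validation; the synthetic data is again just a stand-in for your contest's training set.

    # Comparing an ensemble against a simple baseline by cross validation.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import AdaBoostClassifier
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import cross_val_score

    X, y = make_classification(n_samples=500, n_features=20, random_state=0)

    for name, clf in [("AdaBoost", AdaBoostClassifier(n_estimators=100)),
                      ("logistic regression", LogisticRegression())]:
        scores = cross_val_score(clf, X, y, cv=5)
        print("%s: %.3f" % (name, scores.mean()))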

Choosing a Contest

Kaggle competitions vary widely in what sort of data and instructions are provided. You should therefore think carefully about the competition you choose: not just "is it a cool problem?" but also "how hard will this data be to work with?" and "how clearly are the expectations of the competition defined?". The contests listed below are ones that I expect to be feasible options, but you are welcome to explore others as well. Please check with me and describe your plan of attack before you get too involved in a contest not on this list. Note that the "getting started" challenges are unlikely to be acceptable options.

Submitting

Before the deadline, you need to submit the following through git: your code, an updated README, and your writeup (project.tex along with the compiled pdf).

In addition, you must turn in a hard copy of the writeup pdf outside my office.

In the LaTeX file, project.tex, you will describe your project. This file already contains a basic structure that you should follow. Feel free to change the section headings, or to add additional sections. Recall that you run pdflatex project.tex to convert the LaTeX source into a pdf file.

As your project develops and you create more files, be sure to use git to add, commit, and push them. Run: git status to check that all of the necessary files are being tracked in your git repo. Don't forget to update the README so that I can test your code!