Lab 10: Machine learning project
Due May 9 by midnight


Starting point code

You are encouraged to work with a partner on this project. You are also encouraged to discuss approaches and share ideas with other groups, but each group's code and writeup must be produced independently. Please read the following instructions carefully.

  1. First, you need to run setup63 to create a git repository for the lab.

    If you want to work alone, run:
    setup63-Lisa labs/10 none

    If you want to work with a partner, one of you needs to run the following command while the other waits for it to finish:
    setup63-Lisa labs/10 partnerUsername
    Once the script finishes, the other partner should run it on their account.

  2. For the next step, only one partner should copy over the starting point files:
    cd ~/cs63/labs/10
    cp -r /home/meeden/public/cs63/labs/10/* .
    

  3. Whether you are working alone or with a partner, you should now add, commit, and push these files as shown below:
    git add *
    git commit -m "lab10 start"
    git push
    

  4. If you are working with a partner, your partner can now pull the changes in:
    cd ~/cs63/labs/10
    git pull
    

This week, the starting point directory includes just two files: project.tex and README.

In the LaTeX file, project.tex, you will describe your project. We have provided a basic structure that you should follow. Feel free to change the section headings or to add additional sections. Recall that you use pdflatex to convert the LaTeX into a pdf file. Here is a template for your paper.

In the README file you will give instructions about how to test the code that you create.

Machine Learning Contests

In the past decade, a large amount of work in machine learning has been motivated by various contests and challenges. One of the best known was the Netflix prize (official site, Wikipedia), which offered $1M to the team that could improve the site's recommendation system by 10%. The Netflix prize was claimed in 2009; since then machine learning contests have become commonplace.

For this project you will be taking on a machine learning challenge of your own choosing. If you want to work on the Netflix prize data set, it is available in my public directory: ~bryce/cs63/labs/netflix/. Other sources of ML contests include Kaggle and ChaLearn. Some of these contests are currently active, with prizes available. However, you are welcome to work on contest problems that have expired or on other machine learning problems of your choosing. On Kaggle, you can see old contests by selecting All Competitions and checking the completed box.

Many of these challenges involve large data sets that could quickly blow through your disk quota. To avoid this, you can save them to /scratch (instructions), which is unlimited, but isn't backed up.

Scikit-Learn

You are welcome to make use of your own implementations from previous labs, or of any additional machine learning algorithms. However, for all of the algorithms we have studied and many more, there are excellent publicly available Python libraries.

Scikit-learn is a collection of Python libraries that implement a large number of machine learning algorithms. You have already encountered it once, in lab 6, when the SVM code was imported from sklearn. The following links give documentation on using scikit-learn for many of the machine learning algorithms we have studied this semester.

Scikit-learn also has modules for preprocessing data and for evaluating models with cross validation. Feel free to poke around in the scikit-learn documentation for other tools and algorithms.
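
As a rough illustration of the preprocessing and cross validation modules mentioned above, here is a minimal sketch. The iris data set, the choice of an SVM, and the parameter values are assumptions made for the example, not part of the lab; substitute your own contest data and model.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Built-in toy data set, used here only as a stand-in for contest data.
X, y = load_iris(return_X_y=True)

# Chain the preprocessing step and the classifier so that the scaler
# is refit on the training portion of every fold.
model = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))

# 5-fold cross validation; returns one accuracy score per fold.
scores = cross_val_score(model, X, y, cv=5)
print("fold accuracies:", scores)
print("mean accuracy: %.3f" % scores.mean())
```

Putting the scaler inside the pipeline, rather than scaling all of the data once up front, keeps information from the held-out fold from leaking into training.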

Ensemble Methods

The Netflix prize was open for nearly three years, during which time several research teams improved prediction scores using a variety of machine learning methods. The prize was ultimately won by a meta-group of research teams whose meta-algorithm ran each team's classifier in parallel and made recommendations based on their joint results. This outcome demonstrates the power of ensemble learning.

Ensemble learning is based on the idea that combining the output of many weak classifiers can make a strong classifier that outperforms all of its component parts. Ensemble learning is feasible as long as all of the component classifiers are useful (they perform better than random guessing) and not entirely redundant (they sometimes give different answers).

Two common methods for ensemble learning are boosting and bagging. Boosting works by training many weak classifiers (such as shallow decision trees) in sequence on the same data, reweighting the training examples so that each new classifier concentrates on the points its predecessors got wrong, and then taking a weighted vote or weighted average over their outputs in order to classify a new point. Bagging works by training many highly specific classifiers (such as deep decision trees) on random subsets of the data, sampled with replacement, and again taking a vote or average over their output labels.
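
The two approaches can be sketched with scikit-learn roughly as follows. This is only one possible setup: the synthetic data set, the number of estimators, and the use of AdaBoost as the boosting variant are all assumptions chosen for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier, BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data, standing in for a real contest data set.
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Boosting: AdaBoost's default weak learner is a depth-1 decision tree.
boosted = AdaBoostClassifier(n_estimators=50, random_state=0)
boosted.fit(X_train, y_train)

# Bagging: unrestricted (deep) trees, each fit on a bootstrap sample.
bagged = BaggingClassifier(DecisionTreeClassifier(max_depth=None),
                           n_estimators=50, random_state=0)
bagged.fit(X_train, y_train)

# Accuracy of each ensemble on the held-out test split.
boosted_acc = boosted.score(X_test, y_test)
bagged_acc = bagged.score(X_test, y_test)
print("boosting accuracy: %.3f" % boosted_acc)
print("bagging accuracy:  %.3f" % bagged_acc)
```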

If it is appropriate to your selected machine learning task, you should try using an ensemble learning method. Your writeup should report the results of testing the ensemble against its component algorithms. Several variations on boosting and bagging are implemented by scikit-learn. Documentation can be found in the ensemble methods section.
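
One way to set up the comparison between an ensemble and its components is sketched below, using a simple majority-vote ensemble. The data set, the three component classifiers, and their parameters are assumptions picked for the example; your own project will likely call for different choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import VotingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Built-in data set, used only as a placeholder for your own task.
X, y = load_breast_cancer(return_X_y=True)

# Three different kinds of component classifier.
components = [
    ("tree", DecisionTreeClassifier(max_depth=3, random_state=0)),
    ("knn", make_pipeline(StandardScaler(),
                          KNeighborsClassifier(n_neighbors=5))),
    ("nb", GaussianNB()),
]

# Hard voting: predict the label chosen by the most components.
ensemble = VotingClassifier(estimators=components, voting="hard")

# Score every component and the ensemble with the same 5-fold split.
results = {}
for name, clf in components + [("ensemble", ensemble)]:
    results[name] = cross_val_score(clf, X, y, cv=5).mean()
    print("%-8s mean accuracy: %.3f" % (name, results[name]))
```

The ensemble is not guaranteed to beat every component on a given task, which is why the writeup should report both sets of numbers.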

Proposing Your Project

This project is meant to be open-ended and allow you to choose what machine learning topics you would like to explore further. If none of the contests sound appealing, you may propose alternative projects. Extending any of the machine-learning-related labs could be the basis for a project, and I am also open to other suggestions. If you want to pursue a non-contest project, be sure to talk to me about your ideas before you start significant coding.

Submitting your code

As your project develops and you create more files, be sure to use git to add, commit, and push them. Run git status to check that all of the necessary files are being tracked in your git repo. Don't forget to update the README so that I can test your code!

Please turn in a hard copy of your writeup pdf outside my office before the deadline.