Your project must include work related to the field of machine learning — it should relate to what we have covered in the course in terms of assignments and core lecture materials, but it should go beyond the basic course content in some way. It should also be considerably different from previous work you have done in other courses or for research. If you want to build on an existing framework (yours or someone else's), you must be explicit about your unique contributions in your proposal. You can and should work with the instructors throughout the process of brainstorming and designing a proposal.
Project Scope
The first key piece of advice is to make sure the goal is feasible and manageable in the time you have for the semester. For example, project failures often stem from not having a good data set, from an approach that doesn't match the problem well, or from a debugging/implementation process that requires more than 3 weeks of work. Look for papers related to your idea to ensure it is feasible. If it is difficult to find related work, treat it as a warning sign. Usually, it means it isn't a suitable machine learning task, but it could also mean that you need to learn a bit more about the idea first (e.g., you may be searching the wrong keywords).
The second key piece of advice is that it’s extremely difficult to guess how long things will take when you don’t have much practice doing them, so you’re likely to make mistakes when predicting what’s 'feasible in the time available.' Therefore, you should ensure that your plan has many different 'endpoints' that are staged in sequence, with the easiest ones being things you are confident you can complete within a few days of starting your project, and the hardest being 'stretch-goals' that you’d like to complete, but think you probably won’t have time for. This way, however badly you over- (or under-) estimate the time it will take to do each part of your project, when the end of the semester rolls around you’ll have something to write a report and create a presentation about.
You are welcome (and encouraged) to use existing libraries and software if you want to spend less time on algorithm implementation and more on experiments and analysis. We can help with this if the libraries are not already on our systems.
If you are using libraries that make use of GPU resources, let us know and we can help guide you to the CS machines that have good GPUs in them; for GPU-enabled libraries, this can lead to several orders of magnitude speed-up. Deep learning is especially likely to require a strong GPU, but there are a number of other algorithms that can leverage this resource if you use the right libraries.
Guidance
Within these bounds, there is a great deal of flexibility in the style of project you can consider. All papers are expected to have the following components:
- Introduction - describe the problem, why it's interesting, and who might be impacted by it. Provide background on existing work as appropriate.
- Algorithm and Methods - describe the algorithms (e.g., in pseudocode), and analyze your approach and competing approaches (e.g., what is the inductive bias? What are empirical or theoretical properties of the algorithm that are relevant?).
- Experiments and Analysis - run your chosen algorithm(s) on data and analyze the results using tools from the class (e.g., cross-validation, learning curves, PR or ROC curves, paired t-tests).
- Ethics and Implications - describe the ethical implications and real-world impacts of your research (e.g., ways your algorithm(s) might impact fairness).
Note that while all projects must do each of these to some extent, your work does not have to be perfectly balanced between these categories. For instance, your project could be very deep on algorithmic analysis (e.g., considering learning theory, or a novel framework) or could be focused on experimentation (e.g., doing a meta-review of a class of algorithms, or addressing a real-world problem that requires a good deal of effort in gathering, processing, and analyzing the data). There is a continuum of acceptable projects in these dimensions, but here are some examples:
- Use an existing implementation to study a broader area (see below). Pick 3-5 algorithms and test them on ~5 data sets. Analyze the results and what they tell you about the approaches.
- Attack a real-world dataset (not in a standard ML repository). It could be a challenge you find online or from your own interests. Develop an entire ML pipeline and discuss the series of algorithms needed for, e.g., preprocessing, standardizing, training, and visualizing results.
- Implement one or two algorithms from scratch. Analyze issues that arise in implementation and the choices you have to make (including the effect of these choices, e.g., on the hypothesis space). Pick a couple of data sets to evaluate your approach(es).
- Evaluate the algorithmic bias in a system: pick a combination of algorithms and data sets and analyze the different types of bias present in the results. Suggest ways these biases might be minimized.
Example Project Outlines
Here are some example project outlines to give you a sense of what you might do. Feel free to adapt these ideas or come up with your own!
Example 1: Fairness and machine learning
You might investigate metrics for measuring fairness in machine learning and try evaluating them on real data sets. The goal would be to evaluate different metrics for fairness and different ways of mitigating unfairness. A good place to start would be reviewing chapter 3 of the Fairness and Machine Learning book, which outlines several different mathematical definitions of fairness.
A great project report could take different directions:
Focus on metrics and experiments
- Choose at least 3 different fairness metrics and discuss their implications.
- Evaluate the metrics on at least 2 different datasets and discuss how the choice of dataset affects the fairness evaluation.
- Compare the different classification models we've discussed in class and evaluate how well they perform in terms of both accuracy and fairness.
- Test whether basic approaches like "fairness through unawareness" (removing the sensitive attribute from the dataset) are effective.
- Provide detailed results and determine whether the choice of model, or of hyperparameters, has any impact on fairness.
- Discuss the ethical implications of fairness in machine learning and which metrics are most appropriate in different contexts.
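As a concrete starting point, two commonly used metrics (demographic parity and equal opportunity, both among the definitions discussed in chapter 3 of the fairness book) can be computed directly from a model's predictions. This is a minimal sketch assuming a binary sensitive attribute coded 0/1; the function names are my own:

```python
import numpy as np

def demographic_parity_diff(y_pred, sensitive):
    """Absolute difference in positive-prediction rates between the two groups.
    A value of 0 means both groups receive positive predictions at the same rate."""
    y_pred, sensitive = np.asarray(y_pred), np.asarray(sensitive)
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return abs(rate_a - rate_b)

def equal_opportunity_diff(y_true, y_pred, sensitive):
    """Absolute difference in true-positive rates between the two groups,
    i.e., among truly positive examples, how often each group is predicted positive."""
    y_true, y_pred, sensitive = map(np.asarray, (y_true, y_pred, sensitive))
    tpr = lambda g: y_pred[(sensitive == g) & (y_true == 1)].mean()
    return abs(tpr(0) - tpr(1))
```

You would compute these on held-out test predictions alongside accuracy, so that each (model, dataset) pair gets both a performance number and one or more fairness numbers.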
Focus on algorithmic approaches
- Choose a fairness metric and methods from the existing literature for mitigating unfairness. This should include "fairness through unawareness" (removing the sensitive attribute from the dataset) and at least one more advanced method such as [post-processing](https://arxiv.org/pdf/1610.02413) or [modifying the learning algorithm itself to incorporate fairness constraints](https://arxiv.org/pdf/1507.05259).
- Choose at least 2 different datasets to evaluate your approach.
- Implement the method for at least one of the classifiers we've discussed in class and evaluate both the accuracy and fairness of the learned model.
- Provide detailed results and determine whether the method is effective at improving fairness without sacrificing too much accuracy.
- Discuss the ethical implications of fairness in machine learning and the trade-offs between accuracy and fairness.
Possible datasets: There are different options for finding a dataset. You could use a dataset that specifically has an attribute that could be considered sensitive (e.g., gender, race, age), or you could use a dataset where you can artificially specify a sensitive attribute (e.g., movie ratings, where you could consider genre as a sensitive attribute). Common datasets used in fairness research include the [COMPAS dataset](https://github.com/propublica/compas-analysis), the [Adult Income dataset](https://archive.ics.uci.edu/dataset/2/adult), and the [German Credit dataset](https://archive.ics.uci.edu/dataset/144/statlog+german+credit+data).
Example 2: Semi-supervised learning
Another good project could explore [semi-supervised learning](https://en.wikipedia.org/wiki/Weak_supervision), a branch of machine learning dedicated to problems where labels may be missing from some of the training data.
A great project report could:
- Choose 2-3 simple semi-supervised learning approaches. I would recommend starting with the two that are built into Scikit-learn: [self-training](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.SelfTrainingClassifier.html) and [label propagation](https://scikit-learn.org/stable/modules/generated/sklearn.semi_supervised.LabelPropagation.html).
- I would also recommend picking a method to implement yourself, such as [entropy minimization](https://proceedings.neurips.cc/paper_files/paper/2004/file/96f2b50b5d3613adf9c27049b2a888c7-Paper.pdf).
- Choose at least 2 datasets to evaluate your approaches. You could use standard datasets like [Iris](https://archive.ics.uci.edu/ml/datasets/iris) or [Digits](https://archive.ics.uci.edu/ml/datasets/optdigits), or you could find a dataset that is more relevant to your interests. Any dataset will work as long as you can remove some of the labels to simulate a semi-supervised learning problem, though including a dataset that actually has missing labels would be ideal.
- Implement the approaches and evaluate them on the datasets. A great way to analyze semi-supervised learning is to vary the amount of labeled data available and see how that affects performance. You could also compare the semi-supervised approaches to fully supervised approaches using only the labeled data to see if they provide any benefit.
- Report your findings and discuss which approaches worked best and under what conditions. Discuss any challenges you faced in implementing the methods and any insights you gained about semi-supervised learning.
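The label-hiding simulation described above is straightforward with Scikit-learn, which marks unlabeled points with `-1`. A rough sketch using the built-in self-training wrapper (the Iris dataset and the 70% hiding rate are just illustrative choices):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.semi_supervised import SelfTrainingClassifier

X, y = load_iris(return_X_y=True)

# Simulate a semi-supervised problem: hide ~70% of the labels.
# Scikit-learn's convention is that -1 marks an unlabeled example.
rng = np.random.default_rng(0)
y_partial = y.copy()
y_partial[rng.random(len(y)) < 0.7] = -1

# Self-training wraps any probabilistic classifier: it trains on the
# labeled points, then iteratively pseudo-labels confident predictions.
clf = SelfTrainingClassifier(LogisticRegression(max_iter=1000))
clf.fit(X, y_partial)

acc = clf.score(X, y)  # evaluated against the full (true) labels
```

Sweeping the hiding rate (e.g., from 50% to 95%) and plotting accuracy against the fraction of labeled data is one way to produce the analysis suggested above.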
Example 3: Time series forecasting
Time series forecasting is a common problem in machine learning, where the goal is to predict future values of a time series based on past observations. In this setting there aren't necessarily separate variables for features and labels. Instead, you'd typically use previous labels as the features for predicting the next label. For example, if trying to predict the local temperature on a given day, your 10 features could be the temperatures for each of the previous 10 days. This is called an autoregressive approach. You might also include additional features, such as the wind, rain, and/or humidity on each of the previous days in the temperature example.
A great project report on this topic could:
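The windowing described above can be sketched as follows (the function name is my own; any supervised regression model can then be trained on the resulting pairs):

```python
import numpy as np

def make_autoregressive_pairs(series, n_lags):
    """Turn a 1-D time series into (features, label) pairs, where the
    features for each example are the previous n_lags observations and
    the label is the observation that follows them."""
    series = np.asarray(series)
    X = np.array([series[i:i + n_lags] for i in range(len(series) - n_lags)])
    y = series[n_lags:]
    return X, y
```

For instance, with `n_lags=3` the series `[1, 2, 3, 4, 5, 6]` becomes the feature rows `[1,2,3]`, `[2,3,4]`, `[3,4,5]` with labels `4`, `5`, `6`.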
- Choose a relevant domain and dataset for time series forecasting. This could be anything from forecasting the weather to predicting disease trends. The UCI repository has a category for time series data.
- Implement a basic autoregressive approach. This means being able to divide your data into feature-label pairs as described above. You can use any of the models we've discussed in class to make the actual predictions.
- Implement a multi-step forecasting approach. For example, if we want to predict the temperature for the next 3 days, we first need to predict tomorrow's temperature and then use it as a feature for predicting the temperature 2 days out, and so on.
- Report and analyze the performance of your forecasting approach. Compare different prediction models for the task using an appropriate metric.
- Analyze other aspects of forecasting. You should determine how the accuracy changes the further out you predict, as well as how the number of previous time steps used affects accuracy.
- Optionally, you could compare to a method specifically designed for forecasting, such as ARIMA. As this would be a bit complicated to implement yourself, consider using an existing implementation.
- Report your findings and discuss what challenges you faced in implementation and what you took away about the time series forecasting problem.
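The multi-step (recursive) forecasting idea above can be sketched like this; the perfectly linear toy series and lag-3 window are illustrative choices, and any Scikit-learn-style regression model would work in place of `LinearRegression`:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def multi_step_forecast(model, history, n_lags, horizon):
    """Predict `horizon` steps ahead by feeding each prediction back
    in as a feature for the next step (recursive forecasting)."""
    window = list(history[-n_lags:])
    preds = []
    for _ in range(horizon):
        next_val = float(model.predict(np.array(window).reshape(1, -1))[0])
        preds.append(next_val)
        window = window[1:] + [next_val]  # slide the window forward
    return preds

# Toy example: a linear series windowed into lag-3 feature-label pairs
series = np.arange(30.0)
X = np.array([series[i:i + 3] for i in range(len(series) - 3)])
y = series[3:]
model = LinearRegression().fit(X, y)
forecast = multi_step_forecast(model, series, n_lags=3, horizon=3)
```

Because each prediction becomes an input to the next, errors compound with the horizon; plotting error against horizon length is a natural part of the analysis suggested above.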
Example 4: Image classification with deep learning
Image classification is one of the most visible and celebrated successes of machine learning in recent years, particularly with the rise of neural networks and deep learning. It’s surprisingly easy to implement an image classifier using the tools we already have, as an image can be represented as an observation with a large number of features, each corresponding to a single pixel value.
A great project report on this topic could:
- Choose a simple image classification dataset. Good examples that are not too large would be Street View House Numbers (SVHN) and the CIFAR-10 classification dataset. You could also consider a more specialized domain, such as wildfire detection. Visualize the images in your chosen dataset(s).
- Try applying some of the basic machine learning models that we've discussed in this class (e.g., Logistic Regression, KNN, Naive Bayes) to this problem and evaluate them with our standard metrics.
- Try implementing a neural network for this problem. You should set up a basic pipeline with PyTorch that allows you to train the model and evaluate it on test data. You can look to online examples as starting points for this.
- Try varying aspects of the network, such as the number of layers, the activation function, the number of neurons per layer, etc., and report how each of these affects performance.
- Try swapping out the basic neural network layers for convolutional layers. Report what changes you observed in the behavior of your network.
- Report your findings and discuss what challenges you faced in implementation and what you took away about image classification with deep learning.
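As a sketch of the flatten-pixels-into-features idea, here is the classical-baseline step using Scikit-learn's small built-in digits dataset (8x8 grayscale images) rather than SVHN or CIFAR-10, just to keep the example self-contained:

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# load_digits returns each 8x8 image already flattened into a
# 64-feature vector, one feature per pixel intensity.
X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=2000)
clf.fit(X_train, y_train)
acc = clf.score(X_test, y_test)
```

The same flattening works for larger datasets like CIFAR-10 (32x32x3 images become 3072 features), which gives you a baseline to compare the PyTorch network against.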
Example 5: Text classification
Text understanding is another celebrated success of machine learning in the past decade. In this project you could explore classifying pieces of text into different categories.
A great project on this topic could:
- Choose a simple text classification dataset. A common and straightforward example is sentiment classification, where the goal is to determine how positive or negative a piece of writing, like a review or tweet, is about its subject.
- Use a bag-of-words transformation to convert the text into a format suitable for input to a machine learning algorithm. You can use some of the helpers in Scikit-Learn for this.
- Test out a few different classifiers that we've learned about for this problem and report which are the most accurate.
- Analyze what one or more of these trained models tell us about the problem. A great thing to look at would be which particular words are most predictive of the positive or negative class, and whether or not this makes sense.
- Try out different variations of the bag-of-words transformation, such as term-frequency and TF-IDF weighting. Analyze which had the best performance on the task and whether they affected the analysis of which words were most predictive.
- Report all of your findings, discuss challenges you faced, and note any takeaways about the text classification problem.
Topic Suggestions
Here are some possible project ideas to get you started; note that this list is in no way exhaustive, and you are more than welcome to come up with your own ideas that are not on this list.
- Sampling bias in real-world data - examine data sets to determine how they fail to accurately represent the world and the impacts this has on models learned from the data.
- Anomaly detection - work with highly imbalanced data set(s) to find ways to classify novel examples as 'normal' or 'abnormal' based on a training sample that consists almost exclusively of 'normal' samples.
- Transfer learning or domain adaptation - algorithms that take what is learned on one task and apply it to another.
- Semi-supervised learning - utilize large, unlabeled data sets to improve supervised learning.
- Multiple-instance learning - the training examples are bags of items with one collective label for the entire bag but unknown labels for the instances in the bag (typically a negative bag means all instances are negative, while a positive bag means at least one instance is positive, though most may be negative).
- Multi-label learning - there are many properties we'd like to predict at once (i.e., an example can be a member of more than one class at a time).
- Multi-view learning - constructing models from different modalities (or types of data, e.g., images and text) and then synthesizing them into one model. For example, use captions and images to identify objects in an image.
- Active learning - human-in-the-loop interaction with learners.
- Knowledge-based learning - incorporating human advice/expertise into a model to improve learning. This has been applied to neural networks and support vector machines, among others.
- Privileged learning - some features are only available at training time. Instead of throwing them out, can we use them to help train a model on the other features?
- Deep learning - if you took the course due to interest in deep learning, you now have the tools to attack that topic in interesting ways. There are data sets (especially speech, text, and images) that can realistically be mined with neural networks. Additionally, the core topics in deep learning require an analysis of new algorithms (neural network advances are almost exclusively related to the need for regularization, since deep networks are prone to overfitting). You could consider the topic of pretraining networks with unsupervised data, or transfer learning, where weights from one task are used for a different task (e.g., classify cats vs. dogs and then apply the network to brain images).
- Dimensionality reduction or feature selection - how to eliminate features, either for visualization or to prevent overfitting.
- Regression - predicting real-valued outputs.
- Rank prediction - predict the relative ordering of items rather than a fixed category.
- Time-series or relational prediction - remove the i.i.d. assumption and predict items that are related to one another.
Domains to be Cautious About
These are some data domains that you may want to avoid because they’re likely to be difficult to get satisfying results with. In general, when humans are in competition with each other, the outcome of those competitions tends to be difficult to predict. If you really want to work on a topic like this, make sure you have a plan for how to have a successful project even if prediction turns out to be impractical (e.g. all the available learning algorithms produce very low performance results). This can involve things like a focus on trying to draw conclusions about the nature of the data (i.e. focus on what you can discover that’s interesting and worthwhile, rather than on the fact that prediction is difficult).
Here are some examples of problem domains like this:
- Outcomes of sporting events (football/soccer, basketball, etc.)
- Outcomes of video games (e-sports, Pokémon tournaments, etc.)
- Values of markets (stocks, indexes, 'prediction markets,' GDP, etc.)
- Other domains in which humans are directly competing against other humans
Resources
Look for research papers in top machine learning venues (NeurIPS, Journal of Machine Learning Research (JMLR), Int. Conference on Machine Learning (ICML), ICMLA (Applications), Euro. Conf. on ML (ECML), Int. Conf. on Data Mining (ICDM), AAAI). Most of the papers are freely available online. The library should be able to get you access to the other materials, although most authors have PDFs of their papers on their websites. For data sets:
- Prof. Gabe's page for [interesting data sources](https://gabehope.github.io/DataVizSp25/resources/datasets.html)