CS44 Lab 6: Movie Database

Due by 11:59 p.m., Tuesday, Dec 6, 2016

This assignment is to be done with your Lab Partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy. Please read over Expectations for Working with Partners on CS Lab work.


Introduction

In this lab, you will create a relational movie database and pose a set of queries using SQLite. The raw data has been extracted from the Internet Movie Database (IMDb). You will structure this data in 6 tables to represent Movies, Actors, Directors, Genres, the director(s) of each movie (DirectsMovie), and the cast of each movie (Casts). Your schema is closely related to the following ER diagram:



In addition, you will use embedded SQL to write a Python program to interface with the sqlite3 engine. While there is a bit of a learning curve to picking up the Python library for sqlite3, the SQL commands are equivalent to those you would enter on the normal command-line interface.

Lab Goals:

Lab 6 Starting point
First, find your Lab6-partner1-partner2 git repo on the GitHub server for our class: CS44-f16

Next, clone your Lab 6 git repo into your cs44/labs subdirectory:

cd
cd cs44/labs
git clone [the ssh url to your repo]
cd Lab6-partner1-partner2

If this didn't work, or for more detailed instructions on git, see the Using Git page (follow the instructions for repos on Swarthmore's GitHub Enterprise server).

Your Lab6 repo contains the following files (those that require modification are in blue):

Next, you should read a few references on using the sqlite library in Python:


Creating the movie database

In createDB.py, you will place your code to create the movie database. You will create tables, insert values into the tables, and create indices on the tables as needed to solve your queries efficiently. The code has been partially provided to help you get started. NOTE: do not create the actual database in your home directory - this will eat up your quota very quickly.
See this page for tips on where to store your data: Tips for Performance and Storage.
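
As a rough illustration of the insert and index steps, here is a minimal sketch using Python's sqlite3 module. The helper names load_movies and index_movie_titles, the index name idx_movie_title, and the two sample rows are assumptions made for this example; they are not part of the provided starter code, and the format of the raw data files is not shown here.

```python
import sqlite3

def load_movies(conn, rows):
    """Bulk-insert (id, title, year) tuples; rows stands in for parsed raw data."""
    # The ? placeholders let sqlite3 handle quoting (e.g., titles with apostrophes).
    conn.executemany("INSERT INTO Movie (id, title, year) VALUES (?, ?, ?)", rows)
    conn.commit()

def index_movie_titles(conn):
    """Create an index on Movie.title (hypothetical index name)."""
    # An index on title can speed up queries that look movies up by name.
    conn.execute("CREATE INDEX IF NOT EXISTS idx_movie_title ON Movie (title)")
    conn.commit()

# Small in-memory demonstration; the real database belongs in a file outside
# your home directory.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Movie (id INTEGER PRIMARY KEY, title VARCHAR(30), year INTEGER)"
)
load_movies(conn, [(1, "The Princess Bride", 1987), (2, "Ocean's Eleven", 2001)])
index_movie_titles(conn)
```

Using executemany with one commit at the end is usually much faster than committing after every row, which matters when loading a large dataset.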

Schema

The schema is as follows:

Actor (id, fname, lname, gender)
Movie (id, title, year)
Director (id, fname, lname)
Casts (actorID, movieID, role)
DirectsMovie (directorID, movieID)
Genre (movieID, type)

All id fields are integers, as is year. All other fields are character strings. You can use either CHAR(N) or VARCHAR(N) for the character strings. Generally, stick to a maximum of 30 characters for any name or title field; 50 for role and type; 1 character for gender.

The keys are specified in italics above. To sum up: id is the key for Actor, Movie, and Director. For the remaining relations, the primary key is the combination of all attributes. This is because an actor can appear in a movie many times in different roles; each movie can fit multiple genres; and each movie can have multiple directors.

The foreign keys should be clear from the context. Casts.actorID references Actor.id. Casts.movieID and DirectsMovie.movieID reference Movie.id. DirectsMovie.directorID references Director.id. The IMDb dataset is not perfectly clean and some entries in Genre.movieID refer to non-existing movies. This is the reality of "messy" data in the real world. In this instance, we drop the constraint rather than cleaning up the data. DO NOT specify that Genre.movieID is a foreign key.
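
To make the key rules above concrete, here is a sketch of how four of the six tables might be declared. It is one possible reading of the schema, not the required solution: the helper name create_sample_tables and the constant names are assumptions for this example, and Director and DirectsMovie are left out because they follow the same pattern. Note that SQLite only enforces foreign keys after PRAGMA foreign_keys = ON has been issued on the connection.

```python
import sqlite3

# Multi-line strings keep each CREATE TABLE statement readable.
CREATE_ACTOR = """
    CREATE TABLE Actor (
        id     INTEGER PRIMARY KEY,
        fname  VARCHAR(30),
        lname  VARCHAR(30),
        gender CHAR(1)
    )
"""

CREATE_MOVIE = """
    CREATE TABLE Movie (
        id    INTEGER PRIMARY KEY,
        title VARCHAR(30),
        year  INTEGER
    )
"""

# The primary key is all three attributes, so one actor can hold
# several roles in the same movie.
CREATE_CASTS = """
    CREATE TABLE Casts (
        actorID INTEGER,
        movieID INTEGER,
        role    VARCHAR(50),
        PRIMARY KEY (actorID, movieID, role),
        FOREIGN KEY (actorID) REFERENCES Actor(id),
        FOREIGN KEY (movieID) REFERENCES Movie(id)
    )
"""

# Deliberately no foreign key on movieID: the raw genre data
# refers to some movies that do not exist.
CREATE_GENRE = """
    CREATE TABLE Genre (
        movieID INTEGER,
        type    VARCHAR(50),
        PRIMARY KEY (movieID, type)
    )
"""

def create_sample_tables(conn):
    """Create four of the six tables on an open connection (hypothetical helper)."""
    # Turn on foreign key enforcement before any data is inserted.
    conn.execute("PRAGMA foreign_keys = ON")
    for statement in (CREATE_ACTOR, CREATE_MOVIE, CREATE_CASTS, CREATE_GENRE):
        conn.execute(statement)
    conn.commit()
```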

Required methods

You will define the following functions: You may add additional methods as needed (for example, to help with inserting). Each method must be commented and clear to follow. Please read about using multi-line strings below to avoid unreadable code.


Querying the database

In queryDB.py, you will define your queries and implement a user interface for interacting with the database. Your main method should establish a connection in a similar fashion as createDB.py: read the name of the database from the command line, check to see if the file exists (exit cleanly if it does not), establish a connection and cursor. See the example run output below.
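
The start-up steps described above could look roughly like the following sketch. The function names open_database and main, the usage message, and the error text are assumptions for illustration and may differ from the provided starter code. The explicit file check matters because sqlite3.connect will silently create a new, empty database if the file does not exist.

```python
import os
import sqlite3

def open_database(path):
    """Return (connection, cursor) for an existing database file, else None."""
    # Check first: sqlite3.connect would create an empty file for a bad path.
    if not os.path.isfile(path):
        return None
    conn = sqlite3.connect(path)
    return conn, conn.cursor()

def main(argv):
    """Return 0 on success and 1 on a usage or missing-file error."""
    if len(argv) != 2:
        print("usage: python queryDB.py <database file>")
        return 1
    opened = open_database(argv[1])
    if opened is None:
        print("Error: database file %s does not exist" % argv[1])
        return 1
    conn, cursor = opened
    # ... the menu loop would run here, using cursor ...
    conn.close()
    return 0
```

A script would typically end by calling sys.exit(main(sys.argv)) so that the exit status reflects whether the database was found.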

Next, your program should repeatedly print a menu of options until the user selects "Exit" as an option. The menu has been provided for you in printMenu. Please do not change this method. After the user enters a choice, you should call the appropriate query.
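
One way to structure the loop is a dictionary that maps each menu choice to a function, as in the sketch below. Everything here is hypothetical: run_menu, its parameters, and the choice strings are invented for this example, and the sketch assumes the menu-printing function takes no arguments, which may not match the provided printMenu.

```python
def run_menu(handlers, exit_choice, read_choice, show_menu):
    """Show the menu until exit_choice is entered; return the choices that ran.

    handlers    -- dict mapping a choice string to a zero-argument function
    exit_choice -- the choice string that ends the loop
    read_choice -- zero-argument function returning the user's raw input
    show_menu   -- zero-argument function that prints the menu
    """
    ran = []
    while True:
        show_menu()
        choice = read_choice().strip()
        if choice == exit_choice:
            break
        handler = handlers.get(choice)
        if handler is None:
            # Unknown input: report it and show the menu again.
            print("Invalid choice: %s" % choice)
            continue
        handler()
        ran.append(choice)
    return ran
```

Passing the input and menu functions in as parameters keeps the loop easy to test without typing at a terminal.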

Requirements

Be sure to follow these requirements:

Queries

You will need to answer the following queries. Since there are many ways to write the same query, I ask that you sort your final results as specified to make comparisons easier. Additional attributes are attributes that should appear in your results but are not relevant to the sort ordering.


Query # | Description | Sort order | Additional attributes
1 | List the names of all actors in the movie "The Princess Bride". | Actor's first name, Actor's last name | -
2 | Ask the user for the name of an actor, and print all the movies starring an actor with that name (only print each title once). | Movie title | -
3 | Ask the user for the names of two actors. Print the names of all movies in which those two actors co-starred (i.e., the movie starred both actors). | Movie id | Movie title
4 | List all directors who directed 500 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed. | Number of movies directed | Director's first name and last name
5 | Find Kevin Bacon's favorite co-stars. Print all actors as well as the number of movies that actor has co-starred in with Kevin Bacon (but only if they've acted together in 8 movies or more). Be sure that Kevin Bacon isn't in your results! | Number of movies co-starred | Co-star's first name and last name
6 | Find actors who played five or more roles in the same movie during the year 2010. | Number of roles, Movie title | Actor's first name and last name
7 | Programmer's Choice: develop your own query. It should be both meaningful and non-trivial - show off your relational reasoning skills! (But keep the query under a minute of run time.) | - | -
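
For queries that take user input and require a sort order, a pattern like the following sketch may help. It uses a made-up query (movies released in a given year) rather than one of the seven above, and the names MOVIES_IN_YEAR and movies_in_year are assumptions for this example. The value is passed through a ? placeholder instead of being pasted into the SQL string, and ORDER BY makes the output order predictable.

```python
import sqlite3

# A multi-line string keeps the SQL readable inside Python code.
MOVIES_IN_YEAR = """
    SELECT DISTINCT title
    FROM Movie
    WHERE year = ?
    ORDER BY title
"""

def movies_in_year(cursor, year):
    """Return the distinct titles released in the given year, sorted by title."""
    # Parameters are always passed as a sequence, even when there is only one.
    cursor.execute(MOVIES_IN_YEAR, (year,))
    return [row[0] for row in cursor.fetchall()]

# Small in-memory demonstration with invented rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Movie (id INTEGER PRIMARY KEY, title VARCHAR(30), year INTEGER)")
cur.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    [(1, "Zeta", 2010), (2, "Alpha", 2010), (3, "Alpha", 2010), (4, "Other", 1999)],
)
titles_2010 = movies_in_year(cur, 2010)
```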

Example Run

It is up to you to evaluate your query results for correctness and efficiency. However, Partial Sample Output shows partial output for some of the queries to help you understand what your program should do. The output includes an estimate (it should be close to this number) for the number of tuples queries will return, and the time it takes for my queries to run (see if you can beat my time).
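
To compare your own tuple counts and run times against the sample output, you could wrap each query in a small timing helper. The sketch below is only a suggestion: timed_query is a hypothetical name, the demonstration table and rows are invented, and the timing includes fetching all rows, which may not match how the sample times were measured.

```python
import sqlite3
import time

def timed_query(cursor, sql, params=()):
    """Run a query and return (rows, elapsed seconds including the fetch)."""
    start = time.perf_counter()
    cursor.execute(sql, params)
    rows = cursor.fetchall()
    elapsed = time.perf_counter() - start
    return rows, elapsed

# Small in-memory demonstration with invented rows.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute("CREATE TABLE Movie (id INTEGER PRIMARY KEY, title VARCHAR(30), year INTEGER)")
cur.executemany(
    "INSERT INTO Movie VALUES (?, ?, ?)",
    [(i, "Movie %d" % i, 2000 + i % 10) for i in range(100)],
)
rows, elapsed = timed_query(cur, "SELECT title FROM Movie WHERE year = ?", (2005,))
print("%d tuples in %.4f seconds" % (len(rows), elapsed))
```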


Requirements
You will submit the following:
  1. Implement createDB.py to create the database (your .db file in /local) from the raw data files. It contains the definition of schema and indices, and it loads the raw data into each relation. See details about the required methods listed above.
  2. Implement queryDB.py. This is the main user program that loads your database (.db), enters a loop asking the user to select a query, and then performs the selected query, outputting the results. See detailed requirements listed above.
  3. Create a file named queryoutput.txt that contains all the output from a run of your queryDB.py showing the results of your seven queries. Use script to capture terminal output to a file. Then use dos2unix to clean up the file, and remove any remaining stray characters by hand.
    script queryoutput.txt
    ....
    exit
    dos2unix -f queryoutput.txt
    
  4. README.md: answer questions about the design of your database in this file.

Tips and additional details


Submitting your lab
Before the due date, push your solution from one of your local repos to the remote GitHub repo.

From your local repo (in your ~you/cs44/labs/Lab6-partner1-partner2 subdirectory):

git add *.py
git add README.md
git add queryoutput.txt
git commit -m "our complete, correct, robust, and well commented solution for grading"
git push

If that doesn't work, take a look at the "Troubleshooting" section of the Using Git page. Also, be sure to complete the questions in the README.md file, then add, commit, and push it.