CS44 Lab 7: Swat Movie Database (sMDB)

Due by 11:59 p.m., Friday, May 2, 2014
`

Quick Links

This assignment is to be done with a partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy.

Introduction

In this lab, you will create a relational movie database and pose a set of queries using SQLite. The raw data has been extracted from IMDB. You will structure this data in 6 tables to represent Movies, Actors, Directors, Genres, the relationship between director(s) for each movie (DirectMovie), and the casts of each movie (StarIn). Your schema is closely related to the following ER diagram:



In addition, you will use embedded SQL to write a Python program to interface with the sqlite3 engine. While there is a bit of a learning curve to picking up the Python library for sqlite3, the SQL commands are equivalent to those you would enter on the normal command-line interface.

The objectives for your lab are:

To get started, run update44 to obtain the files for this lab. You will utilizing the following files (those that require modification are in blue):

Next, you should read a few references on using the sqlite library in Python:


Creating the movie database

In createDB.py, you will place your code to create the movie database. You will create tables, insert values into the table, and create indexes on the table as needed to solve your queries efficiently. The code has been partially provided to help you get started.

Schema

The schema is as follows:

Actor (id, fname, lname, gender)
Movie (id, name, year)
Director (id, fname, lname)
StarIn (actorID, movieID, role)
DirectMovie (directorID, movieID)
Genre (movieID, genre)

All id fields are integers, as is year. All other fields are character strings. You can use either CHAR(N) or VARCHAR(N) for the character strings. Generally, stick to a maximum of 30 characters for any name field; 50 for role and genres; 1 character for gender.

The keys are specified in italics above. To sum up: id is the key for Actor, Movie, and Director. For StarIn, there seem to be situations where an actor has credited in the same role multiple times. So do not specify any primary key. For the remaining two, the primary key is the combination of all attributes. This is because an actor can appear in a movie many times in different roles; each movie can fit multiple genres; and each movie can have multiple directors.

The foreign keys should be clear from the context. StarIn.actorID references Actor.id. StarIn.movieID and DirectMovie.movieID reference Movie.id. DirectMovie.directorID references Directors.id The IMDB dataset is not perfectly clean and some entries in Genre.movieID refer to non-existing movies. This is the reality of "messy" data in the real world. In this instance, we drop the constraint rather than cleaning up the data. Do not specify that Genre.movieID is a foreign key.

Required methods

You will define the following functions: You may additional methods as needed (for example, to help with inserting). Each method must be commented and clear to follow. Please read about using multi-line strings below to avoid line overflows.


Querying the database

In swatMovieDB.py, you will define your queries and implement a user interface for interacting with the database. Your main method should establish a connection in a similar fashion as createDB.py: read the name of the database from the command line, check to see if the file exists (exit cleanly if it does not), establish a connection and cursor.

Next, your program should repeatedly print a menu of options until the user selects "Exit" as an option. The menu has been provided for you in printMenu. Please do not change this method. After the user enters a choice, you should call the appropriate query.

Be sure to follow these requirements:

  • Each query should be defined in a function queryX where X is the number below. For example the first query below should be defined in query1().
  • Your queries should be easy to read. Your SQL commands are sent as strings through the execute method. Since these queries can be long, your strings will be messy to follow. You should use Python's multi-string format (i.e., """Long string""" to make your queries easy to parse. For example, to query the number of female actors in the Actor table:
      command = """
                SELECT COUNT(*)
                FROM Actor A
                WHERE A.gender = 'F'
                """
      db.execute(command)
      results = db.fetchall()
    	
  • You are allowed to create temporary tables to store intermediate queries. But you must drop temporary tables as soon as you complete the query.
  • You should format your results in an easy to read manner. It is advised you write a function for printing results that you can call after each query. You should first print out the column headers (take a look at db.description which has a set of tuples with the column name as the first element in each tuple). Then print out each returned tuple.

    You will need to answer the following queries:

    1. List the first and last names of all the actors who were in the movie 'The Princess Bride'
    2. Ask the user for a first name and last name for an actor, and print all the movies starring an actor with that name.
    3. Ask the user for the first and last name of two actors. Print the names of all movies in which those two actors co-starred (i.e., the movie starred both actors).
    4. List all the actors who acted in a film before 1900 and also in a film after 2000. (That is: < 1900 and > 2000).
    5. List all directors who directed 500 movies or more, in descending order of the number of movies they directed. Return the directors' names and the number of movies each of them directed.
    6. Find Kevin Bacon's favorite co-stars. Print all actors (first name and last name) as well as the number of movies that actor has co-starred with Kevin Bacon. Sort by number of movies in descending order; remove actors with fewer than 8 co-starring opportunities with Mr. Bacon. Obviously, be sure that Kevin Bacon isn't in your results.
    7. We want to find actors that played five or more roles in the same movie during the year 2010. Notice that StarIn may have occasional duplicates, but we are not interested in these: we want actors that had five or more distinct roles in the same movie in the year 2010. Write a query that returns the actors' names, the movie name, and the number of distinct roles that they played in that movie (which will be >= 5).

    Example Run

    It is up to you to evaluate results. I have provided sample output for some of the queries to help understand the output. I have provided an estimate (it should be close to this number) for the number of tuples some queries will return. Sample output is available on this page. UPDATE: New and improved Sample Output.


    Tips and additional details
    • You will want to divorce SQL errors from Python errors as much as possible. Since SQL queries are roughly the same in both sqlite and the Python interface to sqlite, you should practice running your queries using the sqlite command-line interface. Note that a DB created using Python can be inspected by loading it up in sqlite. For example:
        $ python createDB.py /scratch/asas/movie.db
        $ sqlite3 movie.db
        SQLite version 3.7.9 2011-11-01 00:52:41
        Enter ".help" for instructions
        Enter SQL statements terminated with a ";"
        sqlite> .table
        Actor        DirectMovie  Director     Genre        Movie        StarIn     
        sqlite> 
      	
    • This data set is pretty large. createdb.py may take awhile to insert all of the values and construct the indexes (~10 minutes). Your queries, however, should all run fairly quickly.
    • Coming up with a generic print function for your query results can seem difficult. The description attribute of your Cursor object (i.e., db.description) can be helpful. In particular, the length of description is equal to the number of columns in the result. The first item in each row of description is the column name. So, to print out the ith column name using 20 spaces:
      	print "%-20s" % db.description[i][0]
      	
      To print each tuple, you can also allocate 20 spaces for each attribute value. In fact, you can create a format string pretty easily:
      	formString = "%-20s " * len(db.description)
      	
      Then, fill in the templates using a tuple. For example, if you store a result into a tuple called result, you can print it out simply:
      	print formString % result
      	
    • Creating the database can take a long time. Be sure to practice your commands in the SQLite environment first. Also, if you need to add to add an index after the fact, feel free to do this manually without having to reconstruct the db from scratch. Just be sure to test this works in your original code as well before handing in.
    • Query taking a long time? Try breaking up the query into multiple pieces, saving intermediate results using the CREATE TABLE name AS ... syntax. Sqlite may not optimize nested queries well in certain cases, so if the nest is a static table, it may be faster to pre-calculate the table.
    • In some cases, you might get a query closer to a minute depending on how you write the query. If your query is under a minute, you are probably fine. There is a fast (15 seconds or less) solution for all queries; you are only required to provide a correct solution that is reasonably efficient. For example, avoid a nested correlated query if there is a simpler solution using selection.


    Submitting your lab

    Submit using handin44. Also, be sure to complete the README file.