CS44 Lab 6: Tips for Performance

This page may be updated with more tips as we go, so refer back to this page.

One performance issue is where to store a large database. You cannot store it in your home directory: it is too big, and accesses to it there would also be too slow.

There are a couple of other options for storing files on our system, and they are not all equal:

  1. Store it in a subdirectory in /local/. /local is a disk partition that is local to a particular machine. You can only access it on the particular machine; each machine has its own /local file system and contents.

  2. Store it in a subdirectory in /scratch/. /scratch is an NFS (Network File System) partition that is hosted on the CS network file server. /scratch can be accessed from any lab machine; there is a single /scratch/ partition shared by all machines on our network, so its contents are the same across all machines.

    Although /scratch sounds great, DO NOT use /scratch for this lab. The problem is that /scratch is network space, and creating a database requires frequently transferring large amounts of data over the network. This is slow.
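
If you want to see the difference for yourself, a rough timing sketch like the one below can help. It is only an illustration: the function name `time_inserts` and the table `T` are made up for this example and are not part of the lab code. It inserts a small number of rows one at a time, committing after each, and reports how long that took for a database file at the path you give it.

```python
import sqlite3
import time


def time_inserts(db_path, n_rows=200):
    """Return the seconds taken to insert n_rows one at a time into db_path.

    Hypothetical helper for illustration; the table T is not part of the lab.
    """
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS T (id INTEGER, val TEXT)")
    start = time.perf_counter()
    for i in range(n_rows):
        conn.execute("INSERT INTO T VALUES (?, ?)", (i, "x"))
        # Committing after every row makes SQLite write to the file each time,
        # which is where a slow file system hurts the most.
        conn.commit()
    elapsed = time.perf_counter() - start
    conn.close()
    return elapsed


# Example use (the paths are placeholders; use a small throwaway file
# and delete it afterwards):
# print(time_inserts("/local/me_and_pal/timing_test.db"))
```

Keep `n_rows` small if you try this on network space, and remove the test file when you are done.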

Here are some suggestions (I suggest doing #4 and either #1 or #2, depending on which part of the lab you are currently working on):

  1. Save to local space (some information about using /local and /scratch):
    $ python createDB.py /local/me_and_pal/movie.db
    
    First, create a subdirectory and set ACLs so that you and your partner can access it:
    $ mkdir /local/me_and_pal
    $ easyfacl   # follow the prompts to enter user names and the directory name
    
    some information about ACLs and permissions

    PRO: this reduces run time about 20-fold (33 minutes down to 90 seconds for my Python program).
    CON #1: /local is the hard drive of a particular machine. If you log in to a different machine, you can't get to the data. The workaround is to move the file after creation, which takes a few seconds. While annoying, you only need to do this once, after you get createDB.py working:

    $ mv /local/me_and_pal/movie.db  /scratch/me/movie.db
    
    You can also use scp to copy from one machine's /local to another's:
    # from cumin, copy movie.db in paprika's /local into cumin's /local
    [cumin] $ scp newhall@paprika:/local/me_and_pal/movie.db /local/me_and_pal/
    

    CON #2: if you are debugging, you may accidentally leave big files scattered across machines in the CS department. Be sure to clean up /local if your file creation doesn't finish completely.


  2. Use an in-memory database just for debugging, then write to disk once you have createDB.py working. This gives the same speed-up as #1 without leaving files on a bunch of disks. To use an in-memory database, create a DB connection as follows:
    connection = sqlite3.connect(":memory:")
    

  3. If you really want to get fancy, look into using transactions (change the isolation level to "DEFERRED", and wrap "BEGIN TRANSACTION" and a commit() call around your inserts). Also, execute "PRAGMA synchronous=OFF" so that SQLite does not wait for each write to be flushed to disk. Together these gave me a 2x speed-up.

  4. Insert values for a table using executemany. There are examples of this in the links provided on the lab write-up.
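
As a rough sketch of how suggestions #2 through #4 might fit together, the example below bulk-loads rows with `executemany` inside one explicit transaction. The function name `load_movies`, the `Movie` table, and its columns are assumptions made up for illustration; your schema and code will differ. Passing `":memory:"` as the path gives the in-memory database from #2, and passing a file path writes to disk instead.

```python
import sqlite3


def load_movies(db_path, rows):
    """Create a hypothetical Movie table in db_path and bulk-insert rows.

    db_path may be ":memory:" (for debugging) or a file such as one in /local.
    rows is a list of (id, title, year) tuples. Returns the open connection.
    """
    conn = sqlite3.connect(db_path, isolation_level="DEFERRED")

    # Must run outside a transaction. SQLite stops waiting for writes to
    # reach the disk, which is faster but less safe if the machine crashes.
    conn.execute("PRAGMA synchronous=OFF")

    # Illustrative schema only; not the lab's actual schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS Movie (id INTEGER, title TEXT, year INTEGER)"
    )

    # One explicit transaction around all inserts, ended by a single commit.
    conn.execute("BEGIN TRANSACTION")
    conn.executemany("INSERT INTO Movie VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


# Example use with an in-memory database while debugging:
# conn = load_movies(":memory:", [(1, "Title A", 2000), (2, "Title B", 2001)])
# print(conn.execute("SELECT COUNT(*) FROM Movie").fetchone()[0])
```

The main design point is that all the rows go in with one `executemany` call and one commit, rather than one `execute` and one commit per row.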