Handling Large Databases on the CS Network

This page may be updated with more tips as we go, so refer back to this page.

One problem related to performance is related to where to store a large database. You cannot store it in your home directory due to its size being too big (and accesses to it would be too slow here as well). You may have used /scratch/ in other courses, but reading large files over the network is slow.

Here are some suggestions (I recommend doing #4 and #2. Once you have created your db, switch to strategy #1):

Save to local space (some information about using /local and /scratch):
```
$ python createDB.py /local/me_and_pal/movie.db
```
First you should create a subdirectory and set acls for you and your partner to access it:
```
mkdir /local/me_and_pal
easyfacl   # and follow prompts to enter user names and directory name
```
some information about acls and permisions

PRO: this reduces run time about 20-fold (33 minutes down to 90 seconds for my python program)

CON #1: /local is the hard drive for a particular machine. If you log in to a different machine, you can’t get to the data. The work around is to remember your machine and ssh to it every time.
Or, move the file after creation which takes a few seconds. While annoying, you only need to do this once after you get createDB.py working:
```
$ mv /local/me_and_pal/movie.db  /scratch/me/movie.db
```
You can also use scp to copy from one machine’s /local to another:
```
# from cumin, cp movie.db in paprika's /local into cumin's /local
[cumin] $ scp soni@paprika:/local/me_and_pal/movie.db/ /local/me_and_pal/ /local/me_and_pal/
```
CON #2: if you are debugging you may accidentally leave some big files all over machines in the CS department. Be sure to clean up the /local disk if your file creation doesn’t finish completely
Use an in-memory database just for debugging, then write to disk once you have createDB.py working. This gives the same speed-ups as #1 without leaving files on a bunch of disks. To use in memory, create a DB connection as follows:
```
connection = sqlite3.connect(":memory:")
```
If you really want to get fancy, look into using transactions (changed isolation level to "DEFERRED", wrap "BEGIN TRANSACTION" and a commit() statement around your inserts). Also, execute "PRAGMA synchronous=OFF" to reduce the concurrency frequency. These gave me 2x speed-up.
insert values for a table using executeMany. There are examples of this on the links provided on the lab write-up.