CS97 Lab 1: Designing a short experiment

Due: 11:59 p.m., Tuesday, 14 September.

This assignment has three main goals: (1) to familiarize you with database interactions, SQL, and query processing (2) to give you some hands-on practice measuring the performance of a database from the viewpoint of a database user, and (3) to give you practice documenting an experiment of your own design.

I've installed PostgreSQL and given you Postgres accounts on the lab computers, but you may use any modern relational database and any machine if you wish. You can log into the database on the lab computers using psql and then type Postgres or SQL commands at the prompt. I recommend that you work at the machine and not log in remotely -- or at least make sure that no one else is using the lab computer -- because your performance evaluation might use significant resources on the machine.

Useful Postgres commands are \help, \?, \d, \dt, \di, and \q.



1. Populate a database with data

Find a data source that you can use to populate a simple database, or use your favorite C compiler to compile my boring pseudorandom data generator, createEmployees.c. Your data should satisfy: (1) some non-primary-key data attribute must have a wide range of values, and (2) the data must be big enough to elicit interesting performance properties. For your experiments below to be interesting, you'll want to use a database large enough so that queries access data from disk.

Create an appropriate database schema using SQL. You'll probably want something like

CREATE TABLE tablename (        -- a table named tablename
    id INTEGER PRIMARY KEY,     -- an integer as the primary key
    foo VARCHAR,                -- a variable-length string called foo
    bar FLOAT8,                 -- a floating point number bar,
    t TIMESTAMP                 -- a time stamp, which you probably don't need
);    
where tablename, id, foo, bar, t, etc. are your own names.

For some of the performance evaluation, you'll want to create a secondary index on some column. The most basic format of the command is:

CREATE INDEX indexname ON tablename(columnname); 
where indexname, tablename, and columnname are your own names.

Populate your database with

\copy tablename  FROM 'filename'
or some variant of the COPY FROM command if your file is not tab-delimited. (Be warned that you might have permissions problems on the lab machines if you need to use the COPY FROM command.)




2. Determining the DB's access method

Consider a SQL range query that selects data from some relation based on data accessible from a secondary index:

SELECT * FROM tablename WHERE foo > somevalue;
or perhaps
SELECT * FROM tablename WHERE foo > somevalue AND foo < someothervalue;
depending on your data source. (Make sure that you've created a secondary index for column foo!)

For these range queries, in what circumstances does the database use a sequential scan of the data? When does it use the secondary index? Design and conduct an experiment to detect at what point the database changes its query plan. The Postgres \timing command is probably useful. You can check the actual query plan with the EXPLAIN command, but your experiment must not rely on EXPLAIN. (EXPLAIN's output is somewhat esoteric, but you can tell whether the query uses a sequential scan or the secondary index without needing to understand the details.)

After you've conducted the experiments above, run the ANALYZE command. This helps the database query optimizer make sane choices when generating an execution plan for your queries. Re-run your experiment. How does this affect your findings?



3. Document your experiments

Submit a written description of your experiment and any source code you use. Your goal is to document your experiment such that, from your written description, I understand and could easily reproduce your work. (I might try.) Be sure to include a description of your data and the commands you used to generate and load the database, as well as your experimental design and any result data needed to support your findings.

There is no guideline for how long your evaluation should be. Your goal is to describe your experiments as succinctly as possible while still being clear.