CS97: Lab 2

Lab 2 is the third part of a four-part series in which you will implement a query analyzer that selects the (estimated) best access method for range search queries. In this part you will experimentally analyze the performance of a B+ tree implementation. Your goal is to formulate a mathematical model of its performance to aid your implementation of the query analyzer in lab 3.

Compared to the previous labs, this lab is much less directed and requires minimal coding. Your main deliverable is a written description of your mathematical model and your experimental derivation of it. For this lab you are required to work with a partner.

A performance model for an unclustered B+ tree index

Your goal is to develop a performance model for using an unclustered B+ tree index for simple range selection queries such as

  SELECT * FROM Employee WHERE salary > min AND salary < max

for some provided min and max. You may assume that the key for the B+ tree is the column that constrains the range for the query (here, salary) and that the value in the B+ tree is the RID of the record in a separate heap file. To execute a query like this using such a B+ tree, the database system first uses the B+ tree to obtain the record IDs for all matching records, and then uses each matching RID to retrieve that record from the heap file.
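The two-step execution described above can be sketched as follows. This is a minimal illustration only: Rid, EmployeeRow, HeapFile, SalaryIndex, and rangeSelect are hypothetical stand-ins built on std::multimap and std::vector, not the lab's BPlusTree or BufferedRelation classes. The strict comparisons follow the example query (salary > min AND salary < max).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-ins for illustration; these are NOT the lab's classes.
struct Rid { std::size_t page; std::size_t slot; };
struct EmployeeRow { std::string name; int salary; };
using HeapFile = std::vector<std::vector<EmployeeRow>>;  // pages of records
using SalaryIndex = std::multimap<int, Rid>;             // salary -> RID

// Returns the records whose salary lies strictly between min and max.
std::vector<EmployeeRow> rangeSelect(const SalaryIndex& index,
                                     const HeapFile& heap,
                                     int min, int max) {
    std::vector<EmployeeRow> result;
    if (min >= max) return result;  // empty range, nothing can match

    // Step 1: use the index to find the RIDs of all matching keys.
    auto first = index.upper_bound(min);  // first key > min
    auto last = index.lower_bound(max);   // first key >= max

    // Step 2: follow each RID into the heap file. With an unclustered
    // index, consecutive RIDs may point at unrelated pages.
    for (auto it = first; it != last; ++it) {
        const Rid& rid = it->second;
        result.push_back(heap[rid.page][rid.slot]);
    }
    return result;
}
```

Because step 2 follows one RID per matching record, the number of heap-file page accesses in this sketch grows with the number of result records rather than with the number of pages that happen to hold them.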

Your performance model should predict the time needed to execute a range selection query as a function of various input parameters. You should consider at least the following parameters as you develop your model:

number of data pages in the relation
number of records per page in the relation
typical fan-out of the B+ tree internal nodes
number of result records in the range selected by the query

In Section 8.4 of Ramakrishnan and Gehrke, the first property is labeled B, the second R, and the third F. If we label the fourth property Q and assume that I/O is the dominant cost of a query and that each disk access requires 0.015 seconds, then Ramakrishnan and Gehrke's performance model predicts that the cost of a range selection query is:

  0.015 * (log_F(0.15 B) + Q) seconds
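As a starting point for your own model, the formula above can be written as a small function. This is a sketch of that textbook-style formula only; the function name predictedSeconds and its parameter names are my own, and the default of 0.015 seconds per I/O is the assumption stated above, not a measured value for the Robot Lab machines.

```cpp
#include <cassert>
#include <cmath>

// Predicted query time in seconds: secondsPerIO * (log_F(0.15 B) + Q).
//   B = data pages in the relation, F = fan-out of internal nodes,
//   Q = number of result records in the selected range.
double predictedSeconds(double B, double F, double Q,
                        double secondsPerIO = 0.015) {
    assert(B > 0.0 && F > 1.0 && Q >= 0.0);
    // Change of base: log_F(x) = ln(x) / ln(F).
    const double descent = std::log(0.15 * B) / std::log(F);
    return secondsPerIO * (descent + Q);
}
```

For instance, with hypothetical values F = 100, 0.15 B = 10,000, and Q = 1,000, the formula gives 0.015 * (2 + 1000) = 15.03 seconds. The Q term dominates, which is one reason to examine how the number of result records affects your measurements.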

You should develop a similar performance model, justified both by theory and by experimental evidence gathered by measuring the performance of range selection queries on computers in the Robot Lab. After deciding which parameters are worthy of investigation, you should design and execute a series of experiments that vary each parameter independently and measure the performance of range selection queries using the B+ tree index.

At the conclusion of this lab you should have a performance model that, given explicit values for the parameters you deem relevant, predicts the time (in seconds) needed to execute a query on a database and index that matches those parameters. You should evaluate the accuracy of your performance model on data based on the performance characteristics of the computers in the Robot Lab.
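One possible way to turn measurements into model coefficients is an ordinary least-squares fit of measured time against a single varied parameter (for example, Q). The sketch below assumes that a straight line is a reasonable shape for that relationship, which your data may or may not support. The names LinearFit, fitLine, and relativeError are hypothetical helpers, not part of the provided lab files.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct LinearFit { double intercept; double slope; };

// Ordinary least-squares fit of y = intercept + slope * x.
LinearFit fitLine(const std::vector<double>& x, const std::vector<double>& y) {
    assert(x.size() == y.size() && x.size() >= 2);
    const double n = static_cast<double>(x.size());
    double sumX = 0.0, sumY = 0.0, sumXX = 0.0, sumXY = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sumX += x[i];
        sumY += y[i];
        sumXX += x[i] * x[i];
        sumXY += x[i] * y[i];
    }
    const double denom = n * sumXX - sumX * sumX;
    assert(denom != 0.0);  // the x values must not all be identical
    const double slope = (n * sumXY - sumX * sumY) / denom;
    return { (sumY - slope * sumX) / n, slope };
}

// Relative error of a prediction against a measured time.
double relativeError(double predicted, double measured) {
    assert(measured != 0.0);
    return std::fabs(predicted - measured) / std::fabs(measured);
}
```

To evaluate accuracy, fit on one set of measurements and compute relativeError on queries that were not used in the fit; a model that only matches the points it was fitted to has not been tested.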

A B+ tree implementation

For this lab I've provided a templated B+ tree implementation, BPlusTree<K>, that uses memory-mapped I/O to index from some key type K to the BufferedRelation's RIDs. Like the BufferedRelation of lab 1, this BPlusTree implementation is unrealistic in that it both implements the B+ tree index and also manages memory allocation for the index files, a service that would ordinarily be performed database system-wide by an overall buffer pool manager. The supplied BPlusTree maintains a pool of in-memory index pages and selects an arbitrary page for replacement on disk when necessary.

Like the BufferedRelation, the BPlusTree constructor takes two arguments: the index name and the number of pages the BPlusTree is allowed to memory-map at any given time. The BPlusTree also provides a similar interface for inserting index records and executing range selection queries:

void insert(K key, RID value): inserts a key-value pair into the index. The key does not need to be unique within the index.
vector<RID> getRange(K min, K max): given minimum and maximum key values, returns an STL vector containing all RIDs associated with keys in that range (inclusive).

I've also provided a revised populate.cc program that populates an Employee database and B+ tree salary index, and an indexsearch.cc program that uses the B+ tree and heap file to implement range selection queries. See the instructions for populate and the analogous search in lab 1 for how to use these programs. The performance of the indexsearch program (and modifications thereof, for non-Employee data) is what you should measure for developing and evaluating your performance model.

The files supplied in this lab are intended to supplement your lab 1 solution; thus, I did not redundantly supply any files that you should already have from lab 1. You will gain access to the lab 2 files after you and your partner both submit your lab 1 solution.

Advice

As with lab 1, use a /local directory on some computer in the Robot Lab.
To determine the time needed to run a program, use the Unix time command. For example:
```
  $ time ./indexsearch 50000 52000
```
would report the time needed by the ./indexsearch program.
When determining the effect of an independent variable on the performance of the B+ tree, vary only one independent variable at a time. To vary the number of records per page in the relation, you'll need to use a record type other than the Employee class as provided in lab 1. To vary the fan-out you would need to index on a different-sized field than the salary of an Employee.
The provided BPlusTree implementation has a known bug if the search key is so large that only one key fits per index page (i.e., if keys are about 2KB or larger). Avoid testing conditions with extremely large keys.
Industrial-strength B+ tree implementations include a bulk-loading feature that greatly improves performance when inserting many index key-value pairs at once. The provided B+ tree implementation does not support bulk loading, and thus initializing a large index can be slow. For reference, creating a heap file and salary index for 10 million Employees required about 2.5 minutes on my home computer, but attempting to generate an index for 100 million Employees caused the BPlusTree implementation to thrash badly. If you want, you may attempt to adjust the buffer size allocated for the BPlusTree in populate.cc, but this might not succeed either: the provided BPlusTree does not check the success of the mmap system call, and asking for too many buffers will eventually cause the OS to refuse to allocate additional memory-mapped pages for the process.
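Regarding the timing advice above: if you modify indexsearch for other record types, one alternative to the Unix time command is to time the query inside the program, which leaves process start-up out of the measurement. The sketch below is an assumption-laden illustration, not part of the provided code. medianSeconds is a hypothetical helper, and the callable passed to it stands in for whatever query you want to measure. Taking the median of several runs reduces the influence of occasional outliers.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <functional>
#include <vector>

// Runs `work` the given number of times and returns the median
// wall-clock duration of a single run, in seconds.
double medianSeconds(const std::function<void()>& work, int repetitions) {
    assert(repetitions > 0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(repetitions));
    for (int i = 0; i < repetitions; ++i) {
        const auto start = std::chrono::steady_clock::now();
        work();
        const auto stop = std::chrono::steady_clock::now();
        samples.push_back(std::chrono::duration<double>(stop - start).count());
    }
    std::sort(samples.begin(), samples.end());
    return samples[samples.size() / 2];  // upper median when the count is even
}
```

Keep in mind that repeated runs of the same query may be served from pages that are already in memory, so the first run and later runs can measure different things. Decide which case your model is meant to describe and report it.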

Requirements of your experimental description

Your model should predict the time needed to execute a range search query on a B+ tree, as a function of whatever inputs you determine to be relevant.
Support your mathematical model with visual data, such as graphs that describe the results of your relevant experiments.
You must use LaTeX and the provided style files and hand in your document as a single PDF. Also include the LaTeX source and all image files needed to build your document.
Include any relevant references in a bibliography. You do not need to provide a Related Work section or other greater context for this lab.
There is no minimum or maximum length for your written work. I expect that most descriptions will be 2-3 pages in the LaTeX format.

Submitting your work

Use handin97 lab2 to submit your work. As with lab 1, please do not submit any large data files when you hand in your work.

CS97 Lab 2: Performance analysis of an unclustered B+ tree index