CS97 Lab 2: Performance analysis of an unclustered B+ tree index

Due: 11:59 p.m., Sunday, 18 September.

Lab 2 is the third part of a four-part series in which you will implement a query analyzer that selects the (estimated) best access method for range search queries. In this part you will experimentally analyze the performance of a B+ tree implementation. Your goal is to formulate a mathematical model of its performance to aid your implementation of the query analyzer in lab 3.

Compared to the previous labs, this lab is much less directed and requires minimal coding. Your main deliverable is a written description of your mathematical model and your experimental derivation of it. For this lab you are required to work with a partner.



A performance model for an unclustered B+ tree index

Your goal is to develop a performance model for using an unclustered B+ tree index for simple range selection queries such as

  SELECT * FROM Employee WHERE salary > min AND salary < max
for some provided min and max. You may assume that the key for the B+ tree is the column that constrains the range for the query (here, salary) and that the value in the B+ tree is the RID of the record in a separate heap file. To execute a query like this using such a B+ tree, the database system first uses the B+ tree to obtain the record IDs for all matching records, and then uses each matching RID to retrieve that record from the heap file.

Your performance model should predict the time needed to execute a range selection query as a function of various input parameters. You should consider at least the following parameters as you develop your model:

  1. number of data pages in the relation
  2. number of records per page in the relation
  3. typical fan-out of the B+ tree internal nodes
  4. number of result records in the range selected by the query
In Section 8.4 of Ramakrishnan and Gehrke, the first property was labeled B, the second property R, and the third property F. If we label the fourth propery Q, assume that I/O is the dominant cost of a query and that each disk access requires 0.015 seconds, then Ramakrishnan and Gehrke's performance model would predict that the cost of a range selection query is:
  0.015 * (log_F(0.15 B) + Q) seconds

You should develop a similar performance model justified both by theory and experimental evidence by measuring the perfomance of range selection queries on computers in the Robot Lab. After deciding which parameters are worthy of investigation, you should design and execute a series of experiments that vary each parameter independently and measure the performance of range selection queries using the B+ tree index.

At the conclusion of this lab you should have a performance model that, given explicit values for the parameters you deem relevant, predicts the time (in seconds) needed to execute a query on a database and index that matches those parameters. You should evaluate the accuracy of your performance model on data based on the performance characteristics of the computers in the Robot Lab.



A B+ tree implementation

For this lab I've provided a templated B+ tree implementation, BPlusTree<K>, that uses memory-mapped I/O to index from some key type K to the BufferedRelation's RIDs. Like the BufferedRelation of lab 1, this BPlusTree implementation is unrealistic in that it both implements the B+ tree index and also manages memory allocation for the index files, a service that would ordinarily be performed database system-wide by an overall buffer pool manager. The supplied BPlusTree maintains a pool of in-memory index pages and selects an arbitrary page for replacement on disk when necessary.

Like the BufferedRelation, the BPlusTree constructor takes two arguments: the index name and the number of pages the BPlusTree is allowed memory map at any given time. The BPlusTree also provides a similar interface for inserting index records and executing range selection queries:

I've also provided a revised populate.cc program that populates an Employee database and B+ tree salary index, and an indexsearch.cc program that uses the B+ tree and heap file to implement range selection queries. See the instructions for populate and the analagous search in lab 1 for how to use these programs. The performance of the indexsearch program (and modifications thereof, for non-Employee data) is what you should measure for developing and evaluating your performance model,

The files supplied in this lab are intended to supplement your lab 1 solution; thus, I did not redundantly supply any files that you should already have from lab 1. You will gain access to the lab 2 files after you and your partner both submit your lab 1 solution.



Advice



Requirements of your experimental description



Submitting your work

Use handin97 lab2 to submit your work. As with lab 1, please do not submit any large data files when you hand in your work.