CS97: Lab 3

This lab is the culmination of a four-part series. Here you will combine your previous results to write a query analyzer that selects the (estimated) best access method for range selection queries, using either a B+ tree or a sequential scan of a relation's heap file.

If you run update97 you will obtain starting files for this lab. Those files partially implement a templated OptimizedSearch class, which already implements the actual range queries using either your BufferedRelation or the provided BPlusTree from the previous labs.

Your task is to complete the OptimizedSearch class by implementing the cost model that determines which access method is used. To do this you must complete three OptimizedSearch functions:

analyze(): Performs offline analysis of the relational data.
estimateIndexCost(K min, K max): Returns the estimated cost (in seconds) of a range selection query using the B+ tree for the given range.
estimateScanCost(): Returns the estimated cost (in seconds) of a range selection query using a sequential scan of the relation's heap file.

A query optimization overview

Query optimization is typically done in two separate phases:

Data analysis: a procedure which is executed periodically -- e.g., nightly, weekly, or monthly -- to analyze the characteristics of the relational data. This phase is executed offline and is permitted to be slow; a typical data analysis might need several seconds to be completed.
Query analysis: a procedure which is executed for each query to determine the best access methods for that query. Because this procedure is executed before each query, it must be relatively fast; a typical query analysis might be allowed only milliseconds or microseconds.

The goal of data analysis is usually to build look-up tables for whatever properties are needed by the query analysis, so that the query analysis can quickly and accurately estimate the cost of a given query.

For example, suppose that your B+ tree performance model from Lab 2 depends on the number of results for the given range selection query. To accurately estimate the cost of the query, therefore, you would need to first estimate the number of results of the query, without actually executing the query itself.

One way to estimate the number of results is to assume that the data is uniform and estimate the minimum and maximum key values in the entire relation, using those minimum and maximum key values to estimate the width of the relation. The number of results would then be

                                   (width of range being selected)
   (# of records in relation) *  -------------------------------------
                                 (width of range of entire relation)

If the data were not uniform, though, this technique could wildly misestimate the number of results. A better data analysis technique is to sample the relation's data and compute a histogram of the key values. This histogram can be built by sampling a relatively small portion of the relation, which can be easily accomplished in the few seconds typically allowed for the offline data analysis. When a query is being analyzed, the query analyzer can use the histogram to quickly estimate the number of query results.

Your work in this lab

Complete the analyze, estimateIndexCost, and estimateScanCost methods in optimizedsearch.inl to accurately select between the heap file scan or B+ tree index search.

Because you want to analyze the relational data separately from computing a query result, you might want to store the results of your data analysis persistently between executions of your program. The provided OptimizedSearch class opens and closes a file for you to use, if you wish. You can use this file by writing any data you wish to store in the OptimizedSearch destructor, immediately before the file is closed, and re-reading that file data on the program's next execution inside the OptimizedSearch constructor. See the comments in optimizedsearch.inl for more details.

When you are done, evaluate the performance of your implementation using Employee data generated by the generate-data.py program. In addition to your source code, please turn in a graph of your system's performance and a short (several paragraph) written evaluation of your work.

Please test your work with both the default (uniform) and --skewed options for generate-data.py. Your solution should choose a reasonable access method for both data distributions.

Submitting your work

Use handin97 lab3 to submit your implementation and your evaluation of your work. As with the previous labs please do not submit any large data files when you hand in your work.

CS97 Lab 3: Implementing a query analyzer