This lab is the second lab in a four-part series in which you will implement several data access methods and a query cost estimator to choose the (estimated) best access method for simple data queries. This lab has two parts:
You will create and use large data files to measure the time needed to sequentially scan large relations. Do not create these large files in your home directory, or any sub-directory thereof. Files in your home directory (and its sub-directories) count against your file quota and are stored on a non-local network drive, which might corrupt your performance measurements.
Instead, choose a computer in the robot lab and create a directory /local/<username> (where <username> is your username) on its local disk drive. Because this directory is on that computer's local drive, you will only be able to access the directory when logged into that specific computer.
In a typical database management system the physical storage model is implemented separately from the buffer pool manager, which reads pages from disk to memory and synchronizes and releases pages from memory after they are no longer needed. A typical buffer pool manager manages the buffer pool for the entire database system, not just for a single relation or file.
In this lab assignment you will implement a simplified BufferedRelation which defines the physical layout of a relational heap file and is responsible for the memory-mapping and synchronization of just that file (unlike with a typical buffer pool manager). The main data structure used by the BufferedRelation is the RelationalPage, a templated class that defines the memory layout and access methods for each single page stored in the heap file. In other words, the BufferedRelation is responsible for the bulk of the work in this assignment -- the layout, memory-mapping, synchronization, and unmapping for I/O of the overall relation -- while each RelationalPage is responsible just for the data stored on a single page within the heap file.
To simplify solutions for this lab and Lab 2, we will make several major assumptions:
To complete the first part of the assignment you should implement the classes defined in relationalpage.h and bufferedrelation.h, writing their implementations in relationalpage.inl and bufferedrelation.inl. You may add any private data members or functions you wish, but I recommend that you don't change the signature of functions I've already provided; you would also need to change some of the test code that I've provided (populate.cc and search.cc) for this lab and the next.
The BufferedRelation constructor takes two arguments:
The focus of this lab is on the disk performance of sequential scans -- not memory management techniques -- and you should not attempt to implement a complicated memory management algorithm. You should map and unmap pages in any way that results in reasonable performance for sequential scans given a realistic buffer size. (Note that the test programs allow 10000 pages, about 40 MB, to be open simultaneously.) For instance, my sample solution assumes that you are going to sequentially scan starting from a given page. If you try to access a file page that is not currently in memory, my BufferedRelation solution synchronizes and unmaps all mapped pages and then maps the next bufferSize pages starting from the page you requested.
In Lab 03 you will need to automate whether a sequential scan or index is used to answer simple range queries. To help prepare for that work, here you should measure the performance of sequential scans of your implementation as a function of the size of (i.e. number of pages in) the relational file.
To complete the second part of this lab, use the generate-data.py and populate.cc programs to create variously-sized employee databases, and then use search.cc to sequentially scan those databases to find all employees within a given salary range. Graphically plot the results of your experiment, showing the time needed to scan the database as a function of the database size.
$ python generate-data.py 1000 --skewed > data.txt $ ./populate < data.txt $ ./search 500000 20000000These programs use an Employee struct, which I've provided in employee.h and employee.cc. See generate.py for more information about its use.
char fill[<PAGESIZE> - sizeof(<THE_REST_OF_THE_DATA>)];
Use handin97 lab1 to hand in your completed work. If you code outside your cs97/labs/1 directory (such as in /local) be sure to hand in your most up-to-date solution. Please do not hand in any large data files. Also be sure to include an electronic version of the graphical plot of your system's performance in the handin directory.