CS97: Lab 1

This lab is the second lab in a four-part series in which you will implement several data access methods and a query cost estimator to choose the (estimated) best access method for simple data queries. This lab has two parts:

Use memory-mapped file I/O to implement filesystem-based physical storage for append-only relations.
Evaluate the performance of sequential scans of your relation.

You will create and use large data files to measure the time needed to sequentially scan large relations. Do not create these large files in your home directory, or any sub-directory thereof. Files in your home directory (and its sub-directories) count against your file quota and are stored on a non-local network drive, which might corrupt your performance measurements.

Instead, choose a computer in the robot lab and create a directory /local/<username> (where <username> is your username) on its local disk drive. Because this directory is on that computer's local drive, you will only be able to access the directory when logged into that specific computer.

The BufferedRelation and RelationalPage classes

In a typical database management system the physical storage model is implemented separately from the buffer pool manager, which reads pages from disk to memory and synchronizes and releases pages from memory after they are no longer needed. A typical buffer pool manager manages the buffer pool for the entire database system, not just for a single relation or file.

In this lab assignment you will implement a simplified BufferedRelation which defines the physical layout of a relational heap file and is responsible for the memory-mapping and synchronization of just that file (unlike with a typical buffer pool manager). The main data structure used by the BufferedRelation is the RelationalPage, a templated class that defines the memory layout and access methods for each single page stored in the heap file. In other words, the BufferedRelation is responsible for the bulk of the work in this assignment -- the layout, memory-mapping, synchronization, and unmapping for I/O of the overall relation -- while each RelationalPage is responsible just for the data stored on a single page within the heap file.

To simplify solutions for this lab and Lab 2, we will make several major assumptions:

Records in the relation are fixed-width and programmatically defined as a struct or class that can be used as a template parameter.
Relations are append-only: records in a relation are never modified or deleted.

These assumptions greatly simplify your data management and allow you to implement a packed page organization, like that of Figure 9.6 on page 328 of Ramakrishnan and Gehrke, without needing a bitmap or ever moving records.

To complete the first part of the assignment you should implement the classes defined in relationalpage.h and bufferedrelation.h, writing their implementations in relationalpage.inl and bufferedrelation.inl. You may add any private data members or functions you wish, but I recommend that you don't change the signature of functions I've already provided; you would also need to change some of the test code that I've provided (populate.cc and search.cc) for this lab and the next.

The BufferedRelation constructor takes two arguments:

const char* tableName, the name of the relation. This can be anything, and I use it only to determine the name of the file storing the relational data.
int bufferSize, the number of file pages the BufferedRelation is allowed to have mapped simultaneously. Your BufferedRelation implementation should respect the bufferSize limit, but this limit will not be otherwise enforced within the program.

The focus of this lab is on the disk performance of sequential scans -- not memory management techniques -- and you should not attempt to implement a complicated memory management algorithm. You should map and unmap pages in any way that results in reasonable performance for sequential scans given a realistic buffer size. (Note that the test programs allow 10000 pages, about 40 MB, to be open simultaneously.) For instance, my sample solution assumes that you are going to sequentially scan starting from a given page. If you try to access a file page that is not currently in memory, my BufferedRelation solution synchronizes and unmaps all mapped pages and then maps the next bufferSize pages starting from the page you requested.

Evaluating your implementation

In Lab 03 you will need to automate whether a sequential scan or index is used to answer simple range queries. To help prepare for that work, here you should measure the performance of sequential scans of your implementation as a function of the size of (i.e. number of pages in) the relational file.

To complete the second part of this lab, use the generate-data.py and populate.cc programs to create variously-sized employee databases, and then use search.cc to sequentially scan those databases to find all employees within a given salary range. Graphically plot the results of your experiment, showing the time needed to scan the database as a function of the database size.

Advice

You can complete this lab while editing only bufferedrelation.h, bufferedrelation.inl relationalpage.h, and relationalpage.inl.
Implement and test the RelationalPage before starting BufferedRelation. You might want to write a separate short test program to test your RelationalPage.
If you want to test your solution with files larger than 2GB, you'll need to avoid encoding offsets with ints; use long or off_t instead. Also, you'll need lseek64 instead of lseek.
After you've completed the first part of the lab you can generate a 1000-row sample employee database and search for large-salaried employees with:
```
  $ python generate-data.py 1000 --skewed > data.txt
  $ ./populate < data.txt
  $ ./search 500000 20000000
```
These programs use an Employee struct, which I've provided in employee.h and employee.cc. See generate.py for more information about its use.
A 1000-row database is small. Use a much larger database and ./search 5000000 20000000 once your implementation works for small databases. For reference, I tested my implementation with up to 100 million employees; it took about 10 minutes to populate that database on my home computer and sequential scans were reasonably fast.
Be sure that you don't allow a RelationalPage to extend past the page boundary. To help prevent bugs I recommend that each RelationalPage start and stop exactly on a file page boundary. Specifically I encountered three tricks to accomplish this:
1. Ensure that a RelationalPage exactly matches the page size by adding an unused data member to the end of the RelationalPage:
```
  char fill[<PAGESIZE> - sizeof(<THE_REST_OF_THE_DATA>)]; 
```
2. Store the BufferedRelation's persistent metadata in a different file from the relational data. For instance, for an "employee" relation my solution stores the relational data in a employee.dat file, but stores other information about the BufferedRelation (such as the number of used record slots) in a separate employee.meta file.
3. Don't make any of the RelationalPage functions virtual. If you do, C++ will add an internal vtable pointer to the RelationalPage for its implementation of the virtual functions, which will change the size of the RelationalPage.

Submitting your work

Use handin97 lab1 to hand in your completed work. If you code outside your cs97/labs/1 directory (such as in /local) be sure to hand in your most up-to-date solution. Please do not hand in any large data files. Also be sure to include an electronic version of the graphical plot of your system's performance in the handin directory.

CS97 Lab 1: An append-only relation