CS44: Database Management Systems

Lab 2: Heap Page Implementation

Due by 11:59 p.m., Saturday, October 27, 2018.

This is a partnered lab. You are to complete this lab with one other person, who must attend the same lab as you. You may discuss the lab concepts with other classmates. Please use Teammaker to set your lab partner preferences. You can choose “Random” if you do not have a partner. Remember the Academic Integrity Policy: do not show your code/solution to anyone outside of the course staff and your lab partner. Do not look at any other group’s code/solution for this lab (current or past). If you need help, please post on Piazza.

Overview

The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In the first assignment, you built a buffer manager on top of an I/O Layer that I provided. The aim was to understand the role of a system in managing main memory efficiently.

In this second component of the WiscDB labs, we will drill into the I/O layer to understand details of managing data in a file. You will complete a heap file implementation that stores data in variable-length records. This will allow you to directly simulate the physical structure of data on disk and help prepare you for future labs. Recall that a file is a collection of pages which themselves maintain a collection of records. I have provided you with the File class, which provides a logical ordering of pages for you (e.g., a linked list representation of pages). You must complete the I/O layer by implementing the Page class, which provides the physical layout of records on a disk page. The code base is quite extensive and, as with Buffer Manager lab, will require much reading of API documentation and thorough testing. You are responsible for making significant progress early on the in assignment as waiting until the last few days will not be manageable. The goals of this assignment include:

Getting Started

Click to expand

Both you and your partner will share a repo named Lab2-userID1-userID2. Note that the lab will not release until you have both marked your partner preferences on Teammaker. You should find your repo on the GitHub server for our class: CPSC44-F18

Clone your Lab 2 git repo containing starting point files into your labs directory that you created last week:

cd ~/cs44/labs
git clone [the ssh url to your your repo]
cd Lab2-userID1-userID2

If you need help, see the instructions on Using Git (follow the instructions for repos on Swarthmore’s GitHub Enterprise server).

If all was successful, you should see the following files (highlighted files require modification):

In addition, you should run make setup (just once) to create symlinks to shared libraries that are needed for your programs to run:

cd ~cs44/Lab2-userID1-userID2/
make setup

The WiscDB I/O Layer: Heap Files

The lowest layer of the WiscDB database systems is the I/O layer. This layer consists of two classes: a File and Page class. These classes use C++ exceptions to handle the occurrence of unexpected behavior. Implementation of the File class and the exception classes are provided to you. The code has been adequately commented to help you with understanding how it does what it does.

At a high-level, the File class abstracts the physical structure of files from upper-levels of the database. It stores a linked-page structure of heap-file pages that allow iteration of contents of a file and the storage of records. The File class allows the upper level of the system to:

Before reading further you should first read the documentation for the File class while reviewing the concept of a heap file in Chapter 9 of the textbook.

Heap Page

The Page class describes the structure and set of operations that can be performed on a slotted heap page with variable-length records. A page stores a set of records and supports operations such as:

Specifically, you will implement a variable-length slot representation. Review your material from lecture and the textbook. Our representation is very similar to the one depicted on Figure 9.7 in the text with the following alteration: the header and slot directory is stored at the beginning of the page (the directory grows forward) and the heap space is filled from back to front. The structure is illustrated below:

Your job in this assignment is to complete the implementation of the Page class (read the page.h file, implement page.cpp) and throughly test its mechanics. You will treat the page is a large, untyped array of bytes (e.g., a char array of size 8192), and manually format the contents. You will have to pay close attention to pointer types, and make use of sizeof to access different values within a page.

You are provided the following structures to represent a page object defined in page.h (DO NOT ADD or MODIFY DATA MEMBERS):

Implementation of the Page class

The Page class consists of several public methods that are utilized as the interface to a heap page as well as a set of private methods which are intended as either helper methods or restricted access methods to be used by friend classes (such as File and PageIterator). The following is a list of methods in no particular order. You should not simply implement each method one-by-one; you need to carefully plan your implementation in tandem with a testing strategy. Failure to do so will lead to significant problems as the data array structure is handled directly by you.

Provided methods

These methods have been implemented for you:

You should test these methods, but should not modify the implementations. I strongly suggest studying hasSpaceForRecord() to better understand the arithmetic required to manage the unformatted data array.

Basic accessor/mutator methods

Inserting a record

Records to be inserted into a page are passed as an C-style string (an array of char with a terminating '\0' character). When you insert a new record in a Heap Page, you should NOT store the terminating '\0' part of each record (i.e. the record "Tia" should take up 3 bytes of space in the page and not 4). An inserted record obtains an identifying RecordId that consists of a page number and slot id. NOTE: slot ids start at 1 not 0.

These are the set of methods involved in adding a record to a page:

Each of these steps should be completed using the helper methods as much as possible.

Retrieving a record

To obtain a record from a page, you will implement the following:

Delete a Record

There are two methods for deleting a record. We will actively compact the record space on all deletes; you will also implement compaction on the slot directory if a slot at the end of the directory is not in use and we are not doing an update.

Update Record

Updating a record can be done in multiple manners, but probably the simplest given our representation is to utilize a delete of the old record followed by a insert of the new version of the record.

void Page::updateRecord(const RecordId& record_id, const char* record_data)
This method replaces the data stored for an existing record. The new record data may not be the same size as the old record data. If it is larger, then there may not be enough space in the page to allow the update. If it is smaller or the same size, update should succeed. Updating a record should not change its RecordId. Follow these steps:

Coding and Testing

In the Buffer Manger lab, you spent most of your time understanding the tests I provided; only a few methods were not covered in those tests. For this lab, you will be given much less guidance by me. As usual, you are more than welcome to discuss testing strategies with your classmates (but not the testing code). Think carefully about how your implementation could go wrong. For example, are you properly calculating the free space available? You can test this by inserting the exact amount of data a page should fit and see if it succeeds. Then, see if you Page correctly fails if you try to insert even a little more.

You will want to try different testing strategies. I recommend utilizing a print method for debugging, where you print out all of the contents of the slot directory, header, and records. You should make your tests increasingly interconnected; that is, first try just doing inserts, than deletes, than a mixture of the two, then a mixture of updates/deletes/inserts, etc. Your tests should test corner cases (e.g., the page is 100% filled). When calculating the size of page, note that SIZE is 8192, sizeof(PageHeader) is 16 bytes (meaning data is of length 8176) and sizeof(PageSlot) is 6 bytes on our machines (2 bytes for each field). Use these values to determine exactly on how many records you can insert on a page.

Acknowledgements

The code base was provided and developed by the University of Wisconsin Database Group.

Submitting your lab

Before the due date, push your solution to github from one of your local repos to the GitHub remote repo. Only one of you or your partner needs to do this.

From your local repo (in your ~cs44/labs/Lab2-userID1-userID2subdirectory)

make clean
git add *
git commit -m "our correct, robust, and well commented solution for grading"
git push

If that doesn’t work, take a look at the “Troubleshooting” section of the Using Git page. Also, be sure to complete the README.md file, add and push it.