CS44 Lab 1: Building a Buffer Manager

Due by 11:59 p.m., Sunday, February 16, 2014
`
Introduction

This assignment is to be done with a partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy.

The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In this first assignment, you will build a buffer manager, on top of an I/O Layer that I provide. Future labs will add additional layers to the DB code.

The code base is quite extensive and will require much reading of API documentation and thorough testing. You are responsible for making significant progress early on the in assignment as waiting until the last few days will not be manageable. The goals of this assignment include:

To get started, run update44 to obtain the starting point files in ~/cs44/labs/1/. You should obtain the following files (files highlighted in blue require modification):

When you are ready to submit your lab, use handin44. Recall that only files in the ~/cs44/labs/1 subdirectory will be submitted. You may submit as many times as you wish; only the most recent copy will be saved.

The WiscDB I/O Layer

The lowest layer of the WiscDB database systems is the I/O layer. This layer allows the upper level of the system to:

This layer consists of two classes: a file (class File) and a page (class Page) class. These classes use C++ exceptions to handle the occurrence of any unexpected event. Implementation of the File class, the Page class, and the exception classes are provided to you. The code has been adequately commented to help you with understanding how it does what it does.

Before reading further you should first read the documentation that describes the I/O layer of WiscDB so that you understand its capabilities. In a nutshell the I/O layer provides an object-oriented interface to the Unix file with methods to open and close files and to read/write pages of a file. For now, the key thing you need to know is that opening a file (by passing in a character string name) returns an object of type File. This class has methods to read and write pages of the File. You will use these methods to move pages between the disk and the buffer pool.


A Buffer Manager

A database buffer pool is an array of fixed-sized memory buffers called frames that are used to hold database pages (also called disk blocks) that have been read from disk into memory. A page is the unit of transfer between the disk and the buffer pool residing in main memory. Most modern database systems use a page size of at least 8,192 bytes. Another important thing to note is that a database page in memory is an exact copy of the corresponding page on disk when it is first read in. Once a page has been read from disk to the buffer pool, the DBMS software can update information stored on the page, causing the copy in the buffer pool to be different from the copy on disk. Such pages are termed "dirty".

Since the database on disk itself is often larger than the amount of main memory that is available for the buffer pool, only a subset of the database pages fit in memory at any given time. The buffer manager is used to control which pages are memory resident. Whenever the buffer manager receives a request for a data page, the buffer manager checks to see if the requested page is already in the one of the frames that constitutes the buffer pool. If so, the buffer manager simply returns a pointer to the page. If not, the buffer manager frees a frame (possibly by writing to disk the page it contains if the page is dirty) and then reads in the requested page from disk into the frame that has been freed.

Replacement Policy and the Clock Algorithm

There are many ways of deciding which page to replace when a free frame is needed. Commonly used policies in operating systems are FIFO, MRU and LRU. Even though LRU is one of the most commonly used policies it has high overhead and is not the best strategy to use in a number of common cases that occur in database systems. Instead, many systems use the clock algorithm that approximates LRU behavior and is much faster.

The image to the left shows the conceptual idea of the clock array. Each square box corresponds to a frame in the buffer pool. Assume that the buffer pool contains numFrames frames, numbered 0 to numFrames-1. Conceptually, all the frames in the buffer pool are arranged in a circular list. Associated with each frame is a bit termed the refbit.

The algorithm is depicted in the flow chart to the right. At any point in time the clock hand (an integer whose value is between 0 and numFrames-1) is advanced (using modular arithmetic so that it does not go past numFrames-1) in a clockwise fashion. If a page is not valid (i.e., unoccupied), it is an obvious candidate for replacement. Otherwise, we check to see if the page is still pinned. We cannot remove a page from the pool it is still being used (pin count more than 0). If a page is not in use, we resort to the refbit to approximate our LRU algorithm. If the refbit is true, it has been recently unpinned so we reset the bet and move on. Otherwise, we have found a replacement. If the selected buffer frame is dirty (ie. it has been modified), the page currently occupying the frame is written back to disk. Otherwise the frame is just cleared and a new page from disk is read in to that location. Further details are available below.


Implementing the Buffer Manager

The WiscDB buffer manager uses three C++ classes: BufferManager, Frame and BufferHashTable. There is only one instance of the BufferManager class. A key component of this class is the actual buffer pool which consists of an array of numFrames frames, each the size of a database page. In addition to this array, the BufferManager instance also contains an array of numFrames instances of the Frame class that is used to describe the state of each frame in the buffer pool. A hash table is used to keep track of the pages that are currently resident in the buffer pool. This hash table is implemented by an instance of the BufferHashTable class. This instance is a private data member of the BufferManager class. These classes are described in detail below.



BufferHashTable

The BufferHashTable class is used to map file and page numbers to buffer pool frames and is implemented using chained bucket hashing. You must complete this implementation using the provided definition. The key structure is a HashItem, which stores one item in the hash table. This points to a file (File *) and stores the page number (PageId) in the file (these two combine to form the key). The HashItem also stores a frame number (FrameId) (the value associated with the key) to recover the page from the buffer pool. Lastly, the next pointer is used to implement separate chaining; the item points to the next HashItem in the bucket.



Frame

The Frame class is used to keep track of the state of each frame in the buffer pool. It is defined as follows:

First notice that all attributes of the Frame class are private and that the BufferManager class is defined to be a friend. While this may seem strange, this approach restricts access to Frame's private variables to only the BufferManager class. The alternative (making everything public) opens up access too far.

The purpose of most of the attributes of the Frame class should be pretty obvious. The dirty bit, if true indicates that the page is dirty (i.e. has been updated) and thus must be written to disk before the frame is used to hold another page. The pinCnt indicates how many times the page has been pinned. The refbit is used by the clock algorithm. The valid bit is used to indicate whether the frame contains a valid page. You you will need to implement the basic methods of the class (i.e., reset() and load()), just to ensure understanding of the purpose. reset() is invoked when a Frame is emptied and should reset all class variables to a default state. Note that rather than inventing your own constants for invalid states, you should take a look at the documentation to see if any constants have been defined already. For example, the Page class has defined a variable to indicate invalid page numbers (Page::INVALID_NUMBER). Also, the FrameId does not change for an individual frame; this should only be modified by the BufferManager.

load() is invoked after a page has been assigned to a frame; this method should set all member variables appropriately. This method is utilized for a new Page being loaded into the frame, so its pin count should be set to 1.



BufferManager

The BufferManager class is the heart of the buffer manager. This is where you write your code for this assignment. Note that, at a high level, the BufferManager manages a buffer pool of frames that contain pages. In the class definition, this is represented by two arrays: bufPool and frameTable. frameTable is the array of Frames that comprise the pool, holding meta information about each frame. The actual page being stored in a frame at a particular index (i.e., FrameId) is in the bufPool array. That is, if you would like to know the status the frame at FrameId 1, you access frameTable[1]. If you would like the Page stored at FrameId 1, you access bufPool[1]. As mentioned previously, hashTable is a directory that helps find a particular page in the pool quickly. That is, when checking the status of a certain Page object, we obtain it's FrameId by looking it up in the hash table.

Class Methods

This class is defined as follows:


Coding and Testing

We have defined this project so that you can understand and reap the full benefits of object-oriented programming using C++. Your coding style should continue this by having well-defined classes and clean interfaces. Reverting to the C (procedural) style of programming is not recommended and will be penalized. The code should be well-documented, using Doxygen style comments. Each file you are assigned to modify should start with your names and student ids, and should explain the purpose of the file. Each function should be preceded by a few lines of comments describing the function and explaining the input and output parameters and return values.

You should thoroughly test your program. This includes adding additional tests to main.cpp or defining new main programs (recommended). You will need to update the Makefile to process each test file separately.

Acknowledgements
The code base was provided and developed by the University of Wisconsin Database Group, with many modifications at the top level.
Submitting your lab

Submit using handin44. Please run make clean before submitting to keep file sizes down. Also, be sure to complete the README file.