CS44 Lab 3: Building a Buffer Manager

Due by 11:59 p.m., Wednesday, March 4, 2015
`
Introduction

This assignment is to be done with a partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy.

The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In this first assignment, you will build a buffer manager, on top of an I/O Layer that I provide. Future labs will add additional layers to the DB code.

The code base is quite extensive and will require much reading of API documentation and thorough testing. You are responsible for making significant progress early on; waiting until the last few days will not be manageable. The goals of this assignment include:

To get started, run update44 to obtain the starting point files in ~/cs44/labs/3/. You should obtain the following files (files highlighted in blue require modification):

UPDATE: To prevent clutter, all common files have been moved to /home/soni/public/cs44/bufmgr. This applies to the contents of include,lib,exceptions. You can still read these files, they are just located in one common location rather than being copied to everyone's directories.

When you are ready to submit your lab, use handin44. Recall that only files in the ~/cs44/labs/1 subdirectory will be submitted. You may submit as many times as you wish; only the most recent copy will be saved.

The WiscDB I/O Layer

The lowest layer of the WiscDB database systems is the I/O layer. This layer allows the upper level of the system to:

This layer consists of two classes: a file (class File) and a page (class Page) class. These classes use C++ exceptions to handle the occurrence of any unexpected event. Implementation of the File class, the Page class, and the exception classes have been provided for you. The code has been thoroughly documented to help you understand how it operates.

Before reading further you should first read the documentation that describes the I/O layer of WiscDB so that you understand its capabilities. In a nutshell the I/O layer provides an object-oriented interface to the Unix file system with methods to open and close files and to read/write pages of a file. You will utilize these methods to move pages into the buffer pool and higher-level methods (e.g., main.cpp to process a query.


The Buffer Manager

A database buffer pool is an array of fixed-sized memory buffers called frames that are used to hold database pages (also called blocks) that have been read from disk into memory. A page is the unit of transfer between the disk and the buffer pool residing in main memory. Most modern database systems use a page size of at least 8,192 bytes. Another important thing to note is that a database page in memory is an exact copy of the corresponding page on disk when it is first read in. Once a page has been read from disk to the buffer pool, the DBMS software can update information stored on the page, causing the copy in the buffer pool to be different from the copy on disk. Such pages are termed "dirty".

Since the database on disk itself is often larger than the amount of main memory that is available for the buffer pool, only a subset of the database pages fit in memory at any given time. The buffer manager is used to control which pages are memory resident. Whenever the buffer manager receives a request for a data page, the buffer manager checks to see if the requested page is already in the one of the frames that constitutes the buffer pool. If so, the buffer manager simply returns a pointer to the page. If not, the buffer manager frees a frame (possibly by writing the page to disk if it is dirty) and then loads the requested page from disk into the newly available frame.

Replacement Policy and the Clock Algorithm

There are many ways of deciding which page to replace when a free frame is needed. Commonly used policies in operating systems are FIFO (first in first out), MRU (most recently used), and LRU (least recently used). LRU, arguably the most useful policy, suffers from high overhead costs due to the need for a priority queue. An alternative approach, the circular array buffer, or clock algorithm, approximates LRU behavior with much better run time performance.

The image to the left shows the conceptual idea of the clock array. Each square box corresponds to a frame in the buffer pool. Assume that the buffer pool contains numFrames frames, numbered 0 to numFrames-1. Conceptually, all the frames in the buffer pool are arranged in a circular list. Associated with each frame is a bit termed the refbit.

The algorithm is depicted in the flow chart to the right. At any point in time the clock hand (an integer whose value is between 0 and numFrames-1) is advanced in a clockwise fashion. If a page is not valid (i.e., unoccupied), it is an obvious candidate for replacement. Otherwise, we check to see if the page is still pinned since do not want to remove a page from the pool that is still being used. If a page is not in use, we resort to the refbit to approximate our LRU algorithm. If the refbit is true, the page has been recently unpinned and gets a "free pass" (i.e, set the bit to false and move on). Otherwise, we have found a replacement. If the selected buffer frame is dirty (i.e., it has been modified), the page currently occupying the frame is written back to disk. Regardless, the frame is cleared and the requested page from disk is read in to the freed frame. Further details are available below.


Implementing the Buffer Manager

The WiscDB buffer manager uses three C++ classes: BufferManager, Frame and BufferHashTable. There is only one instance of the BufferManager class. A key component of this class is the actual buffer pool which consists of an array of numFrames frames, each the size of a database page. In addition to this array, the BufferManager instance also contains an array of numFrames instances of the Frame class that is used to describe the state of each frame in the buffer pool. A hash table is used to keep track of the pages that are currently resident in the buffer pool. This hash table is implemented by an instance of the BufferHashTable class. This instance is a private data member of the BufferManager class. These classes are described in detail below.



BufferHashTable

The BufferHashTable class is used to map file and page numbers to buffer pool frames and is implemented using chained bucket hashing (separate chaining). You must complete this implementation using the provided definition. The key structure is a HashItem, which stores one item in the hash table (similar to a node in a linked list). This points to a file (File *) and stores the page number (PageId) in the file (these two combine to form the key). The HashItem also stores a frame number (FrameId) (the value associated with the key) to recover the page from the buffer pool. Lastly, the next pointer is used to implement separate chaining; the item points to the next HashItem in the bucket.

You have been provided a hash function as well as a constructor. Do not modify either, but instead read and understand how the work. You will need to complete the implementation for the destructor, insert, lookup, and remove methods as explained in the documentation (either the header file or document web page). Pay attention to corner cases (e.g., empty buckets) as well as the exceptions you will need to throw for errors.



Frame

The Frame class is used to keep track of the state of each frame in the buffer pool. It is defined as follows:

First notice that all attributes of the Frame class are private and that the BufferManager class is defined to be a friend. While this may seem strange, this approach restricts access to Frame's private variables to only the BufferManager class. The alternative (making everything public) opens up access too far.

The purpose of most of the attributes of the Frame class should be pretty obvious. The dirty bit, if true indicates that the page is dirty (i.e. has been updated) and thus must be written to disk before the frame is used to hold another page. The pinCnt indicates how many times the page has been pinned. The refbit is used by the clock algorithm. The valid bit is used to indicate whether the frame contains a valid page. You you will need to implement the basic methods of the class (i.e., reset() and load()), just to ensure understanding of the purpose. reset() is invoked when a Frame is emptied and should reset all class variables to a default state. Note that rather than inventing your own constants for invalid states, you should take a look at the documentation to see if any constants have been defined already. For example, the Page class has defined a variable to indicate invalid page numbers (Page::INVALID_NUMBER). Also, the FrameId does not change for an individual frame; this should only be modified by the BufferManager.

load() is invoked after a page has been assigned to a frame; this method should set all member variables appropriately. This method is utilized for a new Page being loaded into the frame, so its pin count should be set to 1.



BufferManager

The BufferManager class is the heart of the buffer manager. This is where you write your code for this assignment. Note that, at a high level, the BufferManager manages a buffer pool of frames that contain pages. In the class definition, this is represented by two arrays: bufPool and frameTable. frameTable is the array of Frames that comprise the pool, holding meta information about each frame. The actual page being stored in a frame at a particular index (i.e., FrameId) is in the bufPool array. That is, if you would like to know the status the frame at FrameId 1, you access frameTable[1]. If you would like the Page stored at FrameId 1, you access bufPool[1]. As mentioned previously, hashTable is a directory that helps find a particular page in the pool quickly. That is, when checking the status of a certain Page object, we obtain it's FrameId by looking it up in the hash table.

Class Methods

This class is defined as follows:


BufferStats

You can ignore BufferStats for now (although implementing it is trivial). There may be a future lab that asks you to run tests on your replacement policy to see how effective it is (e.g., how many disk accesses occur if we sequentially flood the buffer pool versus using random page accesses). However, we won't have time to do this before break so it is not required for this assignment.


Coding and Testing

Keep these style and testing guidelines in mind:

Acknowledgements

Much of the code base was provided and developed by the University of Wisconsin Database Group.


Submitting your lab

Submit using handin44. Please run make clean before submitting to keep file sizes down. Also, be sure to complete the README file.