CS44 Lab 5: Building a B+-Tree Index

Due by 11:59 p.m., Sunday, April 12, 2015
`

Quick Links

This assignment is to be done with a partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy.


Introduction

A key distinction in data structures designed for database management systems is the fact that data (i.e., relations) are stored in files which reside on disk. In addition to discussing the effect of data structure design, we must also discuss each file type's particular organization and how that organization lends itself to efficient evaluation of typical operations: scan, equality search, range search, insertion, and deletion.

When it is important to access a relation quickly in more than one way, a good solution is to use an index. For this assignment, the index will store data entries in the form < key, rid> (Alternative 2 for data entries in terms of index values). The actual tuples are stored in a file separate from the index; entries in the index file "point to" the location of the tuple using a RecordId.

Two primary kinds of indexes are hash-based and tree-based, and the most commonly implemented tree-based index is the B+ tree. In this assignment, you will implement a B+ tree, including the full search and insert algorithms as discussed in class. Your insert routine must be capable of dealing with overflows (at any level of the tree) by splitting pages. The goals of this assignment include:

There are number of design choices that you need to make, and you probably need to reserve a big chunk of time for testing and debugging. So, start working on this assignment early; you are unlikely to finish this project if you start is just a week before the deadline.

WiscDB and Getting Started

The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In this first assignment, you will built a buffer manager, on top of an I/O Layer that I provide to understand the management of main memory in a DBMS. In part two, you dove into the I/O Layer to implement a heap page.

In this last part, you will construct a B+ tree class to index relations on disk. In the previous two labs, we simplified the topic of records by not directly dealing with relations. In this lab, records will be more complex. This will require you to navigate a wide-range of concepts including efficiency, relations, indexes, and the use of low-level memory. As such, the depth of design choices and testing strategy will increase significantly. Furthermore, you will re-use many of the constructs from previous labs in your implementation, In particular, you will use (modified) page representations to map nodes on to disk and you will use a buffer manager to manage the interface between your B+ tree and the actual data on disk.

The code base is quite extensive and will require much reading of API documentation and thorough testing. Note: For this lab, I have attempted to directly link all class names to their API documentation. This should make understanding the interface for a class more streamlined for you. You are responsible for making significant progress early on the in assignment as waiting until the last few days will not be manageable.

To get started, run update44 to obtain the starting point files in ~/cs44/labs/5/. You should obtain the following files (files highlighted in blue require modification):

As in previous labs, the following directories are in a shared directory (/home/soni/public/cs44/btree). They are already tied into your make targets and are readable for your reference:

When you are ready to submit your lab, use handin44. Recall that only files in the ~/cs44/labs/5 subdirectory will be submitted. You may submit as many times as you wish; only the most recent copy will be saved.


I/O Layer: Modification to Files and Pages

To help get you started, I have provided you with an implementation of few new classes: PageFile, RawFile, and FileScanner. The PageFile and RawFile classes are derived from the File class. These classes implement a file interface in two different ways. The PageFile class implements the file interface for the File class as was done in the previous two WisdDB labs - a file consists of pages formatted using the Page class (i.e., a double-linked list of heap pages). We use the PageFile class to store all the relations.

The RawFile class provides an abstraction where there is minimal internal structuring of the data. This allows us to implement our own file organization on top of the RawFile without having our data corrupted in the process. Specifically, the RawFile treats pages as completely unformatted chunks of memory (e.g., 8KB). There is no slot directory or page pointers built in. This allows us to use the page as essentially one large chunk of memory (like the data array from the previous lab). We will use the RawFile class to store the B+ index file, where every page in the file is a node from the B+ tree that we manually format.


The FileScanner class is used to scan records in a file. We will use this class for the base relation, and not for the index file. The file main.cpp file contains code which shows how to use this class. The public member functions of this class are described below.


A B+ Tree Index

Your assignment is to implement a B+ tree index. To simplify the task, we will make the following assumptions:

The index will be built directly on top of the I/O Layer (the RawFile and the Page classes). An index will need to store its data in a file on disk, and the file will need a name (so that the DB class can identify it). The convention for naming an index file is specified below. To create a disk image of the index file, you simply use the RawFile constructor with the name of the index file. Since the file is unformatted, you will need to implement a structure on top of the pages that you get from the I/O Layer to represent the nodes of the B+ tree. Furthermore, where the File class abstracted the creation a header page, you will need to directly allocate a page and format it to be the header page (i.e., store meta-data) for your index.

We'll start you off with an interface for a class, BTreeIndex. You will need to implement the methods of this interface as described below. You may add new private methods to this class if needed (hint: it should be), but do not modify the public interfaces that are described here:

Tips and Additional Details

Here is some additional information to help complete the lab:

Testing and Design

The majority of your time on this lab will be spent designing and testing your program. When designing your solution, be sure to consider the major hurdles here are two types of nodes - leaf and non-leaf nodes - and each has its own structure. You will need to design several short helper methods for common operations.

You should develop a testing strategy that parallels your design. The provided test is not useful for incremental development. You should devise smaller tests. Ideas include:

Submitting your lab

Submit using handin44. Please run make clean before submitting to keep file sizes down. Also, be sure to complete the README file.