CS44 Lab 5: Building a B+-Tree Index

Due by 11:59 p.m., Wednesday, April 16, 2014
`

Quick Links

This assignment is to be done with a partner. You may not work with other groups, and the share of workload must be even between both partners. Failing to do either is a violation of the department Academic Integrity policy.

Introduction

A key distinction in data structures designed for database management systems is the fact that data (i.e., relations) are stored in files which reside on disk. In addition to discussing the effect of data structure design, we must also discuss each file type's particular organization and how that organization lends itself to efficient evaluation of some (not all) of the following operations: scan, equality search, range search, insertion, and deletion.

When it is important to access a relation quickly in more than one way, a good solution is to use an index. For this assignment, the index will store data entries in the form < key, rid> (Alternative 2 for data entries in terms of index values). These data entries in the index will be stored in a file that is separate from the data file. In other words, entries in the index file "point to" entries in the data file where the actual records are stored.

Two primary kinds of indexes are hash-based and tree-based, and the most commonly implemented tree-based index is the B+ tree. In this assignment, you will implement a B+ tree, including the full search and insert algorithms as discussed in class. Your insert routine must be capable of dealing with overflows (at any level of the tree) by splitting pages. The goals of this assignment include:

There are number of design choices that you need to make, and you probably need to reserve a big chunk of time for testing and debugging. So, start working on this assignment early; you are unlikely to finish this project if you start is just a week before the deadline.

WiscDB and Getting Started

The goal of the WiscDB projects is to allow students to learn about the internals of a data processing engine. In this first assignment, you will built a buffer manager, on top of an I/O Layer that I provide to understand the management of main memory in a DBMS. In part two, you dove into the I/O Layer to implement a heap page.

In this last part, you will construct a B+ tree class to index relations on disk. In the previous two labs, we simplified the topic of records by not directly dealing with relations. In this lab, records will be more complex. This will require you to navigate a wide-range of concepts including efficiency, relations, indexes, and the use of low-level memory. As such, the depth of design choices and testing strategy will increase significantly. Furthermore, you will re-use many of the constructs from previous labs in your implementation, In particular, you will use (modified) page representations to map nodes on to disk and you will use a buffer manager to manage the interface between your B+ tree and the actual data on disk.

The code base is quite extensive and will require much reading of API documentation and thorough testing. Note: For this lab, I have attempted to directly link all class names to their API documentation. This should make understanding the interface for a class more streamlined for you. You are responsible for making significant progress early on the in assignment as waiting until the last few days will not be manageable.

To get started, run update44 to obtain the starting point files in ~/cs44/labs/5/. You should obtain the following files (files highlighted in blue require modification):

When you are ready to submit your lab, use handin44. Recall that only files in the ~/cs44/labs/5 subdirectory will be submitted. You may submit as many times as you wish; only the most recent copy will be saved.


I/O Layer: Modification to Files and Pages

To help get you started, we will provide you with an implementation of few new classes: PageFile, RawFile, and FileScanner. The PageFile and RawFile classes are derived from the File class. These classes implement a file interface in two different ways. The PageFile class implements the file interface for the File class as was done in the previous two WisdDB labs - a file consists of pages formatted using the Page class (i.e., a double-linked list of heap pages). We use the PageFile class to store all the relations.

The RawFile class implements the file interface for a less restricted file organization in which the pages in the file are not linked by prevPage/nextPage links, as they are in the case of the PageFile class. When reading/writing pages, the RawFile class treats the pages as blobs of 8KB size and does not require a certain structure (i.e., pages are not instances of the Page class). We will use the RawFile class to store the B+ index file, where every page in the file is a node from the B+ tree. Since no other class requires RawFile pages to be valid objects of the Page class, we can modify these pages as we wish without worrying that these pages will not be valid after their arbitrary modification. Inside the file btree.cpp you will treat the pages from a RawFile as your B+ tree index nodes, and the RawFile class will read/write pages for you from disk without modifying/using them in any way. The Page class has also been changed so that it does not use page objects to find out their page numbers.


The FileScanner class is used to scan records in a file. We will use this class for the base relation, and not for the index file. The file main.cpp file contains code which shows how to use this class. The public member functions of this class are described below.


A B+ Tree Index

Your assignment is to implement a B+ tree index. To simplify the task, we will make the following assumptions:

The index will be built directly on top of the I/O Layer (the RawFile and the Page classes). An index will need to store its data in a file on disk, and the file will need a name (so that the DB class can identify it). The convention for naming an index file is specified below. To create a disk image of the index file, you simply use the RawFile constructor with the name of the index file. Since the file is unformatted, you will need to implement a structure on top of the pages that you get from the I/O Layer to represent the nodes of the B+ tree. Furthermore, where the File class abstracted the creation a header page, you will need to directly allocate a page and format it to be the header page (i.e., store meta-data) for your index.

We'll start you off with an interface for a class, BTreeIndex. You will need to implement the methods of this interface as described below. You may add new private methods to this class if needed (hint: it should be), but do not modify the public interfaces that are described here:

Tips and Additional Details

Here is some additional information to help complete the lab:

Testing and Design

The majority of your time on this lab will be spent designing and testing your program. When designing your solution, be sure to consider the major hurdles:

You will need to design several short helper methods for common operations. In many cases, you can take advantage of templating. In others, templates help but you will need to condition your code on the attributeType - this arises due to the shortcomings of polymorphic code in real systems (e.g., the index needs to be aware of the actual type to be aware of the byte layout; key storage varies by type; nodes are not polymorphic). Your overall design may benefit from minimizing the amount of code that is dependent on the actual type. If possible, push the need to check the type as late as possible, and into specialized helper methods (see the example on Piazza for extracting keys).

You should develop a testing strategy that parallels your design. The provided test is not useful for incremental development. You should devise smaller tests. Ideas include:

Submitting your lab

Submit using handin44. Please run make clean before submitting to keep file sizes down. Also, be sure to complete the README file.