CS44 Fall 2007

Project 4: B+ Tree Index

Due Wednesday Nov. 21 before 1am (late Tues night)

A Note on the Due Date: this assignment is significantly more difficult than the first three; not only does it build upon your understanding of the previous two coding assignments (as you must make use of both the Page classes and the Buffer Manager), you are also also called upon to implement a rather intricate data structure. Accordingly, I have given you extra time to complete this project; you have roughly three weeks for this assignment and you will need it.

Introduction to the Problem
Getting Started
Implementation Details
Code Design, Testing, and Extra Credit
What to Submit

Project 4 FAQ

I'll post answers to questions I get about project 4 here:
(also, by reading through the test code to see how B+Tree functions are called, you can often figure out what the functions are supposed to do).

destroyFile(): needs to free all pages of the index file (calling Buffer Manager functions to do this), unpin all index pages from the buffer pool, and destroy_file_entry for the B+Tree index. Basically, you are freeing all disk blocks taken up by the B+Tree index, you are freeing all its in-memory state, and you are deleting the index file entry from the system catalog.
File Scan functions:
- btreefilescan::get_next should return:
  - the first matching record when the scan is first created (if called after the constructor, for example)
  - the next matching record the second call, the next the 3rd call, ...
- To delete all records between a hi and low value, create a btreefilescan object, then call in a loop:
  1. btreefilescan::get_next()
  2. btreefilescan::delete_current()
  If the user calls delete_current() twice in a row without calling get_next() the second call to delete_current() should return an error (i.e. the current record has already been deleted).
- btreefilescan::delete_current() on a new scan (you can decide if this returns an error or if it deletes the first record in the scan)
Question: what do we do on delete?
Answer: in the basic assignment, delete the (key, rid) entry from the leaf page and nothing else (it is okay to have empty pages, you do not need to merge nodes, you do not need to redistribute)
Question: why do btleaf and btindex page get_first and get_next return NOMORERECS and the btree_driver code tests for DONE but not NOMORERECS.
Answer: BTreeFileScan should call get_first and get_next on btleaf and btindex pages, and BTreeFileScan will determine if the scan is done or not based on the B+Tree structure and based on the return value of these function calls. When the scan is done, the get_next function of BTreeFileScan should return DONE.
Question: How do I know the size of string keys? what is the max size?
Answer: string keys are null terminated and the max size of keys is defined in key.h as the constant MAX_KEY_SIZE1
Key.C contains routines for building (key,dataEntry) pairs, routines that take a (key,dataEntry) pair and divide it into its two parts, and a routine to compare key fields.
header page struct The struct definition for the header page is for you to define. I gave you one with the starting point just as an example; you may completely change it. However, remember it must map on top of an HFPage, so don't make it larger than a page.
The SortedPage class is already implemented for you. It is part of the library for this assignment, which means you can't read its source code but you can see its interface in its .h file in include
If you have a HFPage object, then SortedPage, BTLeafPage, etc. are types that map onto this page-sized chunk of bytes (i.e. their methods are manipulating bytes on this HFPage object but using their field names to interpret the bytes differently than in HFPage).
If you are inside a function of one of these classes (BTLeafPage, BTIndexPage), and you want to call a function from SortedPage or HFPage, one way to do it is to re-cast the this pointer to the approriate class type, then invoke the function you want from that class:
```
	   ((SortedPage *)this)->name_of_some_method_func_in_SortedPage(args...);
		 
```
Question: what are the parameters to insertRec in btleaf_page.h:
```
    Status insertRec(const void *key, AttrType key_type, RID dataRid, RID& rid);
    
```
the parameters key and dataRid make up the K* entry you are inserting on the leaf page ( K* entry is (key, dataRid)), the parameter rid, is the rid of the K* entry this function inserted (i.e. it is the page number, slot number of the K* entry on this leaf page). The insertRec function needs to set the value of rid after inserting the K* entry on to the leaf page (this function determines the slot number into which the (key, dataRid) is being inserted)
Question: what exactly is BTreeFileScan::keysize() function supposed to do?
Answer: it should return the keysize value for the B+Tree it is scanning (keysize is passed as an argument to the BTreeFile constructor)

1 Introduction

In this assignment, you will implement a B+ tree in which leaf level pages contain entries of the form [key, rid] (Alternative 2 for data entries in terms of the textbook). You must implement the full search and insert algorithms as discussed in class. Your insert routine must be capable of dealing with overflows (at any level of the tree) by splitting pages. Deletes will be handled by simply marking the corresponding leaf entry as 'deleted'. You do not need to implement merging of nodes on deletes, nor should you do sibling redistribution on deletes and inserts; inserting into a full page always causes a split.

You will be given HFPage and SortedPage. SortedPage is derived from HFPage, and it augments the insertRecord method of HFPage by storing records on the HFPage in sorted order by a specified key value. The key value will be included as the initial part of each inserted record to enable easy comparison of the key value of a new record with the key values of existing records on a page. Read the documentation in the header files to understand what class functions do.

You need to implement two pagelevel classes, BTIndexPage and BTLeafPage, both of which are derived from SortedPage. These page classes are used to build the B+ tree index. You also will write code to create, destroy, open and close a B+ tree index, and code that will open a scan on the B+ tree, allowing its caller to iterate through all of the data entries (from the leaf pages) that satisfy some search criterion.

2 Getting Started

Copy the starting point directories and files:

% mkdir cs44/proj4
% cp -r  /home/newhall/public/cs44/proj4  cs44/proj4/.

Contents:

src subdirectory:
- Makefile: A sample Makefile for you to compile your project. Set up any dependencies (as needed) by editing this file.
- Partial templates for btfile.h, btindex page.h, btleaf page.h, and btreefilescan.h : You should complete all these .h files (as needed) and also implement the methods in the corresponding .C files (which you will write from scratch).
- main.C,btree driver.C,keys: B+ tree test driver program and the ascii key data that will used by the testing program.
- sample_output: correct test output
include subdirectory:
You can find other useful include files in the include subdirectory :bt.h, hfpage.h, sorted_page.h, index.h,test_driver.h, btree driver.h, minirel.h and new_error.h.
lib subdirectory: library needed to build this assignment

3 Design Overview

You should begin by reviewing the chapter on Tree Structured Indexing in the textbook. Make sure you understand insert, and search operations on a B+ Tree. There is also information about the B+ tree layer in the Minibase HTML documentation.

3.1 A Note on Keys for this Assignment

You should note that key values are passed to functions using void * pointers (pointing to the key values). The contents of a key should be interpreted using the AttrType variable. The key can be either a string(attrString) or an integer(attrInteger), as per the definition of AttrType in minirel.h. If the key is a string, it has a fixed maximum length, MAX_KEY_SIZE1, defined in bt.h. There are C-style functions for creating,accessing files of, and comparing keys in Key.C.

Although the specifications for some methods (e.g., the constructor of BTreeFile) suggest that keys can be of (the more general enumerated) type AttrType, you can return an error message if the keys are not of type attrString or attrInteger.

The SortedPage class implements the insertRecord method by storing records on a page in sorted order according to a specified key value. It assumes that the key value is included as the initial part of each record, to enable easy comparison of the key value of a new record with the key values of existing records on a page.

3.2 B+ Tree Page-Level Classes

There are four separate pages classes, of which you will implement two. HFPage is the base class (given), and from it is derived SortedPage. You will derive BTIndexPage and BTLeafPage from SortedPage. Note that, as in the HFPage assignment, you must not add any private data members to BTIndexPage or BTLeafPage as they have to map onto the structure of a Page (i.e. they must all be the same number of bytes).

The relationship between the the page-level classes is shown in the figure below.

HFPage: This is the base class, you can look at hfpage.h to get more details. You can use the HFPage's data member type to indicate whether an HFPage object is a leaf node or index node of the B+Tree.
SortedPage: This class is derived from the class HFPage. Its only function is to maintain records on a HFPage in a sorted order. Only the slot directory is rearranged. The data records remain in the same positions on the page (for example, when a record is deleted from a sorted page, all subsequent record's rids slotNo change by -1). This exploits the fact that the rids of index entries are not important: index entries (unlike data records) are never `pointed to' directly, and are only accessed by searching the index page.
BTIndexPage: This class is derived from SortedPage. It inserts records of the type [key, pageNo] on the SortedPage. The records are sorted by the key.
BTLeafPage: This class is derived from SortedPage. It inserts records of the type [key, dataRid] on the SortedPage. dataRid is the rid of the data record with the matching key value. The records are sorted by the key. Further, leaf pages must be maintained in a doublylinked list. Use existing HFPage data members for linking leaf pages together.

Important Note: These classes do not have constructors. The information they store must be mapped onto a Page object. To create a new page of the desired type, do the following:

     BTLeafPage *newLeafPage; 
     PageId newId; 

     // allocate a new page from the BufMgr (it will be pinned in the BP)
     Status status = MINIBASE_BM->newPage( newId, ((Page *&)newLeafPage) ); 
     if(status != OK) { return ( MINIBASE_CHAIN_ERROR(BTREE, status) ); }

     // initialize this new leaf page
     newLeafPage->init(newId); 
     ...

     // now use it and remember to unpin it when you are done with it

Look at the header files for each class for further details about the individual methods.

Lastly, you will need to create a structure to represent the header page of the B+ tree. Despite its name, the data structure used to represent the header page need not be derived from a Page object. It can be implemented simply as a C++ struct, with a field for each piece of information that must be stored in the header page. Just remember to cast pointers to this struct as (Page *) pointers when making calls to functions such as pinPage().

3.3 Other B+ Tree Classes

A BTreeFile contains a header page and a number of BTIndexPages and BTLeafPages. The header page is used to hold information about the tree as a whole, such as the page id of the root page, the type of the search key, the length of the key field(s) (which has a fixed maximum size in this assignment), etc. When a B+ tree index is opened, you should read the header page first, and keep it pinned until the file is closed. Given the name of the B+ tree index file, you can locate its header page if its file name has been registered with the DB object. The DB class has a method that lets you register this information when a B+tree index file fname is created:

    Status add_file_entry(const char* fname, PageId header_page_num);

The header page contains the page id of the root of the tree, and every other page in the tree is accessed through the root page.

The following two figures show examples of how a valid B+ Tree might look.

Figure 1 shows what a BTreeFile with only one BTLeafPage looks like; the single leaf page is also the root. Note that there is no BTIndexPage in this case.

Figure 2 shows a tree with a few BTLeafPages, and this can easily be extended to contain multiple levels of BTIndexPages as well. The leftmost pointer of each BTIndex page is stored in the prevPage field of the underlying HFPage. Also, remember that the leaf pages are linked together in a doubly-linked list as supported by the underlying HFPage implementation.

3.3.1 IndexFile and IndexFileScan

A BTree is one particular type of index. There are other types, for example a Hash index. However, all index types have some basic functionality in common. The basic index functionality is defined in a virtual base class called IndexFile. You won't write any code for IndexFile. However, any class derived from an IndexFile should support IndexFile(), Delete(), and insert(). (IndexFile and IndexFileScan are defined in include/index.h).

Likewise, an IndexFileScan is a virtual base class that contains the basic functionality that all index file scans should support.

3.3.2 BTreeFile

The main class to implement for this assignment is BTreeFile. BTreeFile is a derived class of the IndexFile class, which means a BTreeFile is a kind of IndexFile. However, since IndexFile is a virtual base class all of the methods associated with IndexFile must be implemented for BTreeFile.

The methods to implement include:

BTreeFile::BTreeFile There are two constructors for BTreeFile (as defined in btfile.h): one that only opens an existing index, and another that creates a new index on disk, with a given type and key size. Observe that the key type is passed as a value of type AttrType. For this assignment, you only need to handle keys of type attrString and attrInteger. If there is a call with a key whose type is not one of the two, return an error. The constructor should pin the header page into the bufferpool.

BTreeFile::insert The BTreeFile::insert method takes two arguments: (a pointer to) a key and the rid of a data record. The data entry to be inserted (i.e. the `record' in the leaf pages of the BTreeFile which consists of a pair [key, rid of data record]).

If a leaf page overflows (i.e., no space for the new entry), you should split the page. Do not implement re-distribution with a sibling, instead always split if a leaf page is full. You may have to insert additional entries of the form [key, id of child page] into the higher level index pages as part of a split. Note that this could recursively go all the way up to the root, possibly resulting in a split of the root node of the B+ tree. Keep in mind the difference between a "Copy up" of an index entry from a leaf to a parent node and a "Push up" of an index entry from an internal node to a parent index node.

BTreeFile::Delete The BTreeFile::Delete routine simply removes the entry from the appropriate BTLeafPage. You are not required to implement redistribution or page merging when the number of entries falls below a threshold.

In addition, you will likely want to add some private method functions to implement a solution that uses good modular design. Remember that this is a tree data structure; you should be thinking recursive algorithms.

3.3.3 BTreeFileScan

Finally, you will implement scans that return data entries from the leaf pages of the tree. You will create the scan through a member function of BTreeFile (BtreeFile::new_scan(...) as defined in btfile.h). The parameters passed to new_scan() specify the range of keys that will be scanned in the B+ tree. They are explained in detail in btfile.h.

3.4 Errors

In the previous Minibase assignments, you learned how to use the Minibase error protocol. You should use it again in this assignment. In previous assignments, all the errors you returned belonged to one of the categories in new error.h, for example BUFMGR. In this assignment, you will need to use BTREE, BTLEAFPAGE, and BTINDEXPAGE.

4 Code Design, Testing, and Extra Credit

Make sure that your solution uses good OO design, is robust, uses Minibase error protocol, is free of memory access errors and memory leaks, and is well commented. Re-read the C++ Code Style Guide if you lost points on previous assignments for bad code design or comments. A correct program only counts for part of your grade. To receive an A, your code must also be well designed, robust, and be well commented.

Implement and test methods INDEPENDENTLY and INCREMENTALLY as much as possible. For example, first try to create a B+tree with a single leaf page. Next, try to insert some data entries into this single page B+Tree index. Next, try to handle inserting into this full single leaf page B+Tree. Next, ...

To help you with incremental testing and debugging, you may want to add a debugging method that prints out the btree's contents. Also, you will want to modify the test code, perhaps commenting out almost all of it at first and adding back more and more as you add more functionality.

I will test your solutions with more than the tests in btree_driver.C, so you should also come up with your own test code after passing these tests.

To better see the test output, I suggest re-directing your btree program's output to a file and then open the file in vi/emacs. To re-direct stdout and stderr to a file named 'output_file':

	% btree >& output_file

Extra Credit

Extra credit of up to 15 percentage points is available. Do not start on this until you have completed the basic assignment! An incomplete basic assignment with an extra credit feature will receive a lower grade than a complete basic assignment with no extra credit feature.

If you work on an extra-credit solution, you should do so in a directory separate from your solution to the basic assignment. Copy your basic solution files over to your extra credit directory as a starting point. You will then submit two versions of your code, each in a separate subdirectory:

submit/basic/: contains your solution to the basic assignment and contains no extra credit code.
submit/extracredit/: contains your solution to the extra credit part(s) of the assignment. Include tests in btree_driver.C that explicitly demonstrate that your extra credit part(s) work, and sample output showing that they work. I also will want you to demo your extra credit parts to me.

The tasks are listed below with the extra percentage points they carry in parentheses:

(5 points) Support duplicate records by allowing records with the same key to exist on more than one page. (You should not use any overflow pages to support this!)
(10 points) Implement merging and node distribution during deletion, and implement node redistribution on insert.

What to Turn In

Create a submit subdirectory of your working directory, and copy into it all the source files that I need to compile, run and test your btree solution (.C, .h, and Makefile...basically the contents of your src subdirectory). In your submit subdirectory add a README file with the following information:

Your name and your partner's name

The number of late days that you have used so far

A brief description of any features or error handling that you have not implemented in your submitted solution. In addition, if you submit an incomplete solution, it would be a good idea to tell me how to run and test your code for the parts that you did complete (you may want to submit a copy of your own btree_driver.C file (name something like btree_driver_mine.C) that contains your tests).

If you implemented an extra credit part, list which one(s) you implemented, and include information telling me how I can run your tests that show that your extra credit parts work. part.

Then create a tar file of your submit subdirectory and submit it via cs44handin before the due date.