CS44 Fall 2007

Project 5: Sort Merge Join

Due Wednesday Dec. 12 before 1am (late Tues night)

You may only use 1 late day on this assignment (due before 1am on Friday Dec. 14)

Introduction

In this assignment, you will implement the sort-merge join algorithm. You should begin by reviewing the chapter on Implementation of Relational Operations, in particular, the section on Sort-Merge Join.

What You Have to Implement

class sortMerge {
  public:

    sortMerge(
      char       *filename1,    // Name of heapfile for relation R.

      int         len_in1,      // # of columns in R.

      AttrType    in1[],        // Array containing field types of R.

      short       t1_str_sizes[], // Array containing size of columns in R.

      int         join_col_in1, // The join column number of R.

      char       *filename2,    // Name of heapfile for relation S

      int         len_in2,      // # of columns in S.

      AttrType    in2[],        // Array containing field types of S.

      short       t2_str_sizes[], // Array containing size of columns in S.

      int         join_col_in2, // The join column number of S.

      char*       filename3,    // Name of heapfile for merged results

      int         amt_of_mem,   // Number of pages available for sorting

      TupleOrder  order,        // Sorting order: Ascending or Descending

      Status&     s             // Status of constructor

    );
   ~sortMerge();
};
The sortMerge constructor joins two relations R and S, represented by the heapfiles filename1 and filename2, respectively, using the sort-merge join algorithm. Note that the columns for relation R (S) are numbered from 0 to len_in1 - 1 (len_in2 - 1). You are to concatenate each matching pair of (r, s) records and write it into the heapfile filename3. Remember that the join column of relation S should not be repeated in the result. For example, if R's schema is {rnum:int, rstr:string} and S's schema is {snum:int, sstr:string} and S and R are joined on column 0 (the int attributes), then the join result schema will be {rnum:int, rstr:string, sstr:string}. If R and S are joined on column 1 (the string attributes), then the result schema is {rnum:int, rstr:string, snum:int}. The error layer for the sortMerge class is JOINS.

You will need to use the following classes which are given: Sort, HeapFile, and Scan. You will call the Sort constructor to sort the input heapfiles (which means your primary responsibility will be to implement the merging phase of the algorithm). Once a scan is opened on a heapfile, the scan cursor can be positioned to any record within the heapfile calling the Scan method position with an RID argument. The next call to the Scan method getNext will proceed from the new cursor position.

Where to Find Makefiles, Code, etc.

Copy the starting point code from ~newhall/public/cs44/proj5. The files are: Interface files are in include subdirectory. You should test your algorithm for joining on both columns of the relations. To do this, change the value of JOIN_COL in SMJTester.C and re-run the tests:
// set JOIN_COL to 1 to test joining relations on 2nd column (string field)
// set JOIN_COL to 0 to test joinging relations on 1st column (int field)
#define JOIN_COL          0
Output from the test runs will help you to determine if your join algorithm is correct. Here is an example of the result of doing a sort-merge join of the listed R and S on the string attribute:
R: 
---
2       bbb
2       eee
2       fff
13      sss
77      aaa
5       rrr
11      ccc
2       bbb

S: 
---
99      ooo
13      www
10      fff
12      rrr
8       xxx
10      sss
11      jjj
13      yyy
20      bbb
13      lll
66      ddd

R join S: 
----------
2       bbb     20
2       bbb     20
2       fff     10
5       rrr     12
13      sss     10

What to Turn In

Create a submit subdirectory of your working directory, and copy into it all the source files that I need to compile, run and test your solution (.C, .h, and Makefile...basically the contents of your src subdirectory). In your submit subdirectory add a README file with the following information:
  1. Your name and your partner's name
  2. The number of late days that you have used so far
  3. A brief description of any features or error handling that you have not implemented in your submitted solution. In addition, if you submit an incomplete solution, it would be a good idea to tell me how to run and test your code for the parts that you did complete (you may want to submit a copy of your own SMJTester.C file)
Create a tar file of your submit subdirectory and submit it via cs44handin before the due date.