CS35 Lab 7: Plagiarism Detector

Due by 11:59 p.m., Sunday, November 3, 2012

Over the next few weeks, you will implement a multi-part lab that will create a plagiarism detector. Given a set of documents (e.g., essays from a high school English class), your algorithm will identify significant matches between pairs of documents that indicate shared use of the same material. Some of the documents may be source documents (e.g., a Wikipedia entry) while others may be actual essays.

This week, you will concentrate on the following:

  1. Complete a templated implementation of a binary search tree, LinkedBST.
  2. Thoroughly test your implementation, creating a robust series of tests that will be applied to every other group's submission.

As usual, you can get the initial files for this lab by running update35. When you run update35 you will obtain a large number of files for this lab. The files that you will need to edit are highlighted below:

You may choose one partner for this lab. Both partners should be present and working on the code together. You will both be responsible for understanding all concepts, so dividing and conquering is not an option. The academic integrity policy applies to the entire pair; you cannot work or share code with anyone outside of your partner. Also, be sure to indicate who your partner is by including both names at the top of your files and by also selecting the 'p' option when using handin35 for the first time. Only one student in the pair should submit the lab.

Below, we introduce the two main components and then discuss some helpful tools in detail. In particular, the Pair class (which you do not need to modify), iterators, and using asserts.



Binary search tree implementation

Your main work for this lab is to complete and test the LinkedBST implementation. In linkedBST.h we have defined templated LinkedBST and BSTNode classes for you. The public methods have been implemented in linkedBST-inl.h and the helper methods are partially implemented in linkedBST-private-inl.h. Each incomplete function in linkedBST-private-inl.h is marked with a TODO comment.

Our LinkedBST itself contains just two pieces of data:

Each BSTNode contains:

As described in lecture, most of the public functions defined by the BST interface are simple calls to private recursive helper functions, which perform the required operation on some subtree of the tree.

One peculiarity of our implementation is that our helper update functions (insertInSubtree and removeFromSubtree) return a pointer to root of the subtree they modified. This pointer is usually the same node as the node passed as an argument to the function (current), but can be a new node (for insertInSubtree) or a different previously-existing node (for removeFromSubtree) if their modification changes the root node of the subtree. This peculiarity greatly simplifies the binary search tree implementation because it allows a node's parent to easily maintain a correct tree structure if the modification changes the parent's child node.

Each tree traversal function returns a pointer to an iterator of all key-value pairs in the tree, in the order specified by the tree traversal. For a LinkedBST<K,V> storing keys of type K and values of type V, each key-value pair has type Pair<K,V>. This means that a pointer to an iterator of key-value pairs has type Queue< Pair<K,V> >*, a templated type of a templated type! (This is sometimes called a nested template type.) Be warned that the compiler requires a space between the > for the inner template and the > for the outer template. If you forget the space you might get some error such as:

  testBST.cpp:42:  error: '>>' should be '> >' within a nested template 
  argument list
See the LinkedBST::getPostOrder function for an example of how to create our iterators for the tree traversals, and testBST.cpp for examples of how to use them.


Testing your implementation

Your test file(s) will be a larger portion of your grade this week. In particular, we will employ unit testing to thoroughly test our BST implementation. Your test should follow these conventions:


The Pair class

We have implemented a templated Pair class to store a pair of arbitrary-type data. You do not need to modify our implementation, but you will need to understand Pairs enough to return an iterator of key-value pairs for your tree traversals.

A Pair is essentially just a container to publicly store two pieces of data:

  template <typename F, typename S>
  class Pair {
    public:
      F first;
      S second;
      Pair(F f, S s);
  };

Because the data is publicly stored within the Pair, the data can be accessed and changed directly from any other scope of the program. For example:

  int main() {
    Pair<string,int> p("Hello", 42);
    cout << p.first << ", " << p.second << endl;
    p.first = "Bye!";
    cout << p.first << endl;
    return 0;
  }

The Pair class is useful for dictionary data structures, such as the LinkedBST, where we store a key-value pair for each piece of data.



Iterators

A common programming task is to execute some code for each item in a data structure. An iterator is an object that provides a consistent interface for traversing all data (essentially, providing a linear interpretation of a non-linear data structure).

In this program, you will use Queues as an iterator; you will enqueue key-value pairs using an appropriate traversal algorithm (e.g., in-order, pre-order, level-order, post-order). The idea is that the dequeue() function returns each item from the original tree once, and the isEmpty() function returns false until each item in the container has been returned with dequeue(). In normal implementations, a special Iterator class is used to perform these operations, but essentially it is a Queue. Iterators are widely used for many different types of data structures (such as hash tables and graphs, as we will see later).



Unit-testing with assert

assert(bool condition) is a C function that takes a boolean condition as an argument and causes your program to immediately crash if that boolean condition is false (assert does nothing if that condition is true). This is useful because it allows you to test parts of your program as you write, checking that the program values are what you expect when your program runs. To use assert you must #include <assert.h>. We used assert in previous labs, but your tests will solely rely on this function for this week.



Modifying linkedBST.h

You should not need to modify the declaration for LinkedBST. The allowed exceptions are only to create additional private methods

Submitting your work

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.