CS35 Lab 8: Plagiarism Detector, AVL Trees

Due by 11:59 p.m., Sunday, November 9, 2014

This lab has two main parts: you will write a plagiarism detector, and you'll implement AVLTrees. Your overall tasks are:

  1. Implement and test a plagiarism detector that uses BSTs.
  2. Create and test an AVLTree implementation.
  3. Modify your plagiarism detector to handle both linkedBSTs and AVLTrees.
  4. Answer a few short questions in your README file.
  5. Optionally complete the extra credit portion of the lab (see below.)
Since the work involved in the first two are largely separate, we've divided this writeup accordingly. Don't worry---implementing the AVLTree requires little code besides what you've already done for lab 07, and if you design your plagiarism detector well, modifying it to handle AVLTrees should require very little extra work.


Sample Runs

Sample runs:


Plagiarism Detector Introduction

This week's lab will further the development of a plagiarism detection algorithm. Given a set of documents your algorithm will identify significant matches between pairs of documents that indicate shared use of the same material. Specifically, we will test your program on a set of essays submitted by high school students. You can peruse the data set here in

/home/brody/public/cs35/cheatDetector/data/
Beware, the writing quality may not be up to Swarthmore's high standards.

Your main program will be implemented in cheatDetector.cpp. For each document you will report the essay with largest number of hits (i.e., matching phrases). The design of your main program will be largely left up to you, but we have broken down the requirements into the following parts:

  1. Starting files: your starting files are a mixture of reusing lab 7 material and obtaining new files.
  2. Input: Your program will begin by obtaining inputs. However, you will not perform any user input (i.e., cin). Instead, you will use command-line arguments, discussed below.
  3. Load one document: your program will create a searchable version of one essay by representing all of its phrases in a binary search tree.
  4. Storing all documents: once you have determined how to load one document, you will need to determine how to keep track of all documents in a data structure.
  5. Detect top matches: for each essay, you will compare against all other essays by utilizing your binary search tree representation. Each essay will report the highest match score.
  6. Output statistics: in part B you will implement an AVL Tree. To compare the performance of a balanced vs unbalanced binary search tree, you will report the heights of your trees.

Make sure your design is easily modified to use either a LinkedBST or an AVLTree to represent a document. Also, do not implement the AVLTree until you have finished your cheatDetector. While you have not implemented the AVLTree, a good design will make it very easy to incorporate this second BST implementation. The user will specify which they want to use at run-time. If you understand polymorphism, this is pretty simple to accomplish.


Getting Started with the Plagiarism Detector

TODO:

First, run update35 to create your labs/08 directory and obtain the central file cheatDetector.cpp. Note: when you run update35, you might find an error message like

$ chmod: changing permissions of '/home/jbas/cs35//labs/08/data': Operation not permitted
You can safely ignore this. You'll also get test files in the input directory and a Makefile that accounts for the AVLTree. You'll also want to copy over the files from Lab 07 for use this week (except the Makefile and the README file). To do so, follow this script:
$ cd ~/cs35/labs/08/
$ cp -r ../07/* .
cp: overwrite `./Makefile'? no
cp: overwrite `./README'? no
This will let you use your LinkedBST implementation this week. Note that you should be sure to fix any errors in that implementation otherwise your main program may not work properly. Once you start adding in elements for the AVLTree you'll have to modify the Makefile; you'll find instructions in the AVLTree portion of this writeup.


Input via the Command Line

TODO:

To avoid user-interaction during run-time, a common too for obtaining input is by using command-line arguments. These are values given when executing from the command-line. To illustrate by example, our usual execution without command-line arguments:

$ ./cheatDetector
An example run for this lab would instead be:
$ ./cheatDetector inputs/smallList.txt 3 0

To process command-line arguments in C++ you need to add two arguments to your main function:

  int main(int argc, char* argv[]) 
argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. You can then use argc and argv inside your program to access the command-line arguments. For example, in the run above, argc has the value 4 meaning that argv is an array with four c-string values. argv[0] is always the program being run ("cheatDetector"), argv[1] is "inputs/smallList.txt", argv[2] is "3" and argv[3] is "0". argc is useful to check if the user entered the correct number of arguments. For example, if the user forgets the arguments you can print an error:
$ ./cheatDetector 
Incorrect number of arguments
Usage: cheatDetector file-list phrase-size useAVL

We have provided the central set up already; the provided code converts argv[2] to the phraseSize (using atoi which converts c-strings to ints) and argv[3] to the boolean useAVL. You are responsible for obtaining the name of the file listing essays. Remember, argv is a c-string array, so if you want to store the file name as a string, use the string constructor. For example, if we wanted to store the third parameter as a string, we'd use string useAVL(argv[3]);


Storing One Document as a BST

TODO:

One naive solution to this lab would be to compare all phrases in one document against all phrases in all other documents. That can be extremely costly (O(w^2) for w words in the corpus!). Instead, we will index each essay into a binary search tree to quickly query phrases. The next section will discuss how to load all documents. For now, assume you are attempting to load one document, e.g., data/tyc6.txt. To do so, first check that the file exists, exiting if it does not.

Given the file exists, you will index the document by loading all phrases into a binary search tree. That is, your key will be an n-word phrase string (where n is specified on the command-line) and the value is the number of times that phrase occurs in the document. You should overlap all overlapping phrases. For example, for two word phrases, the sentence: "Computer Science 35 rules" leads to the 3 following keys being inserted into the BST: "Computer Science","Science 35","35 rules" .

In addition to allowing phrases of different size, you should determine whether to use a LinkedBST or a AVLTree depending on the boolean value useAVL given as a command-line argument.


Storing All Documents

The provided command-line argument is a list of the locations of all essays. Each line specifies the location of one document. You can read the entire line in as one long string.

For each document, (1) check that the file exists, (2) load the document into its own tree (see above) and (3) store the BST into some meta-data structure. That last part is up to you to figure out. One way to accomplish this is...with another tree! For instance, you could use each file name as a key, and a pointer to the essay's phrase index tree as the value. This might have the type AVLTree< string, AVLTree< string,int >* >. (Of course, it would be nice if this handled any kind of BST. What would that type look like?) There are other reasonable options if this data type is scary.


Detecting Matches

At this point, your well-designed program has produced a set of search-trees, each indexing all of the phrases within one essay. Now, we want to detect pairs of essays with significant overlap in content. To do this, we make the following assumption: essays guilty of plagiarism will have significant overlap in the phrases used. Your program will use this assumption to detect potential cheaters. You will use the following pseudocode:
For each essay i:
  For each other essay j:
    Count the number of matches between i and j
    If this is the most matches to i, save the result
  Output i and the j with maximum overlap
You'll also want to report the size of the overlap for the optimal hit.

For counting matches, you'll want to take every phrase in essay i and determine if it is in essay j's tree. For every match, increase the total number of matches for that pair.


Outputting Statistics

After your program has completed finding and outputting the largest overlap for each essay, you should output three statistics to give us an understanding of the efficiency difference between AVLTrees and LinkedBSTs.


AVLTree Introduction
In the second part of the lab, you should provide an implementation of the AVLTrees we've been discussing in class. While implementing AVLTree sounds lengthy, 95% was already done in last week's lab. All non-insert and remove methods work the same when comparing a LinkedBST to AVLTree, and we've seen pseudocode for most of the other functions. Below, I will detail the modifications needed to maintain the balanced AVL property. At a high level, you will need to accomplish the following:
  1. Initial setup - get and create files to for your AVLTree implementation
  2. Modify insert/remove - modify insertInSubtree and removeFromSubtree to balance
  3. Implement balancing and rotations - store node heights, detect imbalances, and implement rotations
  4. Test your AVL Tree - add tests for balancing to testBST

Once you've completed your AVL Tree and tested it, you will extend your cheatDetector implementation by adding an AVLTree option to the main program.

The AVLTree is very similar to the LinkedBST from Lab 07, except it maintains the AVL Tree Property to guarantee that the tree is balanced. (The AVL Tree Property is that, for each node in the tree, the height of that node's left and right subtrees differs by at most one.)

To efficiently maintain the AVL property, each AVLTreeNode stores its current height in addition to the other data stored by a LinkedBSTNode. In sum, each AVLTreeNode contains:

Our algorithms define an empty child node to have a height of -1 for simplicity, but an empty child node is not an actual node. It is the pointer NULL. You will need to be careful to check that children nodes are not NULL before attempting to access information such as a child node's height or children.

To maintain a roughly O(lg n) tree height, you will extend your LinkedBST implementation to detect imbalances after insertions and removes and then use one of four tree rotation methods to fix the imbalance.


Getting Started on the AVLTree

TODO:

You should have run update35 already. This should add the following files:

Neither of these files needs modifications. Instead, all of you work will be done in AVLTree-private-inl.h. Since the vast majority of methods will be the same as your LinkedBST private methods, you'll use your linkedBST-private-inl.h file as a starting point. To do so, do the following:
  1. Copy your linkedBST-private-inl.h file:
    $ cp linkedBST-private-inl.h AVLTree-private-inl.h
    
  2. Replace all instances of BSTNode with AVLTreeNode and LinkedBST with AVLTree. This is done using search and replace. Each editor has a built in way of doing this, but we can do it easily with a linux program called sed:
    $ sed -i 's/BSTNode/AVLTreeNode/g' AVLTree-private-inl.h 
    $ sed -i 's/LinkedBST/AVLTree/g' AVLTree-private-inl.h 
    

Since you are using your LinkedBST as your starting point, you are responsible for ensuring you correct all errors from Lab 7. Any lingering errors from Lab 7 will lead to further deductions in Lab 8.

Lastly, open up your Makefile and remove the comment # on line 8 so that the AVLTree files will be included in compilation.


Inserts/Removes in AVLTree

TODO:

To insert in an element an AVL Tree, we follow the normal protocol. As we saw in class, the insertInSubtree method should handle the 4 cases (i.e., base case, duplicate key, smaller key, larger key) for recursion the same as LinkedBST. Once that work is complete, current is the root of a sub-tree that may now be unbalanced and require a rotation. This check must be done at every call since the imbalance could be detected at any step back up the tree. Your overall method should follow this protocol:

  1. Handle the 4 cases the same as LinkedBST
  2. Before returning:
    1. Calculate height of node
    2. Check for imbalances
    3. Fix imbalances with rotations
  3. Step 2 may have resulted in a new root for the sub-tree. Return the root of the sub-tree to the calling function
Implementing this should take 1 to 2 lines of code at most. You completed #1 already from Lab 7. #2 should be handled by another function (balance()) specified below. #3 should take the value returned by balance and return it.

removeFromSubtree follows the same outline. Before every return statement, you should call balance to handle all of the work of updating heights and checking for imbalances.


Balancing and rotations

TODO:

To complete the AVLTree, you should complete each of the rotation functions discussed in class. Their prototypes can be found in AVLTree.h file. There is one rotation function for each of the four cases discussed in class, and each returns the new root of the sub-tree to the calling function:

Each rotation should do the following:

These functions are called by the balance function (which is in turn called by insertInSubtree and removeFromSubtree). balance does much of the book-keeping to maintain the AVL property. To complete this method, you should take the following steps:

  1. If current is NULL, return current. There is no work to do.
  2. Update current's height. You should implement this routine in computeHeightFromChildren. Remember to account for potential empty children (NULL).
  3. Check for a left imbalance (left child is 2 taller than right)
    • If so, check which of left's children is taller. Call either rightRotate (left grandchild is bigger) or leftRightRotate (right grandchild is bigger)
  4. Check for a right imbalance (right child is 2 taller than left)
    • If so, check which of right's children is taller. Call either leftRotate (right grandchild is bigger) or rightLeftRotate (left grandchild is bigger)
  5. Return the root of the sub-tree. Note that it may have changed.


Unit-testing the AVLTree

You should test the AVLTree implementation using the testBST program from last week. This should stress test the AVLTree implementation for a large number of random insertions and deletions. In addition, your tests should confirm that each of your rotation functions behaves as expected for known examples of the tree. While you cannot access private methods, you can come up with specific examples that guarantee certain rotations will occur and then use public methods such getHeight() or getPreOrder() to inspect the contents of the tree.

Note: Ideally, your new test functions shouldn't need to mention AVLTrees at all beyond creating the AVLTree object --- you should be able to run your tests on linkedBSTs as well as AVLTree. We are not making this a requirement for this lab however. In particular, your tests are free to assume they're working with an AVLTree instead of, say, a BST*. This will let you use isBalanced. You're also free to design your tests to specifically test your implementation of an AVLTree; you are allowed (and encouraged) to make tests that ensure rotations happened exactly when you expect them to and in exactly the way you expect them to.

Extending cheatDetector

Now, return to the first part of this lab, and ensure that your cheatDetector can work with either LinkedBSTs or AVLTrees. You should allow the user of your program to choose which BST implementation to use via command-line arguments. If you have designed your cheatDetector well, this should take very few (less than five) lines of code. Once that is complete, run your program with both and pay attention to the performance. How do the heights of trees change by using balancing? You will need to answer a few questions for the README.



README questions

README

Answer the following questions the README file in your directory:
  1. Your destructor is implemented using a helper function traverseAndDelete. What type of traversal algorithm is this method using? Why did we choose this type of traversal as opposed to any of the others?

  2. How do the heights of trees change by using balancing? How did this affect the running time of your cheatDetector?

  3. For the following applications,which of the 4 traversal algorithms in class would best provide a solution (i.e., either Level, Pre-Order, Post-Order, or In-Order). You do not need to explain.


  4. Given the hierarchical outline for a book (in tree format), print the Table of Contents. For example, the root of the tree would be the title, all Level 1 nodes are the chapters, all Level 2 nodes are the sections of the respective chapters. A Table of Contents prints a chapter followed by all of its sections before going to the next chapter and doing the same.

  5. For each directory on a disk, calculate the amount of disk usage. For, example, running the du command on your labs directory produces:
    $ du -h cs35/labs
    6.0M	cs35/labs/01/input
    6.1M	cs35/labs/01
    20K	cs35/labs/04/input
    16K	cs35/labs/04/dict
    76K	cs35/labs/04
    36K	cs35/labs/02/input
    28K	cs35/labs/02/settings
    108K	cs35/labs/02
    4.0K	cs35/labs/00
    32K	cs35/labs/03
    48K	cs35/labs/05/06/quickSort
    56K	cs35/labs/05/06
    16K	cs35/labs/05/inputs
    140K	cs35/labs/05
    28K	cs35/labs/07/library
    76K	cs35/labs/07
    6.5M	cs35/labs
    
  6. Any mathematical formula can be set up as a parse tree, where the internal nodes are operators (e.g., *,+,-,/) and the leaves are the operands. For example, the formula (3*5+1)*2 is the tree:
            *
          /   \
         +      2
        / \
       *   1
      / \
     3   5
    
    What traversal would produce the correct order of operations?


Optional Extra Credit

At this point, you have seen two different BST implementations. Your task for the extra credit is to design your own BST implementation that optimizes the overall runtime of your cheatDetector program. Can you create a balanced BST that has lower average height than your AVLTree? How does the actual running time compare? Can you create a balanced BST implementation with higher average height, but lower overall running time?

Some things to keep in mind:

The metrics for comparison are intentionally vague. At a minimum, you have a nontrivial program (cheatDetector) that uses BSTs. cheatDetector runs for long enough that it's worth trying to optimize runtime. One way to try to do that might be to pay close attention to the average height of the tree, but there are other ways.

When you hand in your lab, you should also add a section to the README file stating that you're doing the extra credit portion. You should also have a brief (1-2 paragraphs) description of how your BST implementation differs from the two we've seen in class.

This kind of extra credit is not directly proportional to points earned in labs. I might use these extra credit points to give students points back for a missed lab, or perhaps for a small bump in your participation grade. The primary purpose is to give a challenge to students who are looking to go beyond what is normally required in a lab. Do not even begin to attempt the extra credit until you have completed the rest of this lab. The gain you get in terms of your final grade is not proportional to what you get by putting effort into the main part of the lab.

If enough students request it, we will try to include more optional extra credit items on future labs.


Submitting your lab
Submit using handin35. Remember to indicate your parter using the 'p' option.