Last week, you implemented a mechanism for quickly indexing and searching documents in the hopes of detecting plagiarized content. While functional, this implementation was limiting in many ways. For this lab, you will address two shortcomings:
In the next section we set up your lab directory, including transferring over your tree implementations (optional). Then, we introduce the BinaryHeap implementation. Lastly, we describe your main task (the cheatDetector.cpp program) and several specific tools you'll need to complete it.
As usual, you can get the initial files for this lab by cloning from GitHub. The new files are listed here, and the files that you will need to edit are highlighted in blue:
You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of the tree. When implementing functions, ask what your data looks like. If the current position has a child, what index is it at? How do I know whether an index is a leaf, or specifically whether it has a child? HINT: You don't need to actually look at the data in the nodes to figure this out.
Unlike previous labs, you will need to define the entire class from scratch. This includes:
The main data in the binary heap should be represented as an array of priority-value Pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are not pointers to Pair values, so you can't delete or allocate individual pairs for memory management. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class (for example, the way ArrayQueue manages the pairs when constructing traversal queues).
You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way that exercises each private function. Your testPQ.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or me.
Finally, note that you must implement a mechanism for expanding the capacity of the underlying array. You are required to use an amortized O(1) method. Specifically, you should start with a non-zero initial capacity (e.g., 10) and then multiply the capacity by a constant factor every time you run out of room (e.g., double it).
The purpose of the CompareEngine class is to encapsulate all the functionality of document indexing and comparison. In many ways its interface is similar to the main cheatDetector program from Lab 8. In fact, if you had a good design for Lab 8, translating between the two labs will require a minimal amount of effort.
Your main program has a reduced role. Specifically, it should:
We have provided a portion of the CompareEngine interface. Your job is to complete the CompareEngine, using your Lab 8 cheatDetector design as guidance. At a minimum, your CompareEngine must include:
A common problem in parsing natural language is accounting for common terms. The occurrence of commonly-used phrases in two documents is likely to lead to false positives (i.e., flagged cases of plagiarism that arose by chance). To account for this, we will use a basic algorithm to remove common phrases from documents (or, "prune our trees" if you will).
Your algorithm (implemented in removeCommonPhrases()) should accomplish the following:
In all, you will conduct roughly N^2/2 document comparisons, where N is the number of documents. Rather than output the maximum match for each document as in Lab 8, you will report the top N scores overall, across all pairs of documents. This means that some documents won't appear in the results at all, while others may appear multiple times. Your output should be in sorted order. So, for example, if you compare 1000 documents against each other, you will perform roughly 500,000 comparisons and report the top 1000 overall scores.
Note that you want to sort by the number of common phrases, from most to fewest. But we implemented a min priority queue, so what do we do? Well, what happens if you negate all the numeric values? A min priority queue prioritizes the most negative values. Thus, when you removeMin, you are getting the negative of the maximum value. Voila, you have simulated a max priority queue! You can use this trick to sort by the number of matches between essays, from most to fewest.
To create formatted output, you can use a C routine known as printf. Read this tutorial to learn about its usage in detail. Your result does not need to match ours exactly, but each field should line up nicely.
At a high level, it works like print formatting in Python. As an example, if I want to print someone's name and age I could do the following:
string name1 = "Joshua", name2 = "Andy";
int age1 = 32, age2 = 26;
printf("%-8s is %3d years old\n", name1.c_str(), age1);
printf("%-8s is %3d years old\n", name2.c_str(), age2);

This results in:

Joshua   is  32 years old
Andy     is  26 years old

%s is a template for strings, %d for integers. %8s means the string has 8 spaces to fill; if the string is shorter, white space is added to the left. %-8s is similar, with the white space being added to the right.
Unix has a built-in command called time that can let you know how long your program takes to run. To get a better feel for how using a LinkedBST and an AVLTree differ in terms of actual run-time, we can use this as a method for experimentally testing our options.
To use time, simply add the word time before the command you wish to run. For example:
time ./cheatDetector inputs/bigList.txt 3 0

You will see your normal output, and at the end three statistics will pop out:

real    17m41.461s
user    17m29.498s
sys     0m10.989s

Those provide the total wall time, followed by the division of that time into how much was used by the actual program versus system calls (like accessing the filesystem).
What run times do you get when using a LinkedBST vs. an AVLTree? You should see a significant speed-up. Report this in your README file. You should think about how this difference compares to our worst-case analysis. What factors mitigate the difference? What variables hide the true speed-up between the two types of trees?
(Note that you may have small deviations from this output; if the differences are small, they won't affect your grade.)
Take a look at the results on the big test. How many of those results seem worth investigating? One of the papers, catchmeifyoucan.txt, is a planted case of cheating: it copies a little from a lot of papers. We may not have noticed this in last week's results due to our strategy for reporting. What are the shortcomings of our approach? What are some improvements we could make? Think about how we could improve both finding common phrases and measuring overlaps. Note that your results may swap the ordering of documents within a pair, but you should ensure that you compare each pair of documents only once.
As usual, submit with git push.