CS35 Lab 9: Prioritizing Plagiarism Cases

Due by 11:59 p.m., Sunday November 15

Last week, you implemented a mechanism for quickly indexing and searching documents in the hopes of detecting plagiarized content. While functional, this implementation was limited in many ways. For this lab, you will address two shortcomings:

You will solve both of these problems using a priority queue. Your work for this lab consists of two parts:
  1. Complete and test a templated implementation of the priority queue ADT, BinaryHeap.
  2. Extend your work from Lab 08 to support a more sophisticated and useful plagiarism detection algorithm using your priority queue.

In the next section we set up your lab directory, including transferring over your tree implementations (optional). Then, we introduce the BinaryHeap implementation. Lastly, we describe your main task (the cheatDetector.cpp program) and several specific tools you'll need to complete it.

Getting Started

As usual, you can get the initial files for this lab by cloning from Github. The new files are listed here, and the files that you will need to edit are highlighted in blue:

Implementing a BinaryHeap

You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of the tree. When implementing functions, ask what your data looks like: if the current position has a child, what index is it at? How do I know if an index is a leaf, or whether it has a left and/or right child? HINT: You don't need to actually look at the data in the nodes to figure this out.
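Assuming the 0-indexed, level-order array layout from lecture (an assumption; if your class used 1-indexing, shift the formulas accordingly), the index arithmetic can be sketched as:

```cpp
#include <cassert>

// 0-indexed, level-order array layout (assumed; adjust for 1-indexing):
//   parent of i:  (i - 1) / 2
//   left child:   2*i + 1
//   right child:  2*i + 2
// An index is a leaf exactly when its left child index falls past the
// end of the array -- no need to inspect the data in the nodes.
int parentIndex(int i) { return (i - 1) / 2; }
int leftChild(int i) { return 2 * i + 1; }
int rightChild(int i) { return 2 * i + 2; }
bool isLeaf(int i, int size) { return leftChild(i) >= size; }
```

Note that the leaf test uses only indices and the current size, which is exactly the hint above.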

Unlike previous labs, you will need to define the entire class from scratch. This includes:

We have given you a binaryHeap.h file and a binaryHeap-inl.h file to implement your solutions. Feel free to add a binaryHeap-private-inl.h to further subdivide the problem.

The main data in the binary heap should be represented as an array of priority-value Pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are Pair values, not pointers to Pairs, so you do not allocate or delete individual pairs for memory management. The BinaryHeap is, however, internally responsible for the memory management of the array itself, much like the data array in the ArrayList class (for example, the way ArrayQueue manages the pairs when constructing traversal queues).

You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in ways that exercise each private function. Your testPQ.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or me.

Finally, note that you must implement a mechanism for expanding the capacity of the underlying array. You are required to use an amortized O(1) method. Specifically, you should have a non-zero initial capacity (e.g., 10) and then multiply the capacity every time you run out of room (e.g., double).
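The doubling strategy might be sketched like this (the class and method names here are illustrative, not the required interface). Doubling on overflow means n insertions trigger at most log2(n) resizes and fewer than 2n total element copies, which is what makes insertion amortized O(1):

```cpp
#include <cassert>
#include <cstddef>

// Minimal sketch of geometric capacity growth. In your BinaryHeap the
// array would hold Pair<P,V> values; a plain T is used here to keep
// the sketch self-contained. (Copy control is omitted for brevity.)
template <typename T>
class GrowableArray {
public:
    GrowableArray() : size(0), capacity(10), data(new T[10]) {}
    ~GrowableArray() { delete[] data; }
    void push(const T& item) {
        if (size == capacity) { expand(); }
        data[size++] = item;
    }
    std::size_t getSize() const { return size; }
    std::size_t getCapacity() const { return capacity; }
private:
    void expand() {
        capacity *= 2;                  // multiply; never add a constant
        T* bigger = new T[capacity];
        for (std::size_t i = 0; i < size; ++i) { bigger[i] = data[i]; }
        delete[] data;                  // the old array is our responsibility
        data = bigger;
    }
    std::size_t size;
    std::size_t capacity;
    T* data;
};
```

Adding a constant amount of capacity instead of multiplying would make the total copying cost quadratic, so it would not meet the amortized O(1) requirement.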

The CompareEngine and your main cheatDetector program

The purpose of the CompareEngine class is to encapsulate all the functionality of document indexing and comparison. In many ways its interface is similar to the main cheatDetector program from Lab 8. In fact, if you had a good design for Lab 8, translating between the two labs will require a minimal amount of effort.

Main program

Your main program has a reduced role. Specifically, it should:

  1. Parse the 3 command-line arguments
  2. Create an instance of a CompareEngine
  3. Load all documents into the engine
  4. Remove common phrases from those documents
  5. Compare all documents against every other document, storing the match scores within the CompareEngine. Be sure to only compare each pair of documents exactly once.
  6. Report results
  7. Output statistics
Most of those are calls to public methods of the CompareEngine. As a rough idea, our solution has a main program that is just 25 lines long and does not have any functions outside of main.
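The pairing requirement in step 5 comes down to loop bounds: starting the inner loop at i + 1 visits each unordered pair exactly once. The helper below just counts iterations to show the N*(N-1)/2 total; the CompareEngine methods you would actually call inside the loop are up to your design.

```cpp
#include <cassert>

// Counts the iterations of the standard "each pair exactly once" loop.
// In your main program, the body would instead compare documents i and
// j and store the match score in the CompareEngine.
int countPairwiseComparisons(int numDocs) {
    int comparisons = 0;
    for (int i = 0; i < numDocs; ++i) {
        for (int j = i + 1; j < numDocs; ++j) {
            ++comparisons;   // e.g., compare document i against document j
        }
    }
    return comparisons;
}
```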

CompareEngine Requirements

We have provided a portion of the CompareEngine interface. Your job is to complete the CompareEngine, using your Lab 08 cheatDetector design as guidance. At a minimum, your CompareEngine must include:

Removing common phrases

A common problem in parsing natural language is accounting for common terms. The occurrence of commonly-used phrases in two documents is likely to lead to false-positives (i.e., flagged cases of plagiarism that arose by chance). To account for this, we will use a basic algorithm to remove common phrases from documents (or, "prune our trees" if you will).

Your algorithm (implemented in removeCommonPhrases()) should accomplish the following:

Play around with what threshold makes sense. In practice, we would want this parameter to remove common phrases while leaving all unique phrases. For submission, you can simply set this to 0.10 in your call from main (e.g., remove phrases that occur in 10% of documents).

Sorted result reporting

In all, you will conduct roughly N^2/2 document comparisons when you have N documents. Rather than output the maximum match for each document as in Lab 8, you will report the top N scores overall, across all pairs of documents. This means that some documents won't appear in the results at all, while others may appear multiple times. Your output should be in sorted order. So, for example, if you compare 1000 documents against 1000 documents, you will perform roughly 500,000 comparisons and report the top 1000 overall scores.
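One common way to keep only the top N scores (sketched here with std::priority_queue for brevity; in the lab you would use your own BinaryHeap) is to maintain a min-heap holding at most N entries, evicting the smallest whenever a larger score arrives:

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Returns the n largest scores in increasing order. A min-heap capped
// at size n always holds the n largest values seen so far: whenever it
// overflows, the smallest element is the one discarded.
std::vector<int> topN(const std::vector<int>& scores, std::size_t n) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> pq;
    for (int s : scores) {
        pq.push(s);
        if (pq.size() > n) { pq.pop(); }   // drop the current minimum
    }
    std::vector<int> result;
    while (!pq.empty()) { result.push_back(pq.top()); pq.pop(); }
    return result;
}
```

In the lab, each entry would be a priority-value pair of a match score and the two document names, so that popping everything out yields your sorted report.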

Tips and Sample Output

Using a min priority queue

Note that you want to sort by the number of common phrases, from most to fewest. But we implemented a min priority queue, so what do we do? Well, what happens if you negate all the numeric values? A min priority queue prioritizes the most negative values, so when you call removeMin, you get back the negation of the maximum value. Voila, you have simulated a max priority queue! You can use this trick to sort pairs of essays by their number of matches.
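The negation trick can be demonstrated with a standard-library min-heap standing in for your BinaryHeap's removeMin (the function name here is illustrative):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Simulates a max priority queue with a min priority queue: negate each
// score on insertion, then negate again on removal. The most negative
// stored value corresponds to the largest original score.
std::vector<int> sortDescending(const std::vector<int>& scores) {
    std::priority_queue<int, std::vector<int>, std::greater<int>> minPQ;
    for (int s : scores) { minPQ.push(-s); }   // negate on the way in
    std::vector<int> sorted;
    while (!minPQ.empty()) {
        sorted.push_back(-minPQ.top());        // negate on the way out
        minPQ.pop();
    }
    return sorted;
}
```

In your program, you would store negated match scores as the priorities in your BinaryHeap and negate again when reporting.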

Using printf

To create formatted output, you can use a C routine known as printf. Read this tutorial to learn about its usage in detail. Your result does not need to match ours exactly, but each field should line up nicely.

At a high level, it works like print formatting in Python. As an example, if I want to print someone's name and age I could do the following:

string name1 = "Joshua", name2 = "Andy";
int age1 = 32, age2 = 26;
printf("%-8s is %3d years old\n", name1.c_str(), age1);
printf("%-8s is %3d years old\n", name2.c_str(), age2);
This results in:
Joshua   is  32 years old
Andy     is  26 years old
%s is a placeholder for strings, %d for integers. %8s sets a minimum field width of 8 characters: a shorter string is padded with white space on the left (right-justified). %-8s is similar, but left-justified, with the white space added on the right. Note also that printf does not understand C++ strings, which is why the examples pass name1.c_str() rather than name1.

Timing your program

Unix has a built-in command called time that reports how long your program takes to run. We can use it as a method for experimentally testing our options and getting a better feel for how a LinkedBST and an AVLTree differ in terms of actual run-time.

To use time, simply add the word time before the command you wish to run. For example:

time ./cheatDetector inputs/bigList.txt 3 0
You will see your normal output, and at the end three statistics will be printed:
real  17m41.461s
user  17m29.498s
sys 0m10.989s
Those provide the total wall-clock time (real), followed by the division of that time into CPU time spent in your program's own code (user) versus in system calls made on its behalf, like accessing the filesystem (sys).

What run times do you get when using a LinkedBST vs AVLTree? You should see a significant speed-up. Report this in your README file. You should think about how this difference compares to our worst case analysis. What are factors that mitigate the difference? What variables hide the true speed up we see between using the two types of trees?

Sample Output

(Note that you may have small deviations from this output; if the differences are small, they won't affect your grade.)

Take a look at the results on the big test. How many of those results seem worth investigating? One of the papers, catchmeifyoucan.txt, is a planted case of cheating: it appears to have copied a little from a lot of papers. We may not have noticed this in last week's results due to our reporting strategy. What are the shortcomings of our approach? What are some improvements we could make? Think both about how we could improve finding common phrases as well as how we measure overlaps. Note that your results may swap the ordering of documents within a pair, but you should ensure that you only compare each pair of documents once.

Submitting your work

As usual, submit with git push.