Last week, you implemented a mechanism for quickly indexing and searching documents in the hopes of detecting plagiarized content. While functional, this implementation was limiting in many ways. For this lab, you will address two shortcomings:
In the next section we set up your lab directory, including transferring over your tree implementations (optional). Then, we introduce the BinaryHeap implementation. Lastly, we describe your main task (the cheatDetector.cpp program) and several specific tools you'll need to complete it.
As usual, you can get the initial files for this lab by cloning from GitHub. The new files are listed here, and the files that you will need to edit are highlighted in blue:
You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of the tree. When implementing functions, ask what your data looks like. If the current position has a child, what index is it at? How do I know whether an index is a leaf, or specifically whether it has a child? HINT: You don't need to actually look at the data in the nodes to figure this out.
Unlike previous labs, you will need to define the entire class from scratch. This includes:
The main data in the binary heap should be represented as an array of priority-value Pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are not pointers to Pair values, so you can't delete or allocate individual pairs for memory management. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class (for example, the way ArrayQueue manages the pairs when constructing traversal queues).
You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way that exercises each private function. Your testPQ.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or me.
Finally, note that you must implement a mechanism for expanding the capacity of the underlying array. You are required to use an amortized O(1) method. Specifically, you should start with a non-zero initial capacity (e.g., 10) and then multiply the capacity by a constant factor every time you run out of room (e.g., double it).
The purpose of the CompareEngine class is to encapsulate all the functionality of document indexing and comparison. In many ways its interface is similar to the main cheatDetector program from Lab 8. In fact, if you had a good design for Lab 8, translating between the two labs will require a minimal amount of effort.
Your main program has a reduced role. Specifically, it should:
We have provided a portion of the CompareEngine interface. Your job is to complete the CompareEngine, using your Lab 8 cheatDetector design as guidance. At a minimum, your CompareEngine must include:
A common problem in parsing natural language is accounting for common terms. The occurrence of commonly-used phrases in two documents is likely to lead to false positives (i.e., flagged cases of plagiarism that arose by chance). To account for this, we will use a basic algorithm to remove common phrases from documents (or, "prune our trees" if you will).
Your algorithm (implemented in removeCommonPhrases()) should accomplish the following:
In all, you will conduct roughly N^2/2 document comparisons, where N is the number of documents. Rather than output the maximum match for each document as in Lab 8, you will report the top N scores overall, across all pairs of documents. This means that some documents won't appear in the results at all, while others may appear multiple times. Your output should be in sorted order. So, for example, if you compare 1000 documents against each other, you will perform roughly 500,000 comparisons and report the top 1000 overall scores.
Note that you want to sort by the number of common phrases, from most to fewest. But we implemented a min priority queue, so what do we do? Well, what happens if you negate all the numeric values? A min priority queue prioritizes the most negative values. Thus, when you removeMin, you are getting the negative of the maximum value. Voila, you have simulated a max priority queue! You can use this trick to sort by the number of matches between essays, from most to fewest.
To create formatted output, you can use a C routine known as printf. Read this tutorial to learn about its usage in detail. Your result does not need to match ours exactly, but each field should line up nicely.
At a high level, it works like print formatting in Python. As an example, if I want to print someone's name and age I could do the following:
string name1 = "Joshua", name2 = "Andy";
int age1 = 32, age2 = 26;
printf("%-8s is %3d years old\n", name1.c_str(), age1);
printf("%-8s is %3d years old\n", name2.c_str(), age2);

This results in:

Joshua   is  32 years old
Andy     is  26 years old

%s is a template for strings, %d for integers. %8s means the string has 8 spaces to fill; if the string is shorter, white space is added to the left. %-8s is similar, with the white space being added to the right.
Unix has a built-in command called time that can let you know how long your program takes to run. To get a better feel for how using a LinkedBST and an AVLTree differ in terms of actual run-time, we can use this as a method for experimentally testing our options.
To use time, simply add the word time before the command you wish to run. For example:
time ./cheatDetector inputs/bigList.txt 3 0

You will see your normal output, and at the end three statistics will pop out:

real    17m41.461s
user    17m29.498s
sys     0m10.989s

Those provide the total wall time, followed by the division of that time into how much was used by the actual program versus system calls (like accessing the filesystem).
What run times do you get when using a LinkedBST vs. an AVLTree? You should see a significant speed-up. Report this in your README file. You should think about how this difference compares to our worst-case analysis. What factors mitigate the difference? What variables hide the true speed-up between the two types of trees?
(Note that you may have small deviations from this output; if the differences are small, they won't affect your grade.)
Take a look at the results on the big test. How many of those results seem worth investigating? One of the papers, catchmeifyoucan.txt, is a planted case of cheating: it copies a little from a lot of papers. We may not have noticed this in last week's results due to our strategy for reporting. What are the shortcomings of our approach? What are some improvements we could make? Think about how we could improve both finding common phrases and measuring overlaps. Note that your results may swap the ordering of documents within a pair, but you should ensure that you compare each pair of documents only once.
As usual, submit with git push.