This writeup is broken into two parts that will need to be completed. Here, I will describe your task to complete the main program portion of the assignment. Part B will describe the development of an AVL Tree data structure for creating a more-efficient index of documents.
Sample runs:
This week's lab will develop a plagiarism detection algorithm. Given a set of documents your algorithm will identify significant matches between pairs of documents that indicate shared use of the same material. Specifically, we will test your program on a set of essays submitted by high school students. You can peruse the data set in
/home/jpolitz/public/cs35/cheatDetector/data/Beware, the writing quality may not be up to Swarthmore's high standards.
Your main program will be implemented in cheatDetectorMain.cpp, with most of the work in helpers in cheatDetector.cpp. For each document, you will report the essay with largest number of hits (i.e., matching phrases). The design of your main program will be largely left up to you, but I have broken down the requirements into the following parts:
Make sure your design is easily modified to use either a LinkedBST or an AVLTree to represent a document. Also, do not implement the AVLTree until you have finished your cheatDetector. A good design will make it very easy to incorporate this second BST implementation. The user will specify which they want to use at run-time. If you implement main to the BST interface rather than the linkedBST implementation, this is pretty simple to accomplish.
First, clone your lab directory. There is a main program in the central file cheatDetectorMain.cpp. You'll also get test files in the input directory and Makefile that accounts for the AVLTree.
I have provided an implementation of linkedBST that you can use. It is referenced from the Makefile, and you should be able to copy over your BST tests and ensure that they work correctly against this implementation. You are also free to use your own implementation of linkedBST from the last lab. To do so, follow this script:
$ cd ~/<your-lab-7-dir> $ cp *.h testBST.cpp ~/<your-lab-8-dir>This will let you use your LinkedBST implementation this week.
TODO:
To avoid user-interaction during run-time, a common tool for obtaining input is by using command-line arguments. These are values given when executing from the command-line. To illustrate by example, our usual execution without command-line arguments:
$ ./cheatDetectorMainAn example run for this lab would instead be:
$ ./cheatDetectorMain inputs/smallList.txt 3 0
To process command-line arguments in C++ you need to add two arguments to your main function:
int main(int argc, char* argv[])argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. You can then use argc and argv inside your program to access the command-line arguments. For example, in the run above, argc has the value 4 meaning that argv is an array with four c-string values. argv[0] is always the program being run ("cheatDetector"), argv[1] is "inputs/smallList.txt", argv[2] is "3" and argv[3] is "0". argc is useful to check if the user entered the correct number of arguments. For example, if the user forgets the arguments you can print an error:
$ ./cheatDetectorMain Incorrect number of arguments Usage: cheatDetectorMain file-list phrase-size useAVL
We have provided the central set up already; the provided code converts
argv[1] to a string,
argv[2] to the phraseSize (using atoi which
converts c-strings to ints), and argv[3] to the boolean useAVL.
TODO:
One naive solution to this lab would be to compare all phrases in one document against all phrases in all other documents. That can be extremely costly (O(w^2) for w words in the corpus!). Instead, we will index each essay into a binary search tree to quickly query phrases. The next section will discuss how to load all documents. For now, assume you are attempting to load one document, e.g., data/tyc6.txt. To do so, first check that the file exists, exiting if it does not. This is a great example of something you can put in testCheatDetector.cpp to test individually.
Given the file exists, you will index the document by loading all phrases into a binary search tree. That is, your key will be an n-word phrase string (where n is specified on the command-line) and the value is the number of times that phrase occurs in the document. You should consider all overlapping phrases. For example, for two word phrases, the sentence: "Computer Science 35 rules" leads to the 3 following keys being inserted into the BST: "Computer Science","Science 35","35 rules" .
In addition to allowing phrases of different size, you should determine whether to use a LinkedBST or a AVLTree depending on the boolean value useAVL given as a command-line argument.
The provided command-line argument is a list of the locations of all essays. Each line specifies the location of one document. You can read the entire line in as one long string.
For each document, (1) check that the file exists, (2) load the document into its own tree (see above) and (3) store the BST into some meta-data structure. That last part is up to you to figure out. One way to accomplish this is...with another tree! For instance, you could use each file name as a key, and a pointer to the essay's phrase index tree as the value. This might have the type AVLTree< string, AVLTree< string,int >* > There are other reasonable options if this data type is scary.
For each essay i: For each other essay j: Count the number of matches between i and j If this is the most matches to i, save the result Output i and the j with maximum overlapYou'll also want to report the size of the overlap for the optimal hit. For counting matches, you'll want to take every phrase in essay i and determine if it is in essay j's tree. For every match, increase the total number of matches for that pair.
After your program has completed finding and outputting the largest overlap for each essay, you should output three statistics to give us an understanding of the efficiency difference between AVLTrees and LinkedBSTs.