CS35 Lab 8: Analyzing web content with balanced binary search trees

Due by 11:59 p.m., Wednesday, November 7, 2012 at 11:59pm

Your work for this lab consists of two parts:

  1. Complete and test a templated implementation of an AVL tree, AVLTree.
  2. Use the AVLTree to build an index of user web pages hosted by www.cs.swarthmore.edu. Given a search query, use your AVLTree-based index to find web pages that match the search query.

As usual, you can get the initial files for this lab by running update35. When you run update35 you will obtain a large number of files for this lab, but you need to edit only four of them to complete this lab. The files that you will need to edit are highlighted below. Many of the files from previous labs are in the library subdirectory and are not listed here:

In the next section we introduce our AVLTree implementation. We then describe your main task (the search.cpp program) and several specific tools you'll need to complete it.



A balanced binary search tree

You should start by completing and testing the AVLTree implementation. The AVLTree is very similar to the LinkedBST from Lab 07, except it maintains the AVL balance property to guarantee that the tree is balanced. (The AVL balance property is that, for each node in the tree, the height of that node's left and right subtrees differs by at most one.)

To efficiently maintain the AVL property, each AVLTreeNode stores its current height in addition to the other data stored by a LinkedBSTNode. In sum, each AVLTreeNode contains:

Our algorithms define an empty child node to have a height of -1 for simplicity, but an empty child node is not an actual node. It is the pointer NULL, so we have been very careful to avoid segmentation faults in our code. This makes the code a little messier than the discussion in class, but you should be able to follow the methods. You will need to be careful to check that children nodes are not NULL before attempting to access information such as a child node's height or children.

To complete the AVLTree, you should complete each of the rotation functions marked TODO in AVLTree.inl. There is one rotation function for each of the four cases discussed in class:

These functions are called by the balance function (which is in turn called by insertInSubtree and removeFromSubtree), which we have implemented for you. You should compare these methods to the LinkedBST implementation from last week and notice that there is very little change in their definitions. You should also study the balance method to understand how to detect imbalances when working back up the tree.



Unit-testing the AVLTree

You should test the AVLTree implementation using the testAVL program. Ideally, your tests should confirm that each of your rotation functions behaves as expected for known examples of the tree, and also stress test the AVLTree implementation for a large number of random insertions and deletions.

You should write whatever tests are needed to evaluate your work, but you do not need to test other AVLTree functions that we provided for you.



Analyzing web content

Your main work for this lab is to write a search program, inside search.cpp that analyzes www.cs.swarthmore.edu web pages and allows a user to find pages that match a search query. This program should take two command-line arguments:

Your program should do the following:

  1. Open and read the ignores-file, saving each word in the file into a tree containing all words to ignore when searching web content.
  2. Open and read the urls-file. For each URL in the file, use the SwatHTMLStream to read the web page for that URL, creating an index of the word frequencies for that web page. In your indexes do not include any words contained in the ignores-file.
  3. Using standard input and output, allow the user to search for terms in the web pages. For each search term, display the URL of each page containing the term and the number of occurrences of the term in that web page.

For example:
$ ./search inputFiles/urls.txt inputFiles/ignore.txt
Enter your earch query: lab
Search results for "lab":
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab02.php:  5
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php:  6
        www.cs.swarthmore.edu/~newhall/index.php:  2
        www.cs.swarthmore.edu/~aviv/index.html:  1
        www.cs.swarthmore.edu/~adanner/index.php:  3
        www.cs.swarthmore.edu/~jwaterman/index.php:  2
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab02.php:  4
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab01.php:  5
        www.cs.swarthmore.edu/~richardw/index.html:  2
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php:  6
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab03.php:  8
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab03.php:  2
        www.cs.swarthmore.edu/~soni/cs21/f11/index.php:  20
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php:  7
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php:  17
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php:  5
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab01.php:  8
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php:  5
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php:  9
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab04.php:  5
        www.cs.swarthmore.edu/~soni/cs21/s12/index.php:  19
        www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php:  7
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab01.php:  7
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab05.php:  12
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php:  14
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php:  2
        www.cs.swarthmore.edu/~soni/cs35/f12/index.php:  23
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php:  11
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php:  14
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  16
        www.cs.swarthmore.edu/~soni/index.php:  2

Enter search query: the
Search results for "the":
No results found

Enter search query: intelligence
Search results for "intelligence":
        www.cs.swarthmore.edu/~meeden/index.html:  3
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  2

Enter search query: systems
Search results for "systems":
        www.cs.swarthmore.edu/~newhall/index.php:  11
        www.cs.swarthmore.edu/~aviv/index.html:  2
        www.cs.swarthmore.edu/~adanner/index.php:  6
        www.cs.swarthmore.edu/~meeden/index.html:  5
        www.cs.swarthmore.edu/~richardw/index.html:  3
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  2

Enter search query: machine
Search results for "machine":
        www.cs.swarthmore.edu/~richardw/index.html:  1
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  2
        www.cs.swarthmore.edu/~soni/index.php:  4

Enter search query: database
Search results for "database":
        www.cs.swarthmore.edu/~newhall/index.php:  1
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  2

Enter search query: algorithms
Search results for "algorithms":
        www.cs.swarthmore.edu/~newhall/index.php:  2
        www.cs.swarthmore.edu/~adanner/index.php:  7
        www.cs.swarthmore.edu/~meeden/index.html:  1
        www.cs.swarthmore.edu/~richardw/index.html:  1
        www.cs.swarthmore.edu/~soni/cs21/f11/index.php:  2
        www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php:  1
        www.cs.swarthmore.edu/~soni/cs21/s12/index.php:  2
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab05.php:  1
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php:  2
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php:  4
        www.cs.swarthmore.edu/~soni/cs35/f12/index.php:  6
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php:  1
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php:  9
        www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab08.php:  1
        www.cs.swarthmore.edu/~soni/index.php:  2

Enter your search query: <CTRL-D>
$ 

Note that the order of your results may differ depending on how you process your search, but the content should be the same.

For each URL you should build a tree to store the frequency of each word in the web page; each of these trees will probably have a type such as AVLTree<string,int>. But wait, there's more! You will also need a method to associate each URL with its word-frequency tree. One way to accomplish this is...with another tree! For instance, you could use each URL as a key, and a pointer to the URL's word-frequency tree as the value. This might have the type

AVLTree< string, AVLTree<string,int>* >
There are other reasonable options if this data type is scarier than your Halloween costume.

Additional Requirements:
  1. Pay close attention to what memory allocation is handled for you and what is your responsibility to manage. Remember, every new must have a corresponding delete.
  2. You must use good design principles. Your main() function should read like a paper outline and not be very long. Your functions should encapsulate a well-defined concept.
  3. You must handle possible file errors. What does your program do if you give an incorrect file name? What if your url file contains duplicates? Or an invalid url? How do you know if the url is invalid?

Below we give some specific advice for this lab, including how to read command-line arguments in C++, how to use C++ file streams and the SwatHTMLStream, and how to create your own web page for testing purposes.



Command-line arguments in C/C++

To process command-line arguments in C++ you need to add two arguments to your main function:

  int main(int argc, char** argv) {
     // ...
     return 0;
  }
argc is an int representing the number of arguments entered (including the program name itself), and argv is an array of character arrays, with each character array containing one argument to the program. (These variables can be called anything you want, but it's a long-standing convention to call them argc and argv.) You can then use argc and argv inside your program to access the command-line arguments. For example, if you run your program:
$ ./search inputFiles/urls.txt inputFiles/ignore.txt
argc has the value 3 (there are three parameters). That means that argv is a c-style array with three values. argv[0] is the value "search", argv[1] is "inputFiles/urls.txt" and argv[2] is "inputFiles/ignore.txt". argc is useful to check if the user entered the correct number of arguments. For example, if the user forgets the arguments you can print an error:
$ ./search
Usage: search <url-file> <ignore-file>
Required options
  <url-file>:     a file of urls to index and search
  <ignore-file>:  a file of words to ignore



File streams and the SwatHTMLStream

We've been using cin and cout to get input from and display output to the user, but cin and cout are just two example input and output streams in C++. In this lab you will create and use an input stream to read from a file, much like we've read input from cin. For example:

  #include <iostream>
  #include <string>
  #include <fstream>

  using namespace std;

  int main(int argc, char** argv) {
    ifstream myFile("inputFiles/urls.txt");

    if (!myFile.good()) {
      cerr << "Whoops!  We couldn't open inputFiles/urls.txt.\n";
      return 1;
    }

    string url;
    myFile >> url;
    while (!myFile.eof()) {
      cout << url << endl;
      myFile >> url;
    }
    myFile.close();

    return 0;
  }
To use a file stream you must include the fstream library. This program creates a file input stream, myFile, to read from the file urls.txt. It then checks that the input stream is valid (i.e., that the file exists and is readable), and then uses the input stream to read each input from the file. At the end of the file, the program closes the file and exits.

This program also demonstrates cerr, another output stream like cout. The usual convention is to use cout for standard output when the program is normally executing, but to use cerr for unusual output such as error messages.

For this lab we have also provided SwatHTMLStream, a stream-like class that allows you to read HTML files hosted by the local CS web server, www.cs.swarthmore.edu. To use the SwatHTMLStream you must be logged into a Swarthmore Computer Science lab computer. You can use the SwatHTMLStream much like a file stream:

  #include <iostream>
  #include <string>
  #include "swatHTMLStream.h"

  using namespace std;

  int main(int argc, char** argv) {
    SwatHTMLStream myFile("www.cs.swarthmore.edu/~soni/index.html");

    if (!myFile.good()) {
      cerr << "Whoops!  We couldn't open the url.\n";
      return 1;
    }

    string word;
    myFile >> word;
    while (!myFile.eof()) {
      cout << word << endl;
      myFile >> word;
    }
    myFile.close();

    return 0;
  }
The SwatHTMLStream is not a fully-featured input stream, but can read strings of words (and number-strings) from a given URL. It skips HTML tags (for correctly-formatted HTML) and most punctuation characters, and converts all strings to lowercase.



Creating your own web page
If you want to create your own web page as a test (or otherwise), do the following:
  1. In your home directory, make a public_html directory to hold the web page, then make that directory readable and executable for everyone:
      $ mkdir ~/public_html
      $ chmod a+rx ~/public_html
    
  2. Inside your public_html directory make an index.html file, then make that file readable by everyone:
      $ chmod a+r ~/public_html/index.html
    
    You can start with a simple text file (e.g., Hello world!) and add more to it later. (See other users' web pages for examples of what you can do in HTML.)
  3. After you've saved your index.html file and made it world-readable, use a web browser to view: http://www.cs.swarthmore.edu/~yourUserName


Submitting your work

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.