CS35 Lab 10: Using a hash table to cache query results

Due by 11:59 p.m., Wednesday, April 11, 2012
I encourage you to work with a partner on this lab.

Introduction

This lab is the third part of a larger project to build a web browser with a search engine for a limited portion of the web. In the first part you summarized the content of a web page by creating a binary search tree that contained the word frequencies of all words that appeared in the page that were not in a given file of words to ignore. In the second part you read in a list of URLs and constructed word frequency trees to summarize their content. Next you prompted the user for a query, searched for each query word in the saved word frequency trees, and then used a priority queue to report the most relevant pages for the query.

In the third part of this project, you will make the search for relevant pages more efficient by caching search results in a dictionary. Each time a query is entered, you will first check a dictionary to see if this query has been answered before. If so, you can get the results directly from the dictionary without any further processing. If not, you will create the results as you did in last wee's lab and add them to the dictionary.

We'd like to be able to use the cache as much as possible to make our web browser fast. To this end we will remove all ignore words from the query, sort the remaining words, and concatenate them together to make a string key for our dictionary. In this way the queries "computer science department" and "department of computer science" will both end up with the same key "computer department science", assuming "of" is one of the ignore words. Using this approach will allow us to find saved results more frequently, and take the most advantage of the cache.

Your work for this lab consists of three parts:

  1. Complete an implementation of a HashTable, which implements the Dictionary interface.
  2. Augment the BinaryHeap from Lab 09 with a copy constructor, which constructs a new BinaryHeap from another BinaryHeap.
  3. Extend your search program from Lab 09 to use a hash table to cache query results, returning a PriorityQueue from the query result cache when possible instead of recomputing the query result.

As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in blue:

You will also need to include your queryprocessor.h, queryprocessor.cpp, and sortedSearch.cpp solutions from Lab 09 and modify the QueryProcessor files to support query result caching. Begin by copying these files over

$ cd
$ cd cs35/labs/09
$ cp queryprocessor* sortedSearch.cpp binaryHeap.inl ../10



Completing the HashTable implementation

You should start by completing and testing the HashTable class. Our HashTable uses an array of CircularArrayLists to store the data within the hash table, using chaining to resolve collisions. Note that this is not a CircularArrayList of CircularArrayList. table is a standard C++ array, where each index contains an actual CircularArrayList object (not a pointer). You should be able to draw this out as a memory diagram.

Each CircularArrayList stores key-value pairs; for keys of type K and values of type V each key-value pair therefore has type Pair<K,V>.

Please complete each function marked with a TODO comment in hashtable.inl. You must test each HashTable function you implement using the testHash program.



Implementing the BinaryHeap's copy constructor

Once you've completed and tested the HashTable class, continue by implementing and testing the BinaryHeap copy constructor. A copy constructor is a constructor that creates a copy of a class from another instance of the same class, in this case a BinaryHeap. (The parameter for a copy constructor actually needs to be a const reference to an instance of the same class, but this is a C++ detail that you don't need to learn.)

To copy an object you can usually directly copy (by value) most data from the source object, but you must avoid copying some data (such as a pointer to a memory location) that would make the source object and the copy dependent on each other. For an example of a copy constructor, see the newly-provided CircularArrayList class.

The copy constructor is invoked just as any other constructor, but the argument is itself another instance of the class. So, for example:

BinaryHeap < int,int > heap1; //Invokes default constructor

//Insert elements into heap1

BinaryHeap < int,int > cpy (heap1); //Invokes copy constructor, copying 
                                 // all of the data from heap1 over to cpy
This works exactly the same way with dynamic memory except we need to dereference pointers since the actual object needs to be sent in:
BinaryHeap < int,int >* heap1 = new BinaryHeap< int,int > //Invokes default constructor

//Insert elements into heap1

BinaryHeap < int,int >* cpy = new BinaryHeap< int,int >(*heap1);



The QueryProcessor and your main sortedSearch program

If your sortedSearch program from Lab 09 was well-designed, you can complete this portion of the lab by editing only your queryprocessor.h and queryprocessor.inl implementation, without modifying sortedSearch.cpp.

As described in the introduction above, your modified QueryProcessor should first check a dictionary to learn if this query has been answered before; if so, it can return a copy of the result (using the BinaryHeap's copy constructor!) from the dictionary without any further processing. (It's important to return a copy -- and not a value from the cache itself -- to prevent the main program from modifying the BinaryHeaps inside the query cache.) If the query has not been answered before, you compute the query result as in Lab 09, save the results in the dictionary, and then return them.

Your query result cache should be a hash table using query strings as keys and PriorityQueue pointers (BinaryHeap pointers) as values. To maximize the chance that a query string uses the cache, you should first convert the query string into a canonical form before using it as the cache key. To create the cache key from a search query, do the following:

  1. Convert the search query to all lower case, using the tolower function. You'll need to #include <cctype> before using tolower. tolower converts one character to lower case, so you should use an accumulator to convert each individual character and add it on to the new query term. E.g.,
          string word = "Hello";
          string lower = "";
          for(int i = 0; i < word.length(); i++)
            lower += tolower(word[i]);
          
  2. Split the search query into whitespace-delimited words using the split function.
  3. Remove all ignored words from the search query.
  4. Sort the remaining words.
  5. Concatenate the sorted remaining words into a single string, with each word separated by a space.
The final string (after step 5) will be the cache key for the search query.

For example, if "of" is an ignored word, the search query "department of computer science" would be converted to the cache key "computer department science". Similarly, "Computer Science department" would also be converted to "computer department science", which would allow the cache to return a saved query result instead of recomputing the result.

Please print output to tell us when a query result is newly-computed and when a query result was used from the cache. For example:

  $  ./sortedSearch inputs/urls.txt inputs/ignore.txt
  Enter your search query:  tree
  Searching for "tree":
  Cache miss.
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 33
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 21
      www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9

  Enter your search query:  node
  Searching for "node":
  Cache miss.
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 12
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 10
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  tree node
  Searching for "tree node":
  Cache miss.
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31
      www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6
      www.cs.swarthmore.edu/~newhall/index.php: 1
  
  Enter your search query:  tree node
  Searching for "tree node":
  Returning query result from cache for "node tree":
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31
      www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  A Node Tree
  Searching for "A Node Tree":
  Returning query result from cache for "node tree":
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31
      www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  the a nOdE trEE the a
  Searching for "the a nOdE trEE the a":
  Returning query result from cache for "node tree":
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php: 45
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab08.php: 31
      www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9
      www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php: 6
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  <CTRL-D>
  $


Submitting your work

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.