CS35 Lab 9: Ordering web search results with priority queues

Due by 11:59 p.m., Wednesday, November 09, 2011

Your work for this lab consists of two parts:

  1. Complete and test a templated implementation of a binary heap, BinaryHeap.
  2. Extend your work from Lab 08 to support more sophisticated web queries containing multiple search terms and order web search results using your binary heap.

So far in this course we have been using very general data structures to organize the data for our programs, but we have been just using functions to organize our program's main computation. In addition to the BinaryHeap implementation and slight extensions to Lab 08, one of the goals of this lab is to improve your program design. Specifically, you must reorganize most of the code from your search program into a separate QueryProcessor class, an object that indexes the web pages and then supports web queries. When you are done, your main program should be a relatively simple program that mostly just processes command-line arguments, constructs a QueryProcessor, and then calls a QueryProcessor function for each user query.

As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in red:

In the next section we introduce our BinaryHeap implementation. We then describe your main task (the sortedSearch.cc program) and several specific tools you'll need to complete it.



Implementing a BinaryHeap

You should start by implementing and testing the BinaryHeap class. We have defined the necessary data and many functions in binaryheap.h; you may edit this file if you want but it is probably not necessary. You should implement each function defined in the binaryheap.inl implementation file.

The main data in the heap is represented as an array of priority-value pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class from week 04.

You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way to ensure that each private function is executed and tested.



The QueryProcessor and your main sortedSearch program

The purpose of the QueryProcessor class is to encapsulate all the functionality of web indexing and search. In many ways its interface is similar to the main search program from Lab 08:

Once your QueryProcessor is complete, your main sortedSearch program can merely process the command-line arguments and user input, using each user-input query as an argument to the QueryProcessor::executeQuery function and printing the highest-ranked URLs.

For example:

  $  ./sortedSearch urls.txt ignore.txt
  Enter your search query:  tree
  Searching for "tree":
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab08.php: 21
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab07.php: 13

  Enter your search query:  node
  Searching for "node":
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab07.php: 11
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab08.php: 10
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  tree node
  Searching for "tree node":
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab08.php: 31
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab07.php: 24
      www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter your search query:  sleep time
  Searching for "sleep time":
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab04.php: 23
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab03.php: 8
      www.cs.swarthmore.edu/~charlie/cs21/s11/Labs/lab06.php: 6
      www.cs.swarthmore.edu/~charlie/cs21/f10/index.php: 6
      www.cs.swarthmore.edu/~charlie/cs21/s11/index.php: 6
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab08.php: 5
      www.cs.swarthmore.edu/~charlie/cs21/s11/Labs/lab05.php: 4
      www.cs.swarthmore.edu/~charlie/cs21/f10/Labs/lab06.php: 3
      www.cs.swarthmore.edu/~charlie/cs35/f11/labs/lab01.php: 3
      www.cs.swarthmore.edu/~charlie/cs21/s11/Labs/lab07.php: 3
      www.cs.swarthmore.edu/~charlie/cs21/f10/Labs/lab04.php: 2
      www.cs.swarthmore.edu/~charlie/cs21/f10/Labs/lab05.php: 2
      www.cs.swarthmore.edu/~richardw/index.html: 1
      www.cs.swarthmore.edu/~charlie/cs35/f11/index.php: 1
      www.cs.swarthmore.edu/~charlie/cs21/f10/Labs/lab07.php: 1

  Enter your search query:  <CTRL-D>
  $


Parsing user input in C++

Parsing user input in C++ can be less pleasant than in other languages. In the QueryProcessor we've provided a private function, split, that behaves much like the split function in Python. Specifically, given a string as input, it returns a pointer to a heap-allocated list of the whitespace-separated tokens from that string. For example,

  split("This is    a test");
would return a pointer to list containing "This", "is", "a", and "test". The caller is responsible for freeing the memory referred to by that pointer.



Submitting your work

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.