CS35 Lab 9: Ordering web search results with priority queues

Due by 11:59 p.m., Wednesday, November 14, 2012

Your work for this lab consists of two parts:

  1. Complete and test a templated implementation of a binary heap, BinaryHeap.
  2. Extend your work from Lab 08 to support more sophisticated web queries containing multiple search terms and order web search results using your binary heap.

So far in this course we have been using very general data structures to organize the data for our programs, but we have been just using functions to organize our program's main computation. In addition to the BinaryHeap implementation and slight extensions to Lab 08, one of the goals of this lab is to improve your program design. Specifically, you must reorganize most of the code from your search program into a separate SearchEngine class, an object that indexes the web pages and then supports web queries. When you are done, your main program should be a relatively simple program that mostly just processes command-line arguments, constructs a SearchEngine, and then calls a SearchEngine function for each user query (or search).

As usual, you can get the initial files for this lab by running update35. The new files are listed here, and the files that you will need to edit are highlighted in blue:

In the next section we introduce our BinaryHeap implementation. We then describe your main task (the sortedSearch.cpp program) and several specific tools you'll need to complete it.



Implementing a BinaryHeap

You should start by implementing and testing the BinaryHeap class. Be sure to map out a memory diagram of an array representation of a heap. When implementing functions, ask what your data looks like. If there is a child for a current position, what index is it at? How do I know if a index is a leaf, or specifcally a child? HINT: You don't need to actually look at the data in the nodes to figure this out.

We have defined the necessary data and many functions in binaryheap.h; you may edit this file if you want but it is probably not necessary. You should not add any additional public methods, unless for temporary debugging purposes. You should implement each function defined in the binaryheap.inl implementation file.

The main data in the heap is represented as an array of priority-value pairs. For priorities of type P and values of type V, each pair therefore has type Pair<P,V>. This array stores a level-order traversal of the binary heap, as described in lecture. Note that these are not pointers to Pair values, so you can't delete or allocate them for memory mangagement. The BinaryHeap is internally responsible for the memory management of this array, much like the data array in the ArrayList class from week 04.

You must test each BinaryHeap function you write. For private internal functions, you should use the BinaryHeap's public interface in a way to ensure that each private function is executed and tested. Your testHeap.cpp file will be analyzed when grading, and you will be required to show your testing strategy when receiving help from a ninja or myself.



The SearchEngine and your main sortedSearch program

The purpose of the SearchEngine class is to encapsulate all the functionality of web indexing and search. In many ways its interface is similar to the main search program from Lab 08:

Once your SearchEngine is complete, your main sortedSearch program can merely process the command-line arguments and user input, using each user-input query as an argument to the SearchEngine::executeSearch function and printing the highest-ranked URLs.

For example:

  $  ./sortedSearch urls.txt ignore.txt
  Enter search query:  tree
  Search results for "tree":
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 40
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9

  Enter search query:  node
  Search results for "node":
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 12
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 5
	www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter search query:  tree node
  Search results for "tree node":
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 52
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 9
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 5
	www.cs.swarthmore.edu/~newhall/index.php: 1

  Enter search query:  sleep time
  Search results for "sleep time":
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 12
	www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 6
	www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 5
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 5
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 4
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 4
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 4
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php: 4
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 3
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 3
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab01.php: 3
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 2
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php: 2
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 1
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 1
	www.cs.swarthmore.edu/~richardw/index.html: 1
	www.cs.swarthmore.edu/~soni/cs35/f12/index.php: 1

  Enter search query: lab
  Search results for "lab":
	www.cs.swarthmore.edu/~soni/cs35/f12/index.php: 22
	www.cs.swarthmore.edu/~soni/cs21/f11/index.php: 20
	www.cs.swarthmore.edu/~soni/cs21/s12/index.php: 19
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php: 17
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab04.php: 14
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab06.php: 14
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab05.php: 12
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab07.php: 11
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php: 9
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab01.php: 8
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab03.php: 8
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab01.php: 7
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php: 7
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php: 7
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php: 6
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php: 6
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php: 5
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab01.php: 5
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab04.php: 5
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php: 5
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab02.php: 5
	www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab02.php: 4
	www.cs.swarthmore.edu/~adanner/index.php: 3
	www.cs.swarthmore.edu/~newhall/index.php: 2
	www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab03.php: 2
	www.cs.swarthmore.edu/~richardw/index.html: 2
	www.cs.swarthmore.edu/~soni/index.php: 2
	www.cs.swarthmore.edu/~jwaterman/index.php: 2
	www.cs.swarthmore.edu/~soni/cs35/f12/Labs/lab03.php: 2
	www.cs.swarthmore.edu/~aviv/index.html: 1

  Enter search query: the
  Search results for "the":
    No results found

  Enter search query:  <CTRL-D>
  $
Note that in this lab, you can now search multiple-word phrases. You will want to obtain the entire user phrase at once (e.g., using getline()) and then query each word individually (HINT: see the instructions below on how to split a string). The final result is the union of all occurences of each word in the query. So, for example, tree occurred 21 times on the lab08 page and node occurred 10 times on that page. The resulting priority for tree node, as a result, is simply the sum of those two individual frequences (i.e., 31).

The requirements for the SearchEngine are the same as for last week's lab.. That is, you should handle files properly including in the case that an incorrect file name is given. And you must properly deallocate any heap memory belonging to the object.

Parsing user input in C++

Parsing user input in C++ can be less pleasant than in other languages. In the SearchEngine we've provided a private function, split, that behaves much like the split function in Python. Specifically, given a string as input, it returns a pointer to a heap-allocated list of the whitespace-separated tokens from that string. For example,

  split("This is    a test");
would return a pointer to list containing "This", "is", "a", and "test". The caller is responsible for freeing the memory referred to by that pointer.

Detecting memory problems using valgrind
valgrind is a tool for finding memory access errors to heap memory (i.e, memory that is dynamically allocated with new). Handling memory is an extremely important concept in C++, but also notoriously difficult to debug and verify. Being competent in valgrind can save you hours of debugging by providing you useful information quickly.

A helpful tutorial is available via Prof. Newhall's tutorial. You can test valgrind out by running it in last week's lab. To do so, type the program name valgrind at the Unix prompt followed by the exact execution call for your program. For example if your program was run last week by typing:

$ ./search inputFiles/urls.txt inputFiles/ignore.txt
then to run valgrind:
$ valgrind ./search inputFiles/urls.txt inputFiles/ignore.txt

==27852== Memcheck, a memory error detector
==27852== Copyright (C) 2002-2010, and GNU GPL'd, by Julian Seward et al.
==27852== Using Valgrind-3.6.1 and LibVEX; rerun with -h for copyright info
==27852== Command: ./search inputFiles/urls.txt inputFiles/ignore.txt
==27852== 
Enter search query: time
Search results for "time":
www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab01.php:  1
www.cs.swarthmore.edu/~richardw/index.html:  1
www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab02.php:  4
www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab06.php:  11
www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab04.php:  1
www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab05.php:  2
www.cs.swarthmore.edu/~soni/cs21/f11/index.php:  6
www.cs.swarthmore.edu/~soni/cs21/f11/Labs/lab07.php:  2
www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab06.php:  4
www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab05.php:  1
www.cs.swarthmore.edu/~soni/cs21/s12/index.php:  5
www.cs.swarthmore.edu/~soni/cs21/s12/Labs/lab07.php:  3
www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab04.php:  4
www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab03.php:  4
www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab07.php:  1
www.cs.swarthmore.edu/~soni/cs35/s12/Labs/lab06.php:  2
www.cs.swarthmore.edu/~soni/cs35/s12/index.php:  1

Enter search query: <CTRL-D>
==27852== 
==27852== HEAP SUMMARY:
==27852==     in use at exit: 859,137 bytes in 26,330 blocks
==27852==   total heap usage: 157,727 allocs, 131,397 frees, 4,854,882 bytes allocated
==27852== 
==27852== LEAK SUMMARY:
==27852==    definitely lost: 768 bytes in 32 blocks
==27852==    indirectly lost: 858,369 bytes in 26,298 blocks
==27852==      possibly lost: 0 bytes in 0 blocks
==27852==    still reachable: 0 bytes in 0 blocks
==27852==         suppressed: 0 bytes in 0 blocks
==27852== Rerun with --leak-check=full to see details of leaked memory
==27852== 
==27852== For counts of detected and suppressed errors, rerun with: -v
==27852== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 4 from 4)
Uh-oh! I didn't take care of all my memory. To get more details, add the --leak-check=full flag to get a stack trace of where all lost heap memory originated:
$ valgrind --leak-check=full ./search inputFiles/urls.txt inputFiles/ignore.txt

Submitting your work

Once you are satisfied with your program, hand it in by typing handin35 at the unix prompt.

You may run handin35 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize after handing in some programs that you'd like to make a few more changes to them.