CS35 Project: Part 2


Part 2: Processing User Queries to Find the Most Relevant Web Pages

CONTENTS


PROBLEM DESCRIPTION

For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. The best match is the web page with the highest word frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:

The urlListFile should contain a list of URLs, one per line. These URLs have to correspond to files that you can open locally (URLs whose corresponding .html file is stored on our file system). An example urlListFile might contain:

The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last assignment).

In order to process queries from a user, you'll need to create a new class that joins together a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile.

Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop as shown below, which prompts the user to enter a search query (or -1 to quit), and then lists all URL's that match the query in order of the best match first and the worst match last. Include each result URL's priority in parenthesis after each result. URLs of web pages that do not contain any of the words in the query should not appear in the result list.

To find the results of the query in order, you will process each WordFrequencyTree in the list of URLContent objects, create a priority queue element for it, and add it to a priority queue for the search. Then use the priority queue to print out the matching urls in order. The priority value is based on how well the web page matches the words in the query. Remember that in a priority queue low values equate with high priority.


GETTING STARTED

Much of this assignment will be figuring out how to use some of the classes that we give you. Once you have run the test programs for these classes, and understand how they work, then you can start implementing code.

Start by implementing the insert method in the HeapPriorityQueue class. Test that this works before moving on to the next part.

Next, implement the part of your program that processes the urlListFile. For each URL read in, create the appropriate file name according to the following rules. Then calculate the word frequencies for that file.

Next, implement that part that reads in a search query, builds a priority queue by inserting (URLContent, key) pairs where the key is the priority of the URL's WordFrequencyTree based on how well it matches the query string. Then print out the matching URLs in order of best to worst match.

Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority.


CLASSES

For this and all remaining homework assignments, you will need to copy the .java and .class files that we give you from our directories.

Classes you'll need for this assignment include all the classes for assignment 6 plus the following (these can be copied from ~newhall/public/cs35/hw07/classes/):

In addition, if you did not complete assignment 6, then you can use our solution as a starting point for your assignment 7.
The .java files that we gave you as a starting point, can be copied from ~newhall/public/cs35/hw06/classes/.
Our solution .class files, for the following classes, can be copied from ~newhall/public/cs35/hw06/solution/

HAND IN

Using cs35handin, hand in a single tar file containing:
  1. All .java files necessary for compiling your code (include any of the classes that we give you that you use in your solution).
  2. A Makefile for building your code
  3. A README file with:
    1. Your name (and your partner's name if you had one)
    2. The name of the class containing your main method
If you work with a partner, please only one of you submit your joint solution using cs35handin.