CS35: Homework #6
You may work with one partner on this assignment.
For this program you will implement part of a web search engine that orders web pages based on how well they match a search query. The best match is the web page with the highest word-frequency counts for the words in the query string. Your main class for this assignment should be called ProcessQueries and will be called as follows:

java ProcessQueries urlListFile ignoreFile
The urlListFile should contain a list of URLs, one per line. These URLs have to correspond to files that you can open locally (URLs whose corresponding .html file is stored on our file system). An example urlListFile might contain:
www.cs.swarthmore.edu/~cfk
www.cs.swarthmore.edu/~erkan
www.cs.swarthmore.edu/~griffin
www.cs.swarthmore.edu/~knerr
www.cs.swarthmore.edu/~marshall
www.cs.swarthmore.edu/~meeden
www.cs.swarthmore.edu/~newhall
www.cs.swarthmore.edu/~knerr/cs21/s00/cs21.html
www.cs.swarthmore.edu/~newhall/cs35/index.html
www.cs.swarthmore.edu/~newhall/cs75/index.html
The ignoreFile should contain a list of words that you would like to ignore as you count word frequencies in html files (just as you did in the last assignment).
In order to process queries from a user, you'll need to create a new class that joins together a URL string with a WordFrequencyTree representing that web page's content. Call this class URLContent. Your program should create a list of URLContent objects, one for each URL that appears in the urlListFile.
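A URLContent object might look like the following sketch. Since the provided WordFrequencyTree interface is not spelled out here, a Map<String,Integer> stands in for it; the real class will have its own lookup method.

```java
import java.util.HashMap;
import java.util.Map;

class URLContent {
    private final String url;
    private final Map<String, Integer> frequencies; // stand-in for WordFrequencyTree

    URLContent(String url, Map<String, Integer> frequencies) {
        this.url = url;
        this.frequencies = frequencies;
    }

    String getURL() { return url; }

    // Return the frequency count for a word, or 0 if the page never uses it.
    int frequencyOf(String word) {
        return frequencies.getOrDefault(word, 0);
    }
}
```

Your real version would pair the URL with the WordFrequencyTree you built for that page's .html file.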
Once you have processed all the URLs in the list (you should gracefully handle invalid URLs), your program will enter a loop, as shown below, that prompts the user to enter a search query (or -1 to quit) and then lists all URLs that match the query, ordered from best match to worst. Include each result URL's priority in parentheses after it. URLs of web pages that do not contain any of the words in the query should not appear in the result list.
Enter a query or -1 to quit.
Search for: neural networks
Relevant pages:
www.cs.swarthmore.edu/~meeden (priority = x)
www.cs.swarthmore.edu/~marshall (priority = y)
Search for: evolutionary computation
Relevant pages:
www.cs.swarthmore.edu/~meeden (priority = x)
www.cs.swarthmore.edu/~marshall (priority = y)
www.cs.swarthmore.edu/~cfk (priority = z)
Search for: -1
To find the results of the query in order, you will process each WordFrequencyTree in the list of URLContent objects, create a priority queue element for it, and add it to a priority queue for the search. Then use the priority queue to print out the matching URLs in order. The priority value is based on how well the web page matches the words in the query. Remember that in a priority queue low values equate with high priority.
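The per-query flow could be sketched as follows. Here java.util.PriorityQueue stands in for the provided HeapPriorityQueue (its method names may differ), a Map stands in for each page's WordFrequencyTree, and the priority is assumed to be the negated total frequency, so lower values come out first, matching the low-value-equals-high-priority convention.

```java
import java.util.*;

class QueryRanker {
    // A (priority, url) pair; lower priority value = better match.
    record Result(int priority, String url) {}

    // pages maps each URL to its word-frequency table
    // (a Map stands in for the provided WordFrequencyTree).
    static List<Result> rank(Map<String, Map<String, Integer>> pages, String[] queryWords) {
        PriorityQueue<Result> pq =
            new PriorityQueue<>(Comparator.comparingInt(Result::priority));
        for (Map.Entry<String, Map<String, Integer>> page : pages.entrySet()) {
            int total = 0;
            for (String w : queryWords) {
                total += page.getValue().getOrDefault(w, 0);
            }
            if (total > 0) {                             // skip pages matching no query word
                pq.add(new Result(-total, page.getKey())); // negate: low value = high priority
            }
        }
        List<Result> ordered = new ArrayList<>();
        while (!pq.isEmpty()) ordered.add(pq.poll());    // best match first
        return ordered;
    }
}
```

Your version would insert (URLContent, key) pairs into your own HeapPriorityQueue and print each URL with its priority as it is removed.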
Much of this assignment will be figuring out how to use some of the classes that we give you. Once you have run the test programs for these classes and understand how they work, you can start implementing your own code.
Start by implementing the insert method in the HeapPriorityQueue class. Test that this works before moving on to the next part.
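The insert you need to write is the standard min-heap "bubble-up". Below is a self-contained sketch on a bare array of int keys; the provided HeapPriorityQueue stores (element, key) pairs and its field names will differ, but the bubble-up logic carries over.

```java
// Minimal array-based min-heap illustrating the insert ("bubble-up") step.
class MinHeap {
    private int[] keys = new int[8];
    private int size = 0;

    void insert(int key) {
        if (size == keys.length) {                     // grow when full
            int[] bigger = new int[2 * keys.length];
            System.arraycopy(keys, 0, bigger, 0, size);
            keys = bigger;
        }
        keys[size] = key;                              // place at next free leaf
        int i = size++;
        while (i > 0 && keys[(i - 1) / 2] > keys[i]) { // bubble up past larger parents
            int tmp = keys[(i - 1) / 2];
            keys[(i - 1) / 2] = keys[i];
            keys[i] = tmp;
            i = (i - 1) / 2;
        }
    }

    int min() { return keys[0]; }                      // smallest key = highest priority
    int size() { return size; }
}
```

Note how the new key starts at the next free leaf and swaps with its parent at index (i - 1) / 2 until the heap property holds again.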
Next, implement the part of your program that processes the urlListFile. For each URL read in, create the appropriate file name according to the following rules. Then calculate the word frequencies for that file.
Next, implement the part that reads in a search query and builds a priority queue by inserting (URLContent, key) pairs, where the key is the priority of the URL's WordFrequencyTree based on how well it matches the query string. Then print out the matching URLs in order of best to worst match.
Your program should handle multiple word queries, and return the best matches based on all words in the query. For example, the query "computer science department" should search each URL's WordFrequencyTree for all three words to determine the URL's priority.
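One plausible priority function for a multi-word query is sketched below. It assumes the priority is the negated sum of the query words' frequencies, so pages that mention the query words more often get lower (better) priority values; the handout leaves the exact formula up to you, and a Map again stands in for WordFrequencyTree.

```java
import java.util.Map;

class QueryPriority {
    // Sum each query word's frequency in the page's word counts, then negate
    // so a low value means a high priority.
    static int priority(Map<String, Integer> wordCounts, String query) {
        int total = 0;
        for (String word : query.trim().toLowerCase().split("\\s+")) {
            total += wordCounts.getOrDefault(word, 0);
        }
        return -total;
    }
}
```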
Classes you'll need for this assignment include all the classes for assignment 6 plus the following (these can be copied from ~newhall/public/cs35/hw07/classes/):
ReadStream r = new ReadStream(new FileInputStream(new File(url_list)));

Then just enter a loop that reads in the next URL (use the readLine method) until eof() is true.
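The same loop can be sketched with the standard library, using BufferedReader in place of the provided ReadStream; with ReadStream you would test r.eof() instead of checking for a null line.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

class URLListReader {
    // Read one URL per line until end of file, skipping blank lines.
    static List<String> readURLs(Reader source) throws IOException {
        BufferedReader in = new BufferedReader(source);
        List<String> urls = new ArrayList<>();
        String line;
        while ((line = in.readLine()) != null) {   // null signals end of file
            line = line.trim();
            if (!line.isEmpty()) urls.add(line);
        }
        return urls;
    }
}
```

Taking a Reader rather than a file name makes the loop easy to test against an in-memory string before wiring it up to your urlListFile.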