CS35: Homework #9

CS35 Spring 2002
Homework 9: Adding a GUI front-end to the search engine
and adding a cache of search query results
Due: Thursday, April 25 before 2 am (late Wednesday night)

Problem Description
Getting Started
Java Classes
What to Hand In

You may work with one partner on this assignment.

PROBLEM DESCRIPTION

This program consists of two main parts, and will count towards more of your grade than a regular homework assignment.

Part 1 of the assignment is building a GUI front-end to your search engine that also performs the fetching and displaying URLs job of a web browser. Part 2 is adding caching of query results to your search engine, and speeding up the execution of queries by using the cached results of previous queries.

The solution that you submit should link the two parts together. However, you can start working on each part independently. When you add the second part to the GUI Web browser, print the caching information to the search results window as well as to stdout.

Part 1: GUI

Build a GUI front-end to your search engine. It should have the following 6 GUI components:

a query text box (where the user can type in a query)
a search button
a URL text box (where the user can type in a url)
a fetch URL button
a display area for search results
a display area for fetched urls

Your WebBrowser's main method will start up by taking a url_list and ignore_list command line options, create a ProcessQueries object (from hw#7), and then create the GUI front-end.

The WebBrowser GUI will work like the following:

When the get URL button is selected, the url entered in the URL text box is evaluated, the url's webpage is fetched and its contents are displayed in the display box for fetched urls.
When the search button is selected, the query entered in the query box is evaluated and the search results box displays the ordered list of matching webpages (use your hw#7 solution). In addition, the webpage of the best match is fetched and displayed in the display box for fetched urls.

Part 2: Caching

Below is a demonstration of how your more efficient query processing program should behave. You will save all previous query results in a data structure called the cache. When the user types in a query, you will first check the cache to see if you have processed a similar request in the past.

#####################################
Enter next Query String or -1 to quit
#####################################
artificial

  Results for your Query "artificial":
  -----------------------------------------------
  Query: artificial
        Query Result NOT Found in cache		<-------- print out whether or not query is in cache
        0 words from query found in cache	<-------- if not, print out how many words of the query are in the cache

  Matching URLs:
  www.cs.swarthmore.edu/~meeden   priority: -8

#####################################
Enter next Query String or -1 to quit
#####################################
artificial intelligence

  Results for your Query "artificial intelligence":
  -----------------------------------------------
  Query: artificial intelligence
        Query Result NOT Found in cache
        1 words from query found in cache

  Matching URLs:
  www.cs.swarthmore.edu/~meeden   priority: -17

#####################################
Enter next Query String or -1 to quit
#####################################
intelligence

  Results for your Query "intelligence":
  -----------------------------------------------
  Query: intelligence
        Query Result Found in cache

  Matching URLs:
  www.cs.swarthmore.edu/~meeden   priority: -9

Notice that the first time the user asks about "artificial" we cannot find anything in the cache. But the next time the user asks about "artificial intelligence" we can grab the stored information about "artificial" and perform a search of WordFrequencyTrees for just the "intellignece" part. When the query "intelligence" is entered, its result simply can be obtained from the stored informatation in the cache. Also note that we want to be able to recognize that a query is the same regardless of the word order, so "artificial intelligence" is the same as "intelligence artificial".

GETTING STARTED

Part 1: GUI

Start by reading the Tutorial example about the JEditorPane class, and try out the example (TextSamplerDemoHelp.java) to get an idea of how to display .html files and search results (look at the JEditorPane's setPage and setText methods), and how to associate scroll bars with panes:

http://www.javasoft.com/docs/books/tutorial/uiswing/components/simpletext.html

Getting your GUI to display url files and to display search result strings is easy if you use the JEditorPane's setPage and setText methods. Remember to import javax.swing.text.* and java.net.URL along with the standard Swing imports.

You will probably want to use the GridBagLayout to layout the buttons and the text display in the same panel. This will likely give the best results when the Window is re-sized (you want the display parts of the window to re-size vertically, but probably not the buttons and the search text and url text boxes).

To get the query result to print out nicely in the display, you need to convert the query results to a string (either in html or ASCII). One way to do this is to add a queryToString method (or a queryToHTMLString method) of the ProcessQuery class that takes as input the result returned by the performQuery method and converts the results to an string generating "\n" for new lines or < br > for new lines. You can then call setText on the JEditorPane to display the string.

Part 2: Caching

The cache is represented as a hashtable where the key is the query string. The element stored with the query should be an instance of a new class called SavedResult. A SavedResult contains another hash table storing URLs with their count and a String answer.

hashtable cache
-------------------------------------------------------------------------
|key:		|	|	|	|	|	|	|	|
|  String  	| 	|	|	|	|	|	|	|
|element: 	|  	|	|	|	|	|	|	|
|  SavedResult	|	|	|	|	|	|	|       |
|		| 	|	|	|	|	|	|	|
|		|  |	|	|	|	|	|	|	|
-------------------|-----------------------------------------------------
		   |
                   \/
		SavedResult   
	        ---------     ----------------------------------------
		| table |---->| URL, count  |    |    |    |    |    |	
	        ---------     ----------------------------------------
		| answer|----> String containing query result, or null if  
		---------      this entry was created to satisfy a larger  
                | best  |      query of which this is a part
                | match |
                | URL   |----> The best matching URL as a string (this is for 
                ---------      automatically displaying the best matching URL
                               in your Web browser gui)


1. For each word in query, create an alphabetically ordered query
   string that is all one case and the same case as you stored tokens
   into your WordFrequencyTrees

2. Search for the query in the cache
        (a) if a match is found
            check answer field of entry
                if null, then create answer based on table
                return result 
                and add best matching URL string to this entry as well
 	(b) if a match is not found
	     (A) create a hashtable for the query string
	         for each word in the query
                 (1) if it is in the cache then get its hashtable and 
		     merge its contents with the hashtable for the query
		 (2) otherwise
		    (a) create an entry for the word by processing the 
		        WordFrequencyTrees for the word and creating a 
		        hashtable of (URL, count) entries hashed on URL
			(the result field is null)
		    (b) merge this hashtable with the hashtable for the query
	     (B) create a result PQ from the entries in the query's hashtable
		 (priority is related to count field of URL), then use
		 this priority queue to create a String representation
	         of the query answer
	     (C) add the (query, (hashtable, answer)) to the cache
	     (D) return result to caller 


3. After each query print out:
	(a) whether or not the query was found in the cache 
	(b) if not, the number of query words that were found in the cache

The best place to implement caching is in your ProcessQueries class. However, if you use my ProcessQueries class then you will need to implement another class to perform the caching and make calls to my ProcessQueries class methods.

Think carefully about which implementation of the HashTable you plan to use for your cache: does it make a difference?

CLASSES

For this assignment you will start with your solution to assignment 7. If you did not complete assignment 7, then you can you can use our solution as a starting point for assignment 9.

Our solution .class files can be copied from: ~newhall/public/cs35/hw07/solution/

Documentation for all of these classes is in ~newhall/public/cs35/hwdoc/.

Our WebBrowser

In addition, you can try running our solution in:

~newhall/public/cs35/hw09/solution/

To run: java WebBrowser url_list

or: java WebBrowser url_list html_ignorelist

Try entering artificial intelligence in the search box and http://www.cs.swarthmore.edu/~cfk in the Fetch URL box and click on the corresponding buttons and see what happens.

Note: the links on the displayed webpage don't work. You do not have to add support to make links work. However, there is a way to add this functionality, and you can try it out if you'd like.

HAND IN

Using cs35handin, hand in a single tar file containing:

All .java files necessary for compiling your code (include any of the classes that we give you that you use in your solution).
A Makefile for building your code
A README file with:
1. Your name (and your partner's name if you had one)
2. The name of the class containing your main method
3. Answer to the following question:
  - What is the complexity of processing a query to produce the resulting ordered URL list?
    Compute the complexity for different cases such as when the query result is found in the cache, when part of the query is found in the cache, and when no part of the query is found in the cache. Explain how you computed the complexity of each case by providing an expression of each major component of the cost (make sure to clearly indicate what each term measures in terms of the algorithm).

If you work with a partner, please only one of you submit your joint solution using cs35handin.

CS35 Spring 2002 Homework 9: Adding a GUI front-end to the search engine and adding a cache of search query results Due: Thursday, April 25 before 2 am (late Wednesday night)

CONTENTS

PROBLEM DESCRIPTION

Part 1: GUI

Part 2: Caching

GETTING STARTED

Part 1: GUI

Part 2: Caching

CLASSES

Our WebBrowser

HAND IN

CS35 Spring 2002
Homework 9: Adding a GUI front-end to the search engine
and adding a cache of search query results
Due: Thursday, April 25 before 2 am (late Wednesday night)