CS21 Lab 8: Searching the Web

Due by midnight April 2

Run update21, to create the cs21/labs/08 directory. Then cd into your cs21/labs/08 directory where you will complete the program in the file webQueries.py.


Introduction

In this lab you will build a mini web search engine. You will read in a set of web pages based on their URLs. For each page, using linear search, you will build a sorted list of unique words that appear on that page and the number of times that they appear. Then, using binary search, you will respond to queries from the user by summing up the number of times each query word appears on each web page. The more times the query words appear, the more relevant the page. You will report the search results in relevance order, only showing those pages that have at least one occurrence of a query word. Here is a sample run of how your program will work.

In addition, you may optionally choose to create your own web page on the CS system and add it to the search engine. For example, a very basic web page has been created for the user lmas. It has a single sentence which reads "I like to play ultimate frisbee, soccer, and run track." Once this web page is added to the web search engine, here is another sample run demonstrating that this web page is now being searched as well.

Getting Started

Edit the file webQueries.py where you will implement your mini web search engine. Notice that the file includes several completed functions, some stubbed out functions, and some unit tests for these stubbed out functions. Read through the explanation below and follow the instructions highlighted in blue.

  1. The completed functions:

    • getURLs()
      This function generates valid URLs for a list of users on the the CS system and returns this list.

    • extractWords(url)
      This function opens and reads the web page associated with the given URL. It extracts the text from the web page, making all words lower case, removing words that are shorter than three characters, removing words that contain only digits, and removing punctuation.

    The main program includes a test of these completed functions:

    urls = getURLs()
    wordList = extractWords(urls[0])
    print(wordList)
    
    Test these functions now. Replace index 0 with the index of your cs21 instructor in the url list, and then execute the program to see the specific word list generated by your instructor's web page. If you go to your instructor's web page you will see that this list of words matches the words, in order, that appear on their home page (excluding the short and numeric words).
  2. The stubbed out functions:

    • findSortedPosition(word, sortedWordCounts)
      This function uses linear search to determine the correct location of a word in a sorted list. You may not use any of python's built in sort functions to accomplish this. The sorted list, called sortedWordCounts, contains sublists, where each sublist is a word and a count. For example, suppose that sortedWordCounts is the following list:
      [["apple", 3], ["banana", 1], ["mango", 2]]
      
      Notice that this list is sorted based on the dictionary ordering of the words: "apple" comes before "banana", which comes before "mango". Remember that a single index into such a list will give you the entire sublist:
      sortedWordCounts[1] # The sublist ["banana", 1]
      
      A double index into such a list will give you a component of the sublist:
      sortedWordCounts[1][0] # The string "banana"
      sortedWordCounts[1][1] # The count 1 
      
      The purpose of this function is to use linear search to find the correct index for the given word so that it will be placed in sorted order into the existing list. There are two cases you will need to handle:
      • If the word is already in the list, simply return its index.
      • If the word is NOT in the list, then use the list method insert(index, item) to insert at the index location a new item that is a sublist containing the word and a count of 0. Then return the index.

      Implement the findSortedPosition function now. Feel free to use the linear search code that we developed in class as the starting point for this function. You will need to adjust it to work with lists of lists and to stop when you find the correct location for a word that is NOT in the list. Look at the function called unitTests1. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.

    • countWords(wordList)
      This function takes a list of the extracted words from a web page and builds a sorted list of the unique words combined with a count of how often they appear. It uses the function findSortedPosition as a helper function. Initially, it sets the sorted list to the empty list. It then loops through the word list, incrementing the counter for that word in the sorted list. Remember that findSortedPosition will insert new words, with a count of 0, in the correct sorted position. So this function needs to access the sublist at the indicated index found by findSortedPosition, and update the count by 1 each time. It should return the final sorted list.

      Implement the countWords function now. Look at the function called unitTests2. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.

    • findWordCount(word, sortedWordCounts)

      This function uses binary search to look up the given word in a sorted list of words and their counts.

      Implement the findWordCount function now. Feel free to use the binary search code that we developed in class as a starting point. You will need to adjust it to work with a list of lists. Look at the function called unitTests3. This contains test cases showing how this function should work. In the main program, call these unit tests to ensure that your implementation is correct.

Once you have completed the stubbed out functions and they are passing the unit tests, you are ready to begin designing the main program and completing any additional functions you may need.


Designing the main program

Your program should produce output that looks like the sample run.

Use top-down design to think about the major steps the main program will need to complete (you do not need to send us your design). Then begin building the program incrementally. Here are some tips for how to proceed:

  1. Start by getting the list of URLs you will search. Then build the sorted word, count list for each URL. You will need a way to save each sorted list with its URL.

    Python has a data structure called a dictionary that would work well for this. A dictionary stores keys and associates a value with them. For example:

    d = {}              #create an empty dictionary called d
    d['a'] = 1          #store the key 'a' with the value 1
    d['b'] = 2          #store the key 'b' with the value 2
    for key in d:       #prints the keys and values in d
      print key, d[key]
    
    To keep track of the web page data, you can use a dictionary, where the keys are URLs and the values are sorted word, count lists.

  2. Once you have processed all of the web pages and stored their sorted word frequencies, you can then begin finding the answers to user queries. Continue requesting queries until the user decides to quit. For each query, check each URL for how relevant it is. Relevance is defined by the total number of times the query words appear on the web page. The more the words appear, the more relevant the page.

    You will need to show the results in sorted order from most relevant to least (be sure to exclude any results where the relevance score is 0). You are welcome to use python's built-in sort() method for lists. For example, if you create a results list structured like this:

    [[4, "url0"], [2, "url1"], [0, "url2"], [3, "url3"]]
    
    Then calling results.sort() will put the list in ascending order from lowest relevance to highest like this:
    [[0, "url2"], [2, "url1"], [3, "url3"], [4, "url0"]]
    
    You can then loop backwards through the list to report the best results first.

Optional: Create your own web page

Here are instructions for how to create your own basic web page to add to the search engine.

  1. Go to your home directory:
    cd
  2. Create a directory to store your web page (it must have the special name public_html):
    mkdir public_html
  3. Make the directory readable and executable by others:
    chmod a+rx public_html
  4. Move into this new directory:
    cd public_html
  5. Create the web page (it must have the special name index.html):
    emacs index.html
    Web pages can be quite fancy, but for now just make a very basic page as described in these html examples. Put some unique text about you on your page, save it, and exit emacs when you are done.
  6. Make your web page readable by others:
    chmod a+r index.html
  7. Test if you can see your page in a browser by typing in the address:
    www.cs.swarthmore.edu/~yourUserName
    That special character in front of your username is called a tilde. It looks like a curly dash on the keyboard.
  8. If you want to edit your web page, just use emacs to re-open it and change whatever you like, and then save and exit emacs.
  9. Once you are happy with your web page you can add your username to the list of usernames in the webQueries.py program that will be used to retrieve web pages for the search engine. Then re-run the program and try finding information that you included on your web page.

Submit

Once you are satisfied with your program, hand it in by typing handin21 in a terminal window.