Lab 8: Efficient Keyword Search

Due on Wednesday, November 28 at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

We will be using Teammaker to form teams. You can log in to that site to indicate your preferred partner. Once you and your partner have specified each other, a GitHub repository will be created for your team. If you have any trouble using Teammaker, contact your instructor.

Overview

Imagine that you are reading a large PDF document, such as our cs35 textbook by Shaffer, and you want to find the most relevant pages where priority queues are described. If you search for those keywords in the document, you will most likely be given a sequential list of pages where they occur, rather than a prioritized list, where pages with the most occurrences are first.

In this lab you will write a program that reads in documents, and then repeatedly searches the document for particular keywords. For each search, the program efficiently finds the ten most relevant pages that contain that keyword.

To accomplish this you will build two new implementations of the Dictionary ADT:

  1. LinearDictionary
  2. HashTable

Then you will create a KeywordSearcher class that employs your Dictionary implementations along with an existing implementation of a PriorityQueue to quickly search documents for keywords.

Starting Code

Your starting code can be found in the appropriate repository for your team. The following is a description of the repository’s initial comments; bolded files are those which you must change in completing the lab.

Part I. Implementing LinearDictionary

We recommend that you implement a LinearDictionary as a vector of key, value pairs. All of the operations should run in \(O(n)\) time. You may be thinking: “Why would I want such an inefficient implementation of the Dictionary ADT?” The answer is that, for very small dictionaries, simpler implementations perform much better as they have a smaller constant overhead.

We will use the LinearDictionary to represent the chain of collisions in our HashTable. Because we expect each chain to be relatively short, the LinearDictionary provides a nice abstraction that will be efficient enough for our purposes.

Unit tests have been provided for this lab; once you have completed your LinearDictionary implementation, the appropriate unit tests should pass.

Removing from a vector

In order to implement the remove operation of your LinearDictionary, you’ll need to learn about how to remove from a vector.

The vector class provides a means to remove an element from the end of a list (pop_back) but doesn’t give a clear interface for removing elements e.g. from an arbitrary index. Vectors provide a general function called erase which can delete entire contiguous regions:

vector<int> vec = ...;
vec.erase(vec.begin() + start_index, vec.begin() + end_index);

The above function works much like a Python slice: the start index is included but the end index is not. Thus, remove_range_from_vector(1,3) would remove the second and third elements (indices 1 and 2) from the vector. If you want to delete a single element from a vector, you can do something like this:

vector<int> vec = ...; // whatever defines this vector
int index = ...; // e.g. 4
vec.erase(vec.begin() + index, vec.begin() + index + 1);

Part II. Implementing HashTable

Hash Table Structure

Once your LinearDictionary implementation is complete, you can begin implementing a chaining hash table as we discussed in class. (You will not implement a linear probing hash table.) The figure at right shows an example of a chaining hash table with size 3, capacity 4, and storing integer keys with string values.

Here are some important details about this class:

Once you have completed your HashTable implementation, the appropriate unit tests should pass.

Part III. Keyword Searcher

$$@@$$PAGE: 1
cows say moo
$$@@$$PAGE: 2
dogs say bark
$$@@$$PAGE: 3
say do trees say bark
cows
11
do
31
dogs
21
trees
31
moo
11
bark
21
31
say
11
21
32
Document test_data/tiny.txt
KeywordSearcher

Once your implementations of the LinearDictionary and HashTable are complete, you can begin implementing the class KeywordSearcher. The figure at right shows an example document that is available in the test_data directory called tiny.txt. Notice that each new page is marked by a special set of characters: $$@@$$PAGE:. You can assume that all documents that you read in will follow this format. The figure also shows a conceptual view of the kind of data structure that your KeywordSearcher should construct. This class will use a hash table keyed on the unique words within a document. Associated with each word is another structure that tracks the page numbers where that word occurs and the number of times it appears on that page.

Here are some important details about this class:

Once you have completed your KeywordSearcher implementation, the appropriate unit tests should pass.

Part IV. Main Program

The final part of your lab is to implement the main program, which will read in a document as a text file, and allow the user to repeatedly search the document for particular keywords. For each keyword search, it returns the top ten pages within the document that contain that keyword the most times. If the word doesn’t appear on at least ten pages, then it should return as many pages as it can. It should also gracefully handle the case where the word does not appear at all within the document.

For instance, here is a sample run of the program done on the the contents of the textbook for this class:

./keywordSearch test_data/DataStructuresAndAlgorithmAnalysis.txt
File loaded

Please enter a search word or '!' to quit: priority
The word priority appears in the file...
   on page 220 (11 occurrences)
   on page 197 (8 occurrences)
   on page 204 (7 occurrences)
   on page 206 (5 occurrences)
   on page 421 (5 occurrences)
   on page 612 (3 occurrences)
   on page 422 (2 occurrences)
   on page 203 (2 occurrences)
   on page 425 (2 occurrences)
   on page 543 (2 occurrences)

Please enter a search word or '!' to quit: queue
The word queue appears in the file...
   on page 150 (23 occurrences)
   on page 148 (22 occurrences)
   on page 149 (12 occurrences)
   on page 152 (9 occurrences)
   on page 153 (8 occurrences)
   on page 151 (8 occurrences)
   on page 417 (7 occurrences)
   on page 147 (7 occurrences)
   on page 166 (5 occurrences)
   on page 220 (5 occurrences)

Please enter a search word or '!' to quit: swarthmore
The word swarthmore does not appear in the file.

Please enter a search word or '!' to quit: !

Goodbye!

Memory

For this lab, your program is required to run without memory errors or leaks. You should use valgrind as you proceed in your testing to track memory errors. When you have a complete first draft of your implementation:

Coding Style Requirements

You are required to observe some good coding practices:

Well-dressed people (style)

Summary of Requirements

When you are finished, you should have