The main goals for this lab are to gain experience working with
hash tables and to evaluate real-world data structure
performance by running experiments. To that end, you will:
- Implement a hash table using separate chaining.
- Apply your hash table implementation to create a simple
program to
assist Scrabble
players.
- Compare the performance of your hash table implementation
to that of a BST by running a series of experiments.
As with most assignments for the rest of the semester, you will
be working with a partner.
Both partners should be present and working on the
code together. You will both be responsible for
understanding all concepts, so dividing and conquering is not an
option. The academic
integrity policy applies to the entire pair; you cannot work
or share code with anyone outside your partner.
You and your lab partner will share the same git repository,
which is named lab08-<user1>-<user2>.git.
Please access your lab by cloning the appropriate git
repository, e.g.:
$ git clone git@github.swarthmore.edu:CS35-s17/lab08-jbrody1-adanner1.git
Starting Code
Below is an overview of the files required for submitting this lab.
Those highlighted in blue will require implementation
on your part. Those highlighted in black are
complete and should not be modified except for comments.
- adts: A directory containing the C++
declarations of the ADTs you will use for this lab,
including dictionary.h.
- Makefile: the build instructions for this
lab.
- hashTable.h, hashTable-inl.h: the hash table
implementation files.
- hashFunctions.*: Simple hash functions for
integer and string keys, similar to hashes we saw in class. If
you decide to introduce new hash functions, add
them here.
- tests.cpp: Unit tests for your hash table
implementation.
- main.cpp: a simple Scrabble Assistant
application.
- scrabbleAssistant.*: a C++ class encapsulating
the functionality of the Scrabble assistant.
- WrittenLab.tex The $\LaTeX$ source for your
experimental analysis portion of the lab.
- experiment.cpp A program that runs
experiments comparing performance of different dictionary
implementations (hash table, binary search tree). We've coded the
experiment framework for you. You just need to implement a single
function in:
- experimentUtil.* This file contains a single
function called getExperimentKeys(int num_keys). The
experiment framework calls this function to get the keys to put
into the dictionaries during testing.
- experimentClasses.*, experimentHashes.*:
special C++ classes (and hash functions for those classes) you'll
use to run your experiments.
- manualTests.cpp: A "sandbox" test program that
you can use while you develop your program.
- stlBST.h, stlList.h, stlPriorityQueue.h,
stlQueue.h, stlStack.h: C++ standard template library
implementations of abstract data types.
Part I: Hash Table Implementation.
For the first part of this lab assignment, you will implement a hash table.
Unit tests have been provided; once your hash
table implementation is complete, these unit tests should pass.
Below are some implementation details and suggestions:
- Your hash table implementation must use separate chaining.
- Consider using an array of statically allocated
vectors for the buckets in your hash table. You could also
use a List or even a balanced BST. However, the
difference in performance should be minimal because you should aim
to have $O(1)$ elements in each bucket on average.
- Similar to an ArrayList, you will need to expand
capacity when there are too many elements in the hash
table. However, what counts as "too many elements" is different
with hash tables. Define the load factor of a hash table
as the number of elements in the table, divided by its capacity
(i.e., the numer of elements in the array). Your implementation
should expand capacity whenever the laod factor exceeds some
maximum load factor threshold.
- The HashTable class has two constructors. One allows the user
to select a maximum load factor; the other should provide a
reasonable default value (say 0.75).
- In either case, select a reasonable default table capacity, say capacity=5.
Expanding Capacity
If the load factor
ever exceeds the maximum load factor of your hash table, you will need
to expand the table. This is conceptually similar to expanding
capacity for
ArrayList or a
BinaryHeap. There are a
couple of differences. One is that
when to expand
differs here -- it depends on the load factor, not on whether the
array is full. A more subtle difference is that it's not enough to
just copy table entries over. Each key/value pair in the hash table
needs to be rehashed, because which bucket it goes into depends on the
capacity of the table itself! Below we itemize what tasks need to be
done to properly expand capacity.
- First, copy the oldCapacity and oldTable somewhere safe
- Second, double the capacity, and allocate a new table with the
new capacity.
- Third, go through each bucket in the oldTable (if only you had
saved a pointer to this somewhere safe!)
- For each bucket, look at each entry in that bucket's vector.
Rehash the key with the new capacity, and insert into the
corresponding bucket in the new table.
- Finally, delete the oldTable.
Remember, your hash table should
never have load factor
higher than the maximum load factor.
As soon as this happens,
you should expand capacity.
Removing From a vector
The
vector
class provides a means to remove an element from the
end
of a list (
pop_back) but doesn't give a clear interface for
removing elements from the middle. However, vectors to provide a general method called
erase whih can delete entire contiguous regions:
vector myVector = ...;
myVector.erase(myVector.begin()+start_index, myVector.begin()+end_index);
The above function works much like a Python slice: the start index is included, but the end index is not. e.g. erasing with start/end indices (1,3) would remove the
second and
third indices (i.e. indices 1 and 2) from the vector. If you want to remove a single element from the vector, you can do something like:
vector myVector = ...; // whatever defines/initializes the vector
int index = ...; // the index of the element you want to remove
myVector.erase(myVector.begin() + index, myVector.begin() + index+1);
For more information, consult the
vector documentation in
cplusplus.com.
Part II: Scrabble Assistant.
Once you have a working hash table, you can start writing your
Scrabble Assistant. We'll design the application in two parts:
a ScrabbleAssistant class and a
separate main program. The main program (which
we've written for you) provides a simple menu interface and makes
calls to the ScrabbleAssistant, which encapsulates all of the
Scrabble Assistant features. The ScrabbleAssistant class should
load an English-language dictionary (allow the user to specify a
dictionary, but use /usr/share/dict/words as a default) into
a hash table and support the following functions:
- spellCheck: given a word, this function should
return true if and only iff the word is in the
dictionary. Implementing this should be straightforward and
should run in O(1) time.
- anagramsOf: given a string of letters, this
function returns a vector of all anagrams (i.e., permutations of
these letters) which represent valid words For example, "cat"
and "act" contain exactly the same letters. This function
should also take constant time (aside from the cost
of creating or copying the vector of anagrams).
- findWords: given a string of letters, this function
returns a vector of all legal words the player could spell with
the given letters.
Implementing
anagramsOf and
findWords will
require some algorithm design. You're welcome to include any data
members you need for the ScrabbleAssistant, as well as any private
helper functions you'd like.
In order to make anagramsOf run in constant time, it
will be helpful to preprocess the word list in the
ScrabbleAssistant constructor and maintain a table of anagrams.
First, write a helper function to sort the letters in a
string. (e.g. sortString("cat") should return "act"). You don't
have to do this yourself; instead, use the std::sort
function provided in the algorithm library. Then,
create a dictionary that maps each sorted tring to a vector
of all words that sort to that string.
anagramsOf should provide all anagrams of a given set
of letters, but won't be enough to provide all legal plays given a
set of letters. To do this, you'll also need to evaluate all
legal plays using fewer letters as well.
Implementing findWords will require some clever design.
One way to implement findWords is to break down the
task into the following subtasks:
- powerSet: given a string, this should return
the power set of the string; that is, it should
return the set of all subsets of the given characters. The
following pseudocode implements powerSet recursively:
Function stringPowerSet(string)
If string length is 1 Then
Return ["", string]
Else
firstChar ← string[0]
substring ← string, except without first character
smallerSet ← stringPowerSet(substring)
result ← []
For Each string In smallerSet Do
result.add(string)
result.add(firstChar + string)
End For
Return result
End If
End Function
- Eliminating Duplicate StringsIf you use
the powerSet function, you will likely find at some
point that you have a vector of words that contain several
duplicates. It might be helpful to create a removeDuplicates function that eliminates any extra copies of a word. There are several methods of accomplishing this:
- Use a double-for loop that compares all pairs of words in the vector, looking for pairs that are the same. This will run in $O(n^2)$ time.
- First, sort the vector of words. Then,
make a linear scan over the list, eliminating adjacent
duplicates. This will take $O(n \log n)$ time using an
efficient sorting algorithm.
- Create a hash table with words as keys. Then, insert
each word in the hash table assuming it isn't already
there. This should take $O(1)$ time per word, or $O(n)$
overall time.
- Once you've written your powerSet and removeDuplicates function, you can combine them to find all legal words as follows:
- Find the power set of the given string
- Get all anagrams of each string in the power set
- Accumulate all the anagrams in a single vector
- Remove Duplicates
Once the above code is complete, your
ScrabbleAssistant will be complete!
Part III: Dictionary Experiments.
The Experiment
For the final part of this lab, you'll run some experiments comparing
the performance of several dictionary implementations, including a
BST, your hash table using a good hash function, and your hash table
using a suboptimal hash function. The experiment will add some number
of elements to each dictionary and then retrieve them. It reports the
average runtime for each insert and get operation.
There are two classes used in this experiment, both defined
in experimentClasses.
The StringIntGoodHash and StringIntBadHash classes
both contain a string and an int data member. These classes are
identical except for the hash functions, given
in experimentHashes. It is your responsability
to understand how these hash functions differ and exploit these
differences in the experiment.
Your Tasks
Most of this code has been written for you
in
experiment.cpp. Your only implementation task
is to implement
getExperimentKeys function
in
experimentUtil.cpp. This function is used by the
experiment to determine which keys to add to the dictionary. Your
task is to create a set of keys tailor-made to cause several
collisions for
StringIntBadHash. Note that every
key returned by
getExperimentKeys must be distinct;
the experiment will not run if you create duplicate keys.
Once you have written your getExperimentKeys
function, use the experiment program to produce a graph
much as you did for the mystery functions program
from Lab03. This program does not require
any arguments, so you can run it as e.g.:
$ ./experiment | gnuplot
This will both save a file
called
experiment.png as well as open a new window
showing you the results of your experiment. If you successfully
exploited the bad hash function, the runtime of the bad hash
experiment should be comparitively high.
After completing your experiment, create a brief
written summary that contains:
- Your experimental graph
- A description of how your getExperimentKeys
function exploits the bad hash function.
- A brief discussion of why the graph looks like it does, in
terms of big-O complexity of the different dictionary
implementations.
A couple of paragraphs should be sufficient. Please include a
.tex or .pdf of your writeup, as well as the png file you created.
If you provide .tex only and the grader cannot compile your
writeup, you will receive no credit for the writeup.
Memory Management
Your program is required to run without memory
errors; run valgrind to eliminate them. For this
lab, memory leaks are not a priority. Small
memory leaks are acceptable and will not significantly affect your
grade. While "lost" memory should not occur in large amounts
(e.g. millions of bytes), we encourage you to not spend hours and
hours tracking down every last memory leak.
Commenting and Coding Style Requirements
Graders will assign minor penalties for poor commenting and coding
style.
Please review the
CS35 Coding
Style for good C++ style.
Survey
When you have completed your lab, answer a few short survey questions
in the file
README.md.
Summary
When you have completed your lab, you should have
- A working hash table implementation using separate
chaining.
- A working Scrabble Assistant.
- A submitted PDF of your final WrittenLab, containing analysis of
the experiment(s) you ran.
- No memory errors!
- Little to no memory leaks!
- Code which conforms to the style guide above.
- A completed survey in README.md
Submit
Once you are satisfied with your code, hand it in
using git. Remember the add, commit, push
development cycle. You can push as many times as you like, but only
the most recent submission will be graded. You may want to
run git
status to confirm all modifications have been pushed.