Data Structures and Algorithms

Lab 7: Word Count

Due on Sunday, March 27th at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

For this lab, you will implement binary search tree (BST) data structure; you will then use it to implement a word-counting tool. Like the last team lab, your repository URL will be git@github.swarthmore.edu:cs35-s16/lab07-<team-name>. You can see a list of all of your repositories on the Swarthmore GitHub in the lower-right side of the main page; make sure to switch your view to the organization for this class to see them.

Starting Code

Abacus

This lab contains several starting files. The bolded filenames below are the files you will need to change to complete the lab.

Implementing a Binary Search Tree

Most of the work of this lab is implementing the binary search tree, which will happen in linked_bst__definition.h. The LinkedBST class is a template class with two templated type names: K (the type of the keys in the BST) and V (the type of the values). The LinkedBST class itself tracks only two pieces of data:

A single node stores information about its key, its value, and its left and right children. If a node does not have a child, then the corresponding pointer variable will be set to NULL.

Many of the methods in the binary search tree are natural to implement recursively. For this reason, there are a great many private methods in the starter code which word over some arbitrary subtree (represented by the top node in that subtree) which can call themselves. For instance, the height of a node is equal to one more than the largest of the heights of its two children. (The max function can be found in the C++ algorithm library.) Most of the public methods of LinkedBST will be single-line calls to these private, recursive methods to begin the operation at the BST’s root node.

Implementation Strategy

Blueprints

Although most of the methods in this lab are fairly simple, there’s quite a lot to do. Below is an approach to writing these methods which should organize your efforts fairly well. You’re not required to use this approach, but it might help.

  1. Get started on LinkedBSTNode. 1. Implement the constructors and destructor of LinkedBSTNode. Also implement the getters and setters. This should be very simple and easy to complete quickly. 2. Implement the subtree_height method of LinkedBSTNode. This will be a good introduction to writing recursive helper methods. 3. See if the provided unit tests for subtree_height method pass now. Make sure to read over those provided unit tests; they build a tree of LinkedBSTNode objects and then check properties of them. You’ll need to write several such methods of your own. 4. Implement the subtree_min_key and subtree_max_key methods. Write tests for them. 5. Implement the find_node method. Use find_node to implement subtree_contains and subtree_get. 6. Write unit tests for subtree_contains and subtree_get. Don’t implement any more methods until these are working. 7. Implement and test subtree_put. Note that this method is not as simple as calling find_node: the key might not exist anywhere in the tree yet. You’ll have to find the appropriate place to put the data and, if necessary, create a new LinkedBSTNode for it.
  2. Switch to LinkedBST and use the methods you have written. 1. Implement the constructor, destructor, and get_size and is_empty methods. These should be trivial. 2. Implement get_height, get_min_key, get_max_key, contains, and get. These should be easy to write and very similar to each other; each just checks to see if the BST is empty and either performs a default action (if the BST is empty) or calls one of the methods of LinkedBSTNode (if the BST is not empty). 3. Implement the put method. This is similar to the above, but be sure to use the return value of the helper function to update your BST’s size appropriately. 4. Write tests for the above methods. Unlike your LinkedBSTNode tests, you should now be able to create the LinkedBST within your unit tests, using the e.g. put method to create the LinkedBST (instead of creating the tree by hand).
  3. Write the remaining methods simultaneously. 1. Implement find_parent_node on LinkedBSTNode and write some simple tests for it. Test it with several keys: the root’s key, other keys in the tree, and keys that are not in the tree. 2. Implement subtree_remove on LinkedBSTNode. This is probably the hardest method to implement in this lab; make sure to consider each of the different cases of removal (key not found, key found and node has one child, key found and node has two children, etc.). 3. Test subtree_remove. Make sure you write code that triggers each of the different removal cases. 4. Implement remove on LinkedBST using subtree_remove. There isn’t a lot of work to do here, but you have to handle the empty tree and singleton tree cases specially. 5. Test remove on LinkedBST. 6. Implement each of the traversal methods on LinkedBSTNode. 7. Implement and test each of the traversal methods on LinkedBST. 8. Implement the level-order traversal of LinkedBST. (There is no LinkedBSTNode helper method to go along with it.)
  4. Now that you’ve implemented the LinkedBST, use it to write your word-counting program.

Traversals

The LinkedBST class must provide implementations of traversal methods: methods which allow us to see every piece of data in the tree. A traversal is given by a list of key-value pairs; we will represent such a pair as a Pair object. A Pair is just a container that stores two pieces of data. For instance, a Pair<int,string> is just an object with two fields (an int and a string) and appropriate getter and setter methods. Since a traversal is a list of such pairs, we write the return value of the traversal methods as List< Pair<K,V> >* (where K and V are the key and value of the LinkedBST).

Example BST

The BST interface supports four different kinds of traversal:

The first three forms of traversal are quite similar and are natural to define recursively. The level-order traversal is somewhat different and requires a breadth-first search approach similar to that which you used in the previous lab.

Wikipedia has some nice pictures of these traversals for reference. Note that your program is adding the elements to a list rather than “displaying” them as described on that Wikipedia page.

A note about nested template types

You may have observed that the traversal type above was written List< Pair<K,V> >* with spaces around the “Pair” type argument. This is intentional; if we remove the spaces (“List Pair<K,V>>*”), the C++ parser can get confused about whether >> is two closing angle brackets or a single >> operator (as with ifstream). We include the spaces to make sure this doesn’t happen.

Counting Words

The interface for your word-counting program will be quite simple:

wordcount <file-to-read> <result-file> <traversal>

Here, <traversal> is one of pre, post, in, or level. Your program should read each word in the input file, use the LinkedBST to tally up how many of each word it contains, and then write the resulting traversal to an output file with one pairing per line. For instance, suppose document.txt contains the following:

It is easier to ask forgiveness than it is to get permission.

If we run

wordcount document.txt results.txt in

then we would expect results.txt to contain:

1 ask
1 easier
1 forgiveness
1 get
2 is
1 it
1 It
1 permission
1 than
2 to

Note that it and It count as different spellings; you should not change capitalization for this assignment. Note also that permission does not include a period at the end of the word; you should eliminate any non-alphabetic characters from the beginning and end of each word as you read it from the file. You can use the substr method of string to accomplish this.

Testing

As usual, you have been provided RUN_MAIN and CHECK_FILES_EQUAL. You should write tests in tests.cpp for your LinkedBST data structure as well as for your wordcount program.

Memory

Your program is required to run without any memory leaks or errors; you must free up all of the memory that you allocate and you must only use memory that you have allocated and initialized. We will use valgrind to determine if you have done this correctly, so you should as well. You should also consider using valgrind to find bugs in your program as you proceed; it’s one of two very powerful C++ debugging tools that we’ve discussed.

Coding Style Requirements

You are required to observe some good coding practices:

Well-dressed people (style)

Peer Review

As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.

Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose points on your lab. Please don’t forget!

Summary of Requirements

When you are finished, you should have