Data Structures and Algorithms

Lab 8: King James Programming

Due on Sunday, April 3rd at 11:59 PM. This is a team lab. You and your assigned lab partner(s) will complete this lab together. Make sure that you are familiar with the Partner Etiquette guidelines. You may discuss the concepts of this lab with other classmates, but you may not share your code with anyone other than course staff and your lab partner(s). Do not look at solutions written by students other than your team. If your team needs help, please post on the Piazza forum or contact the instructor or ninjas. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

Robot

In this lab, you will modify an implementation of a binary search tree to create an AVL tree. Operations on AVL trees have better worst-case time complexity than operations on simple binary search trees. We will make use of these operations to construct a generator which, given a collection of input files, can create novel-looking, fairly realistic sentences. This lab is inspired by and named after the single-topic blog King James Programming, which uses similar (albeit more sophisticated) techniques to achieve a similar goal.

In comparison to previous labs, the number of lines of code you must add to the starting repository to get a completed, correct submission is very small. Those lines of code are conceptually weighty, however, and you can expect to spend a fair amount of time debugging your AVL tree data structure. Make sure to test thoroughly!

Your team’s repository can be found in the usual location: git@github.swarthmore.edu:cs35-s16/lab08-<team-name>.

Starting Code

The starting files for this lab include a complete implementation of a binary search tree. The following is a description of each file in the starter code; files you will need to modify are shown in bold.

Using Your Previous Code

Recycling Symbol

If you prefer, you may use the code you submitted for Lab 7 in this assignment. If you do so, you should bear the following in mind:

For most groups, it’s probably easier to read over the correct implementation of the BST provided in this lab than it is to adapt Lab 7 code for this assignment.

Implementation Strategy: The AVL Tree

The implementation of this lab can be accomplished in the following steps:

Blueprints
  1. Add a height field to the AVLTreeNode class to track nodes’ heights. * You’ll need to modify the constructor and the methods of AVLTreeNode to use this field and to keep it up to date. * The new height-related helper methods (such as left_subtree_height and recalculate_height) will be helpful here. * Specifically, recalculate_height should be called on any node whose height may have changed. * subtree_height should get much simpler! * Make sure to test your new height implementation. You should be able to copy tests from Lab 7 to help you make sure that e.g. subtree_height still works correctly.
  2. Add calls to rebalance_self to each place in the AVLTree where nodes are added or removed. * rebalance_self should be called on every node between the root and the modified node. This includes the places where e.g. subtree_put recurses. * For now, just assume that rebalance_self does the right thing: it rebalances the node if the node needs rebalancing. You’ll implement that method in a moment.
  3. Implement rotate_left and rotate_right. Then, implement rebalance_self. * Write this code very carefully; it may help to draw out diagrams of what you think is happening in each case. * Remember to consider the different cases of rebalancing. For each line of code, you should be able to describe which cases it handles. Are you in the LL case? The RL case? * Make sure to test this behavior thoroughly! It’s key to making the data structure perform correctly.
    • Write out valid trees.
    • Determine how they need to be rebalanced once a mapping is added or removed.
    • Write code to create AVLTreeNodes in the shape of that tree and perform the addition or removal.
    • Write tests to make sure the resulting AVLTreeNodes have the right keys and children in the right places.
    • Don’t forget to test cases in which the tree doesn’t rebalance!
  4. Implement check_invariants. * This method does nothing but check to make sure that the tree obeys the invariants we have established. * Call it from within your tests to verify that nothing has gone wrong with your tree operations!

Generating Sentences

The AVL tree implementation is the majority of the work in this lab due to how sensitive the implementation is. We’ll now use that structure to do something interesting: generating sentences from example data. We’ll accomplish this by reading several text files which have examples of sentences (most of which were obtained from Project Gutenberg) and teaching the computer to recognize basic patterns in the sentences it sees.

Trigrams

The Number Three

Broadly speaking, a trigram is a group of three ordered elements. Here, a trigram is three words that have been discovered in order in a document. For instance, the sentence

We have a lot of fun in the CS department.

contains trigrams like “have a lot”, “lot of fun”, and “the CS department.” Since we’re focusing on sentence generation, we’ll also imagine that the start and end of the sentence contain invisible symbols that begin or end sentences; that is, the above text also contains the trigrams “[[START]] We have” and “CS department. [[END]]”. We’ll use these invisible symbols to know where to start when generating sentences and when to stop adding more words. In our code, we’ll use the variables START_SENTENCE and END_SENTENCE rather than writing these strings by hand.

Generating Sentences with Trigrams

We can use trigrams to get a rough idea of which words can follow which other words in a sentence. From the above, we learn that the word “lot” can follow the words “have a” and that “department.” can follow the words “the CS”. Imagine that we combined the trigrams from the above sentence with the trigrams in the following sentence:

I saw a lot of penguins in the newspaper.

This tells us that “penguins” can follow “lot of” and that “in” can follow “of penguins”. We also knew from the previous sentence that “CS” can follow “in the”. Combining everything that we know about the trigrams in these two sentences gives us reason to believe that we can write

We have a lot of penguins in the CS department.

since every trigram in that new sentence can be found in one of the previous two sentences. We can visualize how each pair of words leads to a third word as follows:

The TrigramModel Class

To keep track of trigrams, you’ll implement the TrigramModel class that has been provided for you. This class has two important fields. One keeps track of “starting words”: those words we have seen immediately follow the “[[START]]” invisible symbol. We know that each starting word may start a sentence. This is merely a list of strings.

The other field keeps track of “following words”: those words which may follow two other words. To keep track of this, we use an AVL tree where keys are pairs of strings (the first two words in the trigram) and values are lists of strings (the possibilities for the third word in the trigram). We encapsulate this reasoning within the TrigramModel class so that we don’t have to remember how the TrigramModel keeps track of all of this information later.

Training the TrigramModel Class

After you implement the TrigramModel class, you will need to “train” it. Training is the process of providing information to this object so that it can be used to generate sentences.

You have been provided a function, read_strings_from_file, in kjp.h/kjp.cpp which will produce a list of all of the strings the named file. It will also add START_SENTENCE and END_SENTENCE symbols wherever it believes that sentences start or end. (This function is not perfect, but it’s sufficient for our uses in this lab.) Once you’ve read the list of strings from a file, you only need to call the add_sequence method of your TrigramModel for each trigram that appears in that list. This should cause your TrigramModel to learn which words can follow which other words.

Generating Sentences from a TrigramModel

Typewriter

Once you’ve trained your TrigramModel, the algorithm for generating a sentence is fairly easy:

  1. Create an empty string accumulator.
  2. Pick a random first word. Add it to the string accumulator.
  3. Remember this first word as the second position in a trigram; the first position of the trigram is initially START_SENTENCE.
  4. Pick a new random word for the third position of your trigram.
  5. As long as your trigram’s third position is not END_SENTENCE: 1. Add that symbol to your sentence accumulator. 2. Move each word of the trigram one position to the left (discarding the first word). 3. Pick a new random word for the third position.

Your Command-Line Interface

On the command-line, your program should accept a number of sentences to generate and any number of data files. (The user must provide at least one.) For instance, the command

./kjp 4 test_data/Moby_Dick.txt

should train a TrigramModel on the contents of Moby Dick and then generate four sentences. The command

./kjp 4 test_data/Moby_Dick.txt test_data/Crime_and_Punishment.txt

will train a single TrigramModel on both Moby Dick and Crime and Punishment and then generate four sentences. In order to convert the string "4" to the integer value 4, you can use the stoi function. This function throws an exception if the number is invalid, which you should be sure to catch so you can give an appropriate error message and exit your kjp_main function gracefully.

The test_data directory contains a variety of documents for you to use, although some excellent results can be found by combining Pride and Prejudice with Fairy Tales by the Brothers Grimm:

$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt

Then she asked the true golden bird; and it was an inexhaustible subject of Bingley.

His reception, however, was dead.

"I beg your pardon, madam, for your wife.'

Bear in mind that the technique we are using is not a clever one and so you’ll often wind up with grammatically incorrect or otherwise meaningless sentences:

$ ./kjp 3 test_data/Pride_and_Prejudice.txt test_data/Fairy_Tales_by_The_Brothers_Grimm.txt

On the way she was already dark night.

Collins.

Why then does this place him!"

Don’t worry about this; you can learn more sophisticated techniques in other classes. If you want to test your sentence generation logic, try combining the two very simple files in your test_data directory and see if you get their different permutations:

$ ./kjp 8 test_data/simple_document_1.txt test_data/simple_document_2.txt

Here is a language.

C++ is a language.

Here is a sentence.

C++ is a sentence.

Here is a sentence.

Here is a language.

C++ is a language.

C++ is a language.

Testing

You are not required to test your program with e.g. RUN_MAIN as in previous labs, but you should still write tests that exercise the functionality of TrigramModel. Some tests have been provided for you which should do this. You are also required to have a rich set of tests for your AVLTree and AVLTreeNode classes; indeed, those tests should be written before you move on to the sentence generation.

Memory

Your program is required to run without any memory leaks or errors; you must free up all of the memory that you allocate and you must only use memory that you have allocated and initialized. We will use valgrind to determine if you have done this correctly, so you should as well. You should also consider using valgrind to find bugs in your program as you proceed; it’s one of two very powerful C++ debugging tools that we’ve discussed.

Coding Style Requirements

You are required to observe some good coding practices:

Well-dressed people (style)

Peer Review

As this is a team assignment, you are required to complete a peer review of your lab partner or partners. You must do this even if everything is fine. You should complete this peer review after you have turned in your assignment. The peer review comes in the form of a simple Google Form and can be accessed here; in most cases, it will likely take less than a minute to complete. If you have trouble accessing this peer review, please make sure that you are logged into Google using your Swarthmore credentials.

Your peer review will be kept private; only course staff will have access to this information. You can also update your peer review after you have submitted it. However, if you do not submit a peer review, you will lose points on your lab. Please don’t forget!

Summary of Requirements

When you are finished, you should have