Please read through the entire lab before starting! More than most other labs you've done so far this semester, you will benefit from reading this lab through completely before beginning.
As always, run update21 to create your
cs21/labs/08 directory and create your program for lab 8 in this directory. Your code will be stored in a file called
Note that this is part one of a two-part lab, but each part will be turned in and graded separately.
The goal of the two-part lab is to learn how to do some basic comparisons between literary texts. We will start in Part 1 by working with a single text, learning how to read in the text file, count the words in the file, and show the context in which specific words occur. In Part 2, we will work with two texts, examining the most common words in both texts and identifying the words that are much more "standout" or "distinctive" in one text than in the other. If we were literary scholars, we might be able to use the output of this program to make speculative literary arguments about these texts. We could then examine the texts more closely by seeing how each of these more-"distinctive" words is used in the context of each text.
Although we may not be literary scholars, this lab is being done in conjunction with English 35: Rise of the Novel, taught by Professor Rachel Buurma. In ENGL 35, Professor Buurma's students will be using the solutions you write to examine texts that they have been reading in their class and compare them to a larger body of texts that they haven't read. They will interpret the similarities and differences they notice in order to construct an argument about these characteristics in the context of theoretical and historical approaches to the eighteenth- and nineteenth-century novels. They'll also gain some insight into the computational processes through which these textual features were extracted and what the assumptions, limitations, and further possibilities of such computationally-created representations of literary texts might be.
In part 1 of the lab, you will use the searching techniques that we have learned about in class to quickly count how often every word in the corpus occurs. Once you've counted all words in the corpus, you will report the most frequent word in the corpus and its count. Then, you will ask the user to enter a word that they'd like more information about. Your program will tell the user how often that word occurred and then show them each of the occurrences of that word in the context that it appeared in the text. You will continue to ask the user for more words to show until they press Enter without inputting a value. At that point, the program ends. Here are some examples:
You have some freedom in how you want your program to look. Here are our requirements:
Although you will not be submitting your top-down design (TDD) this week, we strongly encourage you to write your full design first. Design your
main() function and any helper functions. Your
main() function should be relatively short and easy to read. Implement and test the program one function at a time. A good program design and good use of functions is part of the grade.
/data/cs21/novels, you will find a lot of text files. (To list these files, type
ls /data/cs21/novels.) Most of the files in this directory are novels. For example,
shelley_frankenstein.txt is Mary Shelley's 1818 novel "Frankenstein". There are also two small example texts:
taylor_twinkle.txt, which is Jane Taylor's 1806 poem "Twinkle, Twinkle, Little Star", and
anonymous_beer.txt, the folk song "99 Bottles of Beer". You can open any of these files in
atom to read through them and get a sense of the texts we'll be working with in this lab. (To open Frankenstein you can type
atom followed by the path to the file.) Take a look at the contents of the file and notice that there are lowercase and capitalized letters, a lot of punctuation and newlines, and some numbers. Let's say we wanted to count how many times the word
morning occurred in the text. In
atom you can search for the word
morning. The first time you find the word it appears in the phrase "...lay to until the morning, fearing to encounter..." The next time it appears: "...In the morning, however...". Searching more, you'll find "...at morning's dawn...", "...one in the morning; the rain..." and "Morning, dismal and wet". When we count how many times "morning" appears in the text, we're going to want to count each of those occurrences.
To simplify the problem, we will read in the text, lowercase it, then keep all the alphabetic characters and turn everything else into a space character. This will lead to a "cleaner" version of the text where it will be easier to count occurrences of each word. As an example, here is the list of words obtained after cleaning the
taylor_twinkle.txt text and then splitting it into words.
To get this cleaned version of the text, you might want to write a function that takes a single line of text as a parameter and returns the cleaned version of that line. In that function, you can use an accumulator to build a new cleaned string from the original line: for every character in the original line, if the character is alphabetic, you'll add it to the cleaned string; otherwise, you'll add a space character (
' ') to the cleaned string.
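One possible sketch of such a cleaning function (the name clean_line is our own choice, not a requirement of the lab):

```python
def clean_line(line):
    """Return a lowercased copy of line in which every
    non-alphabetic character is replaced by a space."""
    line = line.lower()
    cleaned = ""            # accumulator for the cleaned string
    for ch in line:
        if ch.isalpha():
            cleaned += ch   # keep alphabetic characters
        else:
            cleaned += " "  # everything else becomes a space
    return cleaned
```

For example, clean_line("Twinkle, twinkle!") returns "twinkle  twinkle " (the comma and the exclamation point each become a space), and calling split() on that result gives ['twinkle', 'twinkle'].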
Recall that you can split a string into a list of words by using the
split method. Here is an example:
>>> s = 'twinkle twinkle little star'
>>> s.split()
['twinkle', 'twinkle', 'little', 'star']
Now that you've read in all the words in the file and created a clean version of the text, you can count each of the words. Working with the Twinkle text, we could easily use
len to find out that there are 118 words in total. However, what you want to be able to find out is that the word
the occurs 11 times, the word
twinkle occurs 7 times, and the word
diamond appears just one time.
The data structure we are going to use to store the counts will look like this:
>>> counts
[['a', 1], ['above', 1], ['and', 2], ['are', 2], ['as', 1], ['blazing', 1], ... ]
Each element of the list will contain two items: a word and the frequency with which the word occurs. We will keep this list sorted alphabetically by word. One reason to do that is so that if we wanted to look up the number of times that
blazing occurs in the text, we could use binary search to quickly find out that it occurred once. Another reason for storing the words alphabetically is that it will allow us to build the list of counts one word at a time.
Let's walk through how to build up this list using a shorter text,
anonymous_kid.txt. This is just a silly sentence that we wrote for the purposes of testing your lab. Here's the sentence: "The kid in the suit is in the car by the house." When we clean the text and read it into a list, we'll get a list like the one shown below (which we'll call
text so we can refer to it later).
>>> text
['the', 'kid', 'in', 'the', 'suit', 'is', 'in', 'the', 'car', 'by', 'the', 'house']
We'll also initialize an empty list called
counts where we'll store the word counts. Recall that this
counts list will be a list of lists like the one shown above. Let's assume that we've already counted the first 3 words in the
text and our
counts look like this:
>>> counts
[['in', 1], ['kid', 1], ['the', 1]]
We now need to count the fourth word in the
text which is the word
the. Using binary search, we can efficiently find that the word
the is already in the
counts list. Since we now see another
the, we'll change the count from 1 to 2 so our
counts now looks like this:
>>> counts
[['in', 1], ['kid', 1], ['the', 2]]
The next word in the
text is the word suit. When we do a binary search in the
suit. When we do a binary search in the
counts list for the word
suit, we fail to find it. This means that we've now seen
suit for the first time, and that means that we want
['suit', 1] to appear in our
counts. In order for binary search to work, we'll need to make sure that we put
suit into the
counts list in sorted order. Fortunately, when the binary search algorithm ends without finding the item, the value
low (from the implementation of binary search) tells us where
suit should go in the list. In this case, when binary search fails,
low is 2. (Trace through binary search if you aren't sure that this is so.) Now we can use the
list.insert(index, object) method to insert
['suit', 1] at index 2:
>>> low
2
>>> word
'suit'
>>> counts.insert(low, [word, 1])
>>> counts
[['in', 1], ['kid', 1], ['suit', 1], ['the', 2]]
We will continue with this algorithm, using binary search to locate each word from the
text in our
counts list and either incrementing the count by 1 (as we did for
the) or inserting a new list with
[word, 1] (as we did with
suit). When we are all done, we'll have the count for every word in the text. Here's the output we get for the anonymous_kid.txt example:
[['by', 1], ['car', 1], ['house', 1], ['in', 2], ['is', 1], ['kid', 1], ['suit', 1], ['the', 4]]
Try to do the
anonymous_kid.txt example by hand to make sure you understand how binary search and insert are working to create the
counts list. You might want to convince yourself that the algorithm works when starting from an empty list and building up the whole list.
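As one possible sketch of the whole counting algorithm described above (the function names and the (found, low) return style are our own choices; your design may differ):

```python
def binary_search(counts, word):
    """Search the sorted counts list for word.
    Return (True, index) if word is found, otherwise (False, low),
    where low is the insertion point that keeps counts sorted."""
    low = 0
    high = len(counts) - 1
    while low <= high:
        mid = (low + high) // 2
        if counts[mid][0] == word:
            return True, mid
        elif counts[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    return False, low

def count_words(text):
    """Build an alphabetically sorted list of [word, count] pairs
    from a list of words."""
    counts = []
    for word in text:
        found, index = binary_search(counts, word)
        if found:
            counts[index][1] += 1            # seen before: bump the count
        else:
            counts.insert(index, [word, 1])  # new word: insert in sorted order
    return counts
```

Running count_words on the anonymous_kid.txt word list produces the counts shown above, including ['the', 4] and ['in', 2].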
Here are the counts from two other files for you to compare against:
Note that in Lab 08 the only thing we will be using the counts for is determining the most frequent word. Don't worry: we will use these counts extensively in Lab 09!
The second component of the lab is to allow the user to see how words are used in the context of the text. For example, a student investigating Shelley's "Frankenstein" might be interested in seeing how the word "monster" is used in the text. You will ask the user to enter a word and then show them (in a nicely formatted way) the 4 words before and 4 words after every occurrence of the word the user asked to see. (See the sample runs above for details.)
The basic idea of this algorithm is to start at the beginning of the
text list you started out with and do a linear search for the word the user is interested in. When you find the word, accumulate a string of the 4 words before the word and accumulate another string of the 4 words after the word you searched for. Use string formatting to make it all look nicely aligned.
The only tricky part of this is that sometimes the user might want to see a word that is close enough to the beginning or end that there aren't actually 4 words to show. You'll have to figure out how to make sure the program shows the right context without crashing!
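One way to handle that boundary case is sketched below. This version uses slicing and join for brevity where the lab suggests accumulating a string; either approach works, and the field width of 25 is an arbitrary choice:

```python
def context_lines(text, word):
    """Return one formatted line per occurrence of word in text,
    showing up to 4 words of context on each side."""
    lines = []
    for i in range(len(text)):
        if text[i] == word:
            # max(i - 4, 0) keeps the start of the slice from going
            # negative; a slice end past len(text) is truncated safely
            before = " ".join(text[max(i - 4, 0):i])
            after = " ".join(text[i + 1:i + 5])
            lines.append("%25s  %s  %-25s" % (before, word, after))
    return lines
```

For the first word of the text, before is simply the empty string, so the program shows whatever context exists without crashing.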
Hint: Remember that you can use string formatting to left or right justify text by using a minus sign before the number in the format string. For example:
>>> s = "hello there"
>>> t = "how are you"
>>> print("%20s | %20s" % (s, t))
         hello there |          how are you
>>> print("%-20s | %-20s" % (s, t))
hello there          | how are you
>>> print("%-20s | %20s" % (s, t))
hello there          |          how are you
>>> print("%20s | %-20s" % (s, t))
         hello there | how are you
After you have turned in the full program, please edit the
QUESTIONS-08.txt file in your
cs21/labs/08 directory and answer the questions in that file.
Don't forget to run
handin21 to turn in your lab files! You may run
handin21 as many times as you want. Each time it will turn in any new work. We recommend running
handin21 after you complete each program or after you complete significant work on any one program.