CS21: Lab 10

For this lab you will write one program, filter.py, that filters a text file by removing words entered by the user or by removing the most popular words in the text.

First, run update21, if you haven't already, to create the cs21/labs/10 directory. Then cd into your cs21/labs/10 directory to begin working on your program.

This lab is not specifically focused on design or testing methods, but you should continue to use good practices such as top-down design, incremental testing, and writing well-documented code.

Introduction

In this lab you will write a program that filters text in a file by removing words from it. The English language is quite redundant, so even after removing a number of popular words, the text is often quite understandable.

When the program starts up, you will display a welcome message to the user and then ask the user to enter a file to filter. If the file doesn't exist, you will prompt them again to enter a file until a valid file is entered.

Then you will present them with a menu of 4 options. The first option is to filter the file with words entered by the user. The second option is to go through the file counting how many times each word appears and show the user the list of words and their counts. The third option is to use the word counts to filter the file by the most popular words in the file. For this option you will prompt the user to select how many of the top words to use as the filter. The fourth option is to quit the program, giving a goodbye message.
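
The two prompts described above (a file name that must exist, and a menu choice that must be one of the listed options) could be validated along these lines. This is only a sketch, assuming Python 3; the function names and the read parameter are made up for illustration (read just makes the functions easy to test without typing), and your own design may look quite different.

```python
import os


def get_filename(read=input):
    """Keep asking until the user names a file that exists."""
    name = read("Which file would you like to filter? ").strip()
    while not os.path.isfile(name):
        print("Sorry, I can't find '%s'. Please try again." % name)
        name = read("Which file would you like to filter? ").strip()
    return name


def get_choice(low, high, read=input):
    """Keep asking until the user enters a whole number from low to high."""
    while True:
        answer = read("Choice? ").strip()
        # isdigit() is False for "", "abc", "-1" and "2.5", so int() is safe
        if answer.isdigit() and low <= int(answer) <= high:
            return int(answer)
        print("Please enter a number from %d to %d." % (low, high))
```

With the default read=input, get_choice(1, 4) re-prompts on anything other than 1, 2, 3, or 4.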

Sample Output

Here's a sample run showing output from the poem "Stopping by Woods on a Snowy Evening".

$ python filter.py
Welcome to the text filtering program.

Which file would you like to filter? /usr/local/doc/text/frost.txt
What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 1
Enter the words you would like filtered from the text.
Words: snow woods sleep
Here's your filtered text:

Whose these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his fill up with

My little horse must think it queer
To stop without a farmhouse near
Between the and frozen lake
The darkest evening of the year.

He gives his harness bells a shake
To ask if there is some mistake.
The only other sound's the sweep
Of easy wind and downy flake.

The are lovely, dark and deep,
But I have promises to keep,
And miles to go before I
And miles to go before I

What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 2

Word            Count
---------------------
the             7
to              6
and             5
i               5
woods           4
his             3
a               2
are             2
before          2
go              2
he              2
is              2
miles           2
of              2
sleep           2
think           2
ask             1
bells           1
between         1
but             1
dark            1
darkest         1
deep            1
downy           1
easy            1
evening         1
farmhouse       1
fill            1
flake           1
frozen          1
gives           1
harness         1
have            1
here            1
horse           1
house           1
if              1
in              1
it              1
keep            1
know            1
lake            1
little          1
lovely          1
me              1
mistake         1
must            1
my              1
near            1
not             1
only            1
other           1
promises        1
queer           1
see             1
shake           1
snow            1
some            1
sounds          1
stop            1
stopping        1
sweep           1
there           1
these           1
though          1
up              1
village         1
watch           1
whose           1
will            1
wind            1
with            1
without         1
year            1

What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 3
How many of the top words to filter? 15
Here's your filtered text:

Whose these think know.
house in village though;
will not see me stopping here
watch fill up with snow.

My little horse must think it queer
stop without farmhouse near
Between frozen lake
darkest evening year.

gives harness bells shake
ask if there some mistake.
only other sound's sweep
easy wind downy flake.

lovely, dark deep,
But have promises keep,



What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 4
Goodbye!


Requirements

For this lab we will leave the design and implementation up to you, but be sure to use good top-down design to write well-organized code. The requirements for your program are shown below.

All input from the user must be validated.
When searching in the text, you should ignore any punctuation at the start or end of the words, but keep any punctuation in the middle of a word.
Matching words for counting and filtering is case insensitive, so for example, 'A' and 'a' should be considered the same word.
You don't need to preserve multiple spaces between words in the text. If there are several spaces between two words, you are allowed to replace the multiple spaces with a single space in your filtered text.
When filtering the text, keep the same newlines as the original text, and keep the punctuation and capitalization the same in the words you are not filtering. If a word to be filtered has punctuation attached to it, you can remove both the word and its punctuation.
When displaying the list of word counts to the user, the list must be sorted by word count. You must write your own sort function to do this. You cannot use the Python functions sort() or sorted(). You can implement any of the sorting algorithms we discussed in class. For words that have the same word count, show them in alphabetical order.
Your code should build the word count list as efficiently as possible. There are a couple of ways to go about this.
1. One way is to convert the text file to a list of words as they appear in the text, but converted to lowercase and stripped of left and right punctuation. Then you can sort your list of words. Once the list is sorted, the same words will be grouped together. You can then get the counts of the words with one pass of the list.
2. The other way is to add words to the count list as they appear in the text. For each word in the text, look that word up in your count list. If the word is there, increase the count of that word by one; otherwise add that word to the list with a count of one. Since you will be searching the list for every word in the text, you should make sure the list is ordered and use a binary search on the list. To help with this, your binary search should return a list of two items [index, found]. If the word is in your list, index is its location in your list and found is True. If the word is not in the list, index is the location where the word should be added to the list, and found is False. To actually add the new word to the list, use the index returned from your binary search, and use the insert() list method to add it at the right location.
When sorting the list by word count, you will be sorting the list from largest to smallest. You'll have to adjust your sorting algorithm accordingly.
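
To make the second approach and the largest-to-smallest sort more concrete, here is one possible sketch. It assumes the count list is a list of [word, count] pairs and that the words have already been lowercased and stripped of punctuation. The names binary_search, count_words, and sort_by_count are made up for this example, and selection sort is just one of the algorithms you could choose.

```python
def binary_search(counts, word):
    """Search an alphabetized list of [word, count] pairs.

    Returns [index, found]: where the word is, or where it belongs.
    """
    low = 0
    high = len(counts) - 1
    while low <= high:
        mid = (low + high) // 2
        if counts[mid][0] == word:
            return [mid, True]
        elif counts[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    # low is now the last mid or mid+1, i.e. the insertion point
    return [low, False]


def count_words(words):
    """Build an alphabetized list of [word, count] pairs."""
    counts = []
    for word in words:
        index, found = binary_search(counts, word)
        if found:
            counts[index][1] += 1
        else:
            counts.insert(index, [word, 1])
    return counts


def sort_by_count(counts):
    """Selection sort in place: largest count first, ties alphabetical."""
    for i in range(len(counts)):
        best = i
        for j in range(i + 1, len(counts)):
            bigger = counts[j][1] > counts[best][1]
            tie = counts[j][1] == counts[best][1]
            if bigger or (tie and counts[j][0] < counts[best][0]):
                best = j
        counts[i], counts[best] = counts[best], counts[i]
```

Because selection sort does not preserve the original order of equal items, the comparison checks the word itself whenever two counts tie.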

Implementation Tips

You can strip several different characters at once by giving strip() a string of all the characters to strip.
The string library has a punctuation string that may be helpful. To use it, add: from string import punctuation
For checking if a word in the text is in the list of words you should filter, it is okay to use the Python in operator.
If you are building the count list by adding one word at a time and the word is not found by your binary search, the proper index to add the new word will be either the value of mid or mid+1, depending on whether the word you are adding comes before or after the word at index mid.
There are several files in /usr/local/doc/text for you to try. Try your program with some of the shorter texts first to make it easier to check if your program is correct.
- mary.txt -- The poem "Mary had a Little Lamb".
- declaration.txt -- The Declaration of Independence.
- gettysburg.txt -- Lincoln's Gettysburg address.
- romeo.txt -- A speech from "Romeo and Juliet".
- frost.txt -- "Stopping by Woods on a Snowy Evening" by Robert Frost.
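
As a small illustration of the strip and punctuation tips above, the sketch below filters a single line of text. The helper names normalize and filter_line are made up for this example; calling filter_line once per line of the file is one way to keep the original newlines.

```python
from string import punctuation


def normalize(word):
    """Lowercase a word and strip punctuation from both ends only."""
    return word.strip(punctuation).lower()


def filter_line(line, unwanted):
    """Return line without the words whose normalized form is in unwanted."""
    kept = []
    for word in line.split():
        # compare the cleaned-up word, but keep the original spelling
        if normalize(word) not in unwanted:
            kept.append(word)
    # split() + join() collapses runs of spaces, which the lab allows
    return " ".join(kept)
```

For example, filter_line("To watch his woods fill up with snow.", ["snow", "woods", "sleep"]) gives "To watch his fill up with", matching the sample run.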

Hacker Challenge

Here are some ideas for you to extend the lab.

With a large enough text, the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, etc. This is known as Zipf's law. Use matplotlib to plot the word counts for the texts we have given you and see how closely they follow Zipf's law.
For longer texts, the word count list can be quite long and will scroll off the screen. Modify your program to show the word counts one screen at a time, having the user hit the 'enter' key to show the next screen. Here is some magic Python code to give you the number of lines in the window your program is running in:
import os
lines = int(os.popen('tput lines', 'r').readline())
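
For the Zipf's law idea above, one way to start is to compute the counts the law would predict from the top word's count and plot them next to your real counts. This is a rough sketch, not a required design: the function names are made up, it assumes your counts are already sorted from largest to smallest, and the log-log axes are just one reasonable choice.

```python
def zipf_prediction(top_count, how_many):
    """Counts Zipf's law predicts for ranks 1..how_many."""
    predicted = []
    for rank in range(1, how_many + 1):
        # the word at rank r should appear about top_count / r times
        predicted.append(top_count / rank)
    return predicted


def plot_zipf(counts):
    """Plot observed counts (largest first) against the prediction."""
    # imported here so zipf_prediction can be used without matplotlib
    import matplotlib.pyplot as plt

    ranks = list(range(1, len(counts) + 1))
    predicted = zipf_prediction(counts[0], len(counts))
    plt.loglog(ranks, counts, "o", label="observed")
    plt.loglog(ranks, predicted, label="Zipf prediction")
    plt.xlabel("rank")
    plt.ylabel("count")
    plt.legend()
    plt.show()
```

On log-log axes the prediction is a straight line, so it is easy to see where the observed counts drift away from it.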

Submit

Once you are satisfied with your program, hand it in by typing handin21 at the Linux prompt.

You may run handin21 as many times as you like, and only the most recent submission will be recorded. This is useful if you realize, after handing in some programs, that you'd like to make a few more changes to them.
