For this lab you will write one program, filter.py, that
filters a text file by removing words entered by the user or by
removing the most popular words in the text.
First, run update21, if you haven't already, to create the
cs21/labs/10 directory. Then cd into your cs21/labs/10
directory to begin working on your program.
This lab is not specifically focused on design or testing methods,
but you should continue to use good practices such as top-down
design, incremental testing, and writing well-documented code.
Introduction
In this lab you will write a program that filters text in a file by
removing words from it. The English language is quite redundant, so
even after removing a number of popular words, the text is often
quite understandable.
When the program starts up, you will display a welcome message to
the user and then ask the user to enter a file to filter. If the
file doesn't exist, you will prompt them again to enter a file until
a valid file is entered.
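The re-prompting loop described above can be sketched as follows. This
is only one possible approach, not a required design: the function
name get_filename, the ask parameter (which defaults to input so the
function can be tested without typing), and the error message are all
assumptions made for illustration.

```python
import os


def get_filename(ask=input):
    """Keep asking until the user names a file that exists."""
    filename = ask("Which file would you like to filter? ")
    while not os.path.isfile(filename):
        # Hypothetical error message; the lab does not specify one.
        print("Sorry, I can't find that file. Please try again.")
        filename = ask("Which file would you like to filter? ")
    return filename
```

The same while-loop pattern should work for validating the menu
choice and the number of top words.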
Then you will present the user with a menu of four options. The
first option is to filter the file with words entered by the user.
The second option is to go through the file, counting how many times
each word appears, and show the user the list of words and their
counts. The third option is to use the word counts to filter the
file by the most popular words in the file. For this option you
will prompt the user to select how many of the top words to use as
the filter. The fourth option is to quit the program, giving a
goodbye message.
Sample Output
Here's a sample run showing output from the poem "Stopping by Woods on
a Snowy Evening".
$ python filter.py
Welcome to the text filtering program.
Which file would you like to filter? /usr/local/doc/text/frost.txt
What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 1
Enter the words you would like filtered from the text.
Words: snow woods sleep
Here's your filtered text:
Whose these are I think I know.
His house is in the village though;
He will not see me stopping here
To watch his fill up with
My little horse must think it queer
To stop without a farmhouse near
Between the and frozen lake
The darkest evening of the year.
He gives his harness bells a shake
To ask if there is some mistake.
The only other sound's the sweep
Of easy wind and downy flake.
The are lovely, dark and deep,
But I have promises to keep,
And miles to go before I
And miles to go before I
What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 2
Word Count
---------------------
the 7
to 6
and 5
i 5
woods 4
his 3
a 2
are 2
before 2
go 2
he 2
is 2
miles 2
of 2
sleep 2
think 2
ask 1
bells 1
between 1
but 1
dark 1
darkest 1
deep 1
downy 1
easy 1
evening 1
farmhouse 1
fill 1
flake 1
frozen 1
gives 1
harness 1
have 1
here 1
horse 1
house 1
if 1
in 1
it 1
keep 1
know 1
lake 1
little 1
lovely 1
me 1
mistake 1
must 1
my 1
near 1
not 1
only 1
other 1
promises 1
queer 1
see 1
shake 1
snow 1
some 1
sounds 1
stop 1
stopping 1
sweep 1
there 1
these 1
though 1
up 1
village 1
watch 1
whose 1
will 1
wind 1
with 1
without 1
year 1
What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 3
How many of the top words to filter? 15
Here's your filtered text:
Whose these think know.
house in village though;
will not see me stopping here
watch fill up with snow.
My little horse must think it queer
stop without farmhouse near
Between frozen lake
darkest evening year.
gives harness bells shake
ask if there some mistake.
only other sound's sweep
easy wind downy flake.
lovely, dark deep,
But have promises keep,
What would you like to do?
1. Filter file with selected words
2. Show the word counts of the file
3. Filter file by the most popular words in the file
4. Quit
Choice? 4
Goodbye!
Requirements
For this lab we will leave the design and implementation up to you,
but be sure to use good top-down design to write well-organized code.
The requirements for your program are shown below.
- All input from the user must be validated.
- When searching in the text, you should ignore any punctuation at
the start or end of the words, but keep any punctuation in the
middle of a word.
- Matching words for counting and filtering is case insensitive,
so for example, 'A' and 'a' should be considered the same word.
- You don't need to preserve multiple spaces between words in the
text. If there are several spaces between two words, you are
allowed to replace the multiple spaces with a single space in your
filtered text.
- When filtering the text, keep the same newlines as the original
text, and keep the punctuation and capitalization the same in the
words you are not filtering. If a word to be filtered has
punctuation attached to it, you can remove both the word and its
punctuation.
- When displaying the list of word counts to the user, the list
must be sorted by word count. You must write your own sort function
to do this. You cannot use the Python list method sort() or the
built-in function sorted(). You can implement any of the
sorting algorithms we discussed in class. For words that have the
same word count, show them in alphabetical order.
- Your code should build the word count list as efficiently as
possible. There are a couple of ways to go about this.
- One way is to convert the text file to a list of words as
they appear in the text, but converted to lowercase and stripped of
left and right punctuation. Then you can sort your list of words.
Once the list is sorted, the same words will be grouped together.
You can then get the counts of the words with one pass of the
list.
- The other way to do this is to add words to the count
list as they appear in the text. For each word in the text,
look that word up in your count list. If the word is there,
increase the count of that word by one, otherwise add that word
to the list with a count of one. Since you will be searching
the list for every word in the text, you should make sure the
list is ordered and use a binary search on the list. To help
with this, your binary search should return a list of two items
[index, found]. If the word is in your list, index is the
location in your list and found would be True. If the word is
not in the list, the index is the location where the word should
be added to the list, and found would be False. To actually add
the new word to the list, use the index returned from your
binary search, and use the insert() list method to add it to the
list at the right location.
- When sorting the list by word count, you will be sorting the
list from largest count to smallest. You'll have to
adjust your sorting algorithm accordingly.
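Here is a minimal sketch of the second approach, followed by a
hand-written sort from largest count to smallest. It is one possible
design, not the required one. The names binary_search, add_word,
comes_before, and sort_by_count are made up for this example, and the
count list is assumed to be a list of [word, count] pairs kept in
alphabetical order while it is being built. Selection sort is used
here only because it is short; any sorting algorithm from class could
be adapted the same way.

```python
def binary_search(counts, word):
    """Return [index, found] for word in the alphabetized count list."""
    low = 0
    high = len(counts) - 1
    while low <= high:
        mid = (low + high) // 2
        if counts[mid][0] == word:
            return [mid, True]
        elif counts[mid][0] < word:
            low = mid + 1
        else:
            high = mid - 1
    # Not found: low is the position where the word belongs.
    return [low, False]


def add_word(counts, word):
    """Increase the count for word, inserting it if it is new."""
    index, found = binary_search(counts, word)
    if found:
        counts[index][1] += 1
    else:
        counts.insert(index, [word, 1])


def comes_before(a, b):
    """True if pair a belongs ahead of pair b in the displayed list."""
    if a[1] != b[1]:
        return a[1] > b[1]   # larger count first
    return a[0] < b[0]       # ties in alphabetical order


def sort_by_count(counts):
    """Selection sort, in place, using comes_before for the ordering."""
    for i in range(len(counts)):
        best = i
        for j in range(i + 1, len(counts)):
            if comes_before(counts[j], counts[best]):
                best = j
        counts[i], counts[best] = counts[best], counts[i]
```

When the search fails, the value of low is always either the last mid
or mid+1, so returning low gives the insertion point directly.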
Implementation Tips
- You can strip several different characters at once by giving
strip() a string containing all of the characters to strip.
- The string library has a punctuation string that may be
helpful: use from string import punctuation to import it.
- For checking if a word in the text is in the list of words you
should filter, it is okay to use the python in operator.
- If you are building the count list by adding one word at a time
and the word is not found by your binary search, the proper index at
which to add the new word will be either mid or mid+1, depending
on whether the word you are adding comes before or after the word at
index mid.
- There are several files in /usr/local/doc/text
for you to try. Try your program with some of the shorter texts
first to make it easier to check if your program is correct.
- mary.txt -- The poem "Mary had a Little Lamb".
- declaration.txt -- The Declaration of Independence.
- gettysburg.txt -- Lincoln's Gettysburg address.
- romeo.txt -- A speech from "Romeo and Juliet".
- frost.txt -- "Stopping by Woods on a Snowy Evening"
by Robert Frost.
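The strip and punctuation tips above can be combined as in the sketch
below. The function names clean and filter_line are made up for this
example, and the list of unwanted words is assumed to be already
lowercase. Treat it as a starting point rather than the expected
solution.

```python
from string import punctuation


def clean(word):
    """Lowercase a word and strip punctuation from both ends only."""
    return word.strip(punctuation).lower()


def filter_line(line, unwanted):
    """Return line with every word whose cleaned form is in unwanted
    removed. Kept words retain their original punctuation and case."""
    kept = []
    for word in line.split():
        if clean(word) not in unwanted:
            kept.append(word)
    # Joining with single spaces collapses any runs of spaces.
    return " ".join(kept)
```

Calling filter_line on one line of the file at a time, and printing
each result, keeps the newlines of the original text.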
Hacker Challenge
Here are some ideas for you to extend the lab.
- With a large enough text, the most frequent word will occur
approximately twice as often as the second most frequent word, three
times as often as the third most frequent word, etc. This is known as
Zipf's law. Use matplotlib to plot the word counts for the texts we
have given you and see how closely they follow Zipf's law.
- For longer texts, the wordcount list can be quite long and scrolls
off the screen. Modify your program to show the wordcount one
screen at a time, having the user hit the 'enter' key to show the
next screen. Here is some magic python code to give you the number
of lines in the window your program is running in:
import os
lines = int(os.popen('tput lines', 'r').readline())
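One way to approach the paging idea is to split the output into
screen-sized chunks first and then print them one at a time. The
sketch below assumes you already have the output as a list of strings
and the window height as an integer. The names pages and show_paged,
the wait parameter, and the prompt text are made up for this example.

```python
def pages(lines, height):
    """Split lines into chunks that fit in a window of the given
    height, leaving one row free for the prompt."""
    size = max(1, height - 1)
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def show_paged(lines, height, wait=input):
    """Print lines one screen at a time, pausing between screens."""
    chunks = pages(lines, height)
    for number, chunk in enumerate(chunks):
        for line in chunk:
            print(line)
        # No need to pause after the final screen.
        if number < len(chunks) - 1:
            wait("Press enter for more...")
```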
Submit
Once you are satisfied with your program, hand it in by typing
handin21 at the Linux prompt.
You may run handin21 as many times as you like, and only the
most recent submission will be recorded. This is useful if you realize,
after handing in some programs, that you'd like to make a few more
changes to them.