CS 35

Program 6

due Friday 9 Mar. by midnight

Word Frequency

You may work with one partner on this assignment. If you do so, only one of you should submit the joint solution, with both names in the first comment of each file. You must put your solution in a subdirectory named with the first 4 letters of your last name and make sure it works on the CS lab machines. You will then send me a tarball of that subdirectory.

A number of authorship studies have relied on a statistical analysis of language features including word frequency and word proximity. These techniques have been applied to Shakespeare plays, novels, and the Federalist Papers, among others.

The full studies rely on much more than just word frequency and proximity, but for this assignment we will only work on providing tools for studying word frequency, word context, and word proximity.

For any book-length or shorter text that is on her computer, our scholar wants:

1) the program to be able to determine, for each word, how often it occurs and the index of each occurrence;

2) the program to output the total number of words (tw) and the number of distinct words (dw);

3) the program to allow the user to ask for:

a) given a word, output its frequency in the text.

b) for any number k <= dw, output the k most frequent words and their respective frequencies, as if the words were sorted in descending order by frequency.

c) output all the distinct words and their frequencies in lexicographic order.

d) given a word, output the indices of all occurrences of that word in the text.

e) given two words and a distance d, output the number of times the words occur within d of each other in the text.

4) the program to be written in such a way that it could later provide more capabilities, be generalized, or be given a GUI.
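To make queries 3b and 3e concrete, here is a minimal sketch in Java. It is one possible reading, not a required design. The class name WordQueries and the methods topK and countWithin are hypothetical. The sketch assumes that frequencies are handed over as a Map (a stand-in for whatever your own structure produces), that ties in frequency are broken alphabetically, that "within d" means the two indices differ by at most d, that every such pair of occurrences is counted, and that the two words are different.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper class; names and tie-breaking rules are assumptions.
class WordQueries {

    // Query 3b: the k most frequent words without sorting all dw words.
    // A heap of size k keeps the current best k; its head is the worst of them.
    static List<Map.Entry<String, Integer>> topK(Map<String, Integer> freq, int k) {
        Comparator<Map.Entry<String, Integer>> byRank = (x, y) -> {
            int c = Integer.compare(y.getValue(), x.getValue()); // higher frequency first
            return c != 0 ? c : x.getKey().compareTo(y.getKey()); // then alphabetical
        };
        PriorityQueue<Map.Entry<String, Integer>> heap =
                new PriorityQueue<>(byRank.reversed());
        for (Map.Entry<String, Integer> e : freq.entrySet()) {
            heap.add(e);
            if (heap.size() > k) {
                heap.poll(); // discard the lowest-ranked entry seen so far
            }
        }
        List<Map.Entry<String, Integer>> result = new ArrayList<>(heap);
        result.sort(byRank); // only k entries are sorted
        return result;
    }

    // Query 3e: count pairs (i, j) with i from a, j from b, and |i - j| <= d.
    // Both lists must hold occurrence indices in ascending order.
    static long countWithin(List<Integer> a, List<Integer> b, int d) {
        long count = 0;
        int lo = 0; // first position in b whose index is >= pos - d
        int hi = 0; // first position in b whose index is >  pos + d
        for (int pos : a) {
            while (lo < b.size() && b.get(lo) < pos - d) lo++;
            while (hi < b.size() && b.get(hi) <= pos + d) hi++;
            count += hi - lo;
        }
        return count;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new HashMap<>();
        freq.put("the", 3);
        freq.put("cat", 1);
        freq.put("dog", 2);
        freq.put("a", 2);
        System.out.println(topK(freq, 2)); // prints [the=3, a=2]
        System.out.println(
                countWithin(Arrays.asList(0, 3), Arrays.asList(1, 4, 10), 3)); // prints 3
    }
}
```

Other readings of 3e are defensible, for example counting each occurrence of the first word at most once. Whichever reading you choose, state it in a comment.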

For Tuesday, 6 March, your task is to write code to implement all parts of items 1, 2, and 3. However, you MUST use a Binary Search Tree (BST) that you implement yourself as the initial structure into which you insert words and increment their frequency counts.
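As a rough illustration of the kind of structure meant here, the sketch below shows one way a BST could hold each word together with the indices of its occurrences, so that the frequency is simply the number of stored indices. The names WordTree, Node, insert, frequency, positions, and inOrder are hypothetical. The sketch also assumes that words are compared with String.compareTo exactly as given (no case folding or punctuation stripping) and that the tree is not rebalanced.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: an unbalanced BST keyed by word.
class WordTree {
    private static class Node {
        final String word;
        final List<Integer> positions = new ArrayList<>(); // index of each occurrence
        Node left, right;

        Node(String word) {
            this.word = word;
        }
    }

    private Node root;
    private int totalWords;    // tw
    private int distinctWords; // dw

    // Record one occurrence of word at the given index in the text.
    void insert(String word, int index) {
        root = insert(root, word, index);
        totalWords++;
    }

    private Node insert(Node node, String word, int index) {
        if (node == null) {
            node = new Node(word);
            distinctWords++;
        }
        int cmp = word.compareTo(node.word);
        if (cmp < 0) {
            node.left = insert(node.left, word, index);
        } else if (cmp > 0) {
            node.right = insert(node.right, word, index);
        } else {
            node.positions.add(index); // existing or freshly created node
        }
        return node;
    }

    private Node find(String word) {
        Node cur = root;
        while (cur != null) {
            int cmp = word.compareTo(cur.word);
            if (cmp == 0) return cur;
            cur = (cmp < 0) ? cur.left : cur.right;
        }
        return null;
    }

    int frequency(String word) {
        Node n = find(word);
        return n == null ? 0 : n.positions.size();
    }

    List<Integer> positions(String word) {
        Node n = find(word);
        return n == null ? new ArrayList<>() : new ArrayList<>(n.positions);
    }

    int totalWords() { return totalWords; }

    int distinctWords() { return distinctWords; }

    // In-order traversal yields the words in lexicographic order.
    List<String> inOrder() {
        List<String> out = new ArrayList<>();
        inOrder(root, out);
        return out;
    }

    private void inOrder(Node n, List<String> out) {
        if (n == null) return;
        inOrder(n.left, out);
        out.add(n.word + " " + n.positions.size());
        inOrder(n.right, out);
    }

    public static void main(String[] args) {
        WordTree tree = new WordTree();
        String[] words = "the cat saw the dog".split(" ");
        for (int i = 0; i < words.length; i++) {
            tree.insert(words[i], i);
        }
        System.out.println(tree.inOrder());        // prints [cat 1, dog 1, saw 1, the 2]
        System.out.println(tree.positions("the")); // prints [0, 3]
    }
}
```

Because this tree is not balanced, its depth depends on the order in which words arrive. Ordinary prose tends to give a reasonably bushy tree, but input that is already sorted would degrade it into a linked list.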

Please follow the Java Code Style recommended on Tia Newhall's page: Java Code Style. In addition, make sure that the first line in each source file is a comment that has your name in it.

Happy computing.