CS65 Labs 04 and 05 combined

Due 11:59pm Wednesday, October 8

Download the PDF for the assignment and run update65 to create the lab/04 directory. Running update65 also downloads a number of skeleton files necessary for the lab writeup.

You can populate the trie with words from any source you'd like, including the Brown corpus. Here are some sources you can use and how you can access them:

import nltk

# the Brown corpus
brown_words = nltk.corpus.brown.words()
# len(set(brown_words)) -> 56057 types

# the Project Gutenberg corpus
gutenberg_words = nltk.corpus.gutenberg.words()
# len(set(gutenberg_words)) -> 51156 types

# the Reuters news corpus
reuters_words = nltk.corpus.reuters.words()
# len(set(reuters_words)) -> 41600 types

# some Scrabble dictionaries (without NLTK)
# Available dictionaries: 'csw12.txt', 'sowpods.txt', 'twl.txt', 'all.txt'
dictionary = open('/data/cs65/corpora/Scrabble/all.txt')
dictionary_words = [x.rstrip() for x in dictionary]
# len(dictionary_words) -> 270560 types

# unigrams from Google (without NLTK)
wordcts = [x.split() for x in open('/data/google/1gms/word_only_sorted')]
# these are pairs of words and their frequencies on Google; somewhat noisy
# len(set(x[0] for x in wordcts)) -> 7058483 types (2507480 lowercase types)

# You can access the British National Corpus through NLTK, but it is 
# ridiculously slow. Instead, you can use this list of types extracted from
# the British National Corpus
bnc_words = [x.rstrip() for x in open('/data/cs65/corpora/BNC/wordtypes.txt')]
# len(bnc_words) -> 431457 types (158220 lowercase types)
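
Any of the word lists above can be fed into the trie in the same way, since each one is just an iterable of strings. The following is only a minimal sketch of that idea, not the interface from the lab's skeleton files: the names make_trie, contains, and END are hypothetical, the trie is a plain nested dict, and lowercasing every word is an assumed design choice (it merges types such as 'The' and 'the').

```python
# Hypothetical nested-dict trie; the lab's skeleton files may define
# a different interface, so treat these names as illustrative only.

# End-of-word marker. None is used as the key because it can never
# collide with a real character (corpus tokens may contain '$', '.', etc.).
END = None


def make_trie(words):
    """Build a nested-dict trie from any iterable of strings."""
    root = {}
    for word in words:
        node = root
        for ch in word.lower():  # assumption: fold case before inserting
            node = node.setdefault(ch, {})
        node[END] = True  # mark that a complete word ends at this node
    return root


def contains(trie, word):
    """Return True if word was inserted as a complete word."""
    node = trie
    for ch in word.lower():
        if ch not in node:
            return False
        node = node[ch]
    return END in node


# A tiny stand-in list; in the lab this could be brown_words,
# dictionary_words, bnc_words, or any other source listed above.
sample_words = ["The", "then", "trie", "tree"]
trie = make_trie(sample_words)
print(contains(trie, "the"))
print(contains(trie, "tr"))
```

Because make_trie only iterates over its argument, passing nltk.corpus.brown.words() or dictionary_words instead of sample_words should work without other changes, though building from the larger lists will take noticeably longer.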