Swarthmore College Department of Computer Science

Natural Language Processing

Dougal Sutherland and Rich Wicentowski

During the first month of my research, Rich and I tried to build a system for automatically extracting the medications a patient took (along with the dosage, how often they were taken, the reason they were prescribed, and the like) from the transcript of the patient's discharge report. Unfortunately, this ultimately proved not to be very interesting: we didn't have the training data needed to do anything beyond a rules-based parser, which turned into an uninteresting grunt-work programming task.
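For a flavor of what a rules-based parser of this sort looks like, here is a toy sketch; the pattern and field names are my own illustration, not the actual code:

    import re

    # One hand-written pattern for a common discharge-report phrasing, e.g.
    # "Lisinopril 10 mg by mouth once daily for hypertension."
    MED_LINE = re.compile(
        r"(?P<name>[A-Z][a-z]+)\s+"
        r"(?P<dose>\d+(?:\.\d+)?\s*(?:mg|mcg|g|mL))\s+"
        r"(?:by mouth\s+)?"
        r"(?P<freq>once daily|twice daily|every \d+ hours)"
        r"(?:\s+for\s+(?P<reason>[\w ]+?))?[.;]"
    )

    def extract_medications(text):
        """Return one {name, dose, freq, reason} dict per matched mention."""
        return [m.groupdict() for m in MED_LINE.finditer(text)]

    print(extract_medications(
        "Lisinopril 10 mg by mouth once daily for hypertension."
    ))
    # [{'name': 'Lisinopril', 'dose': '10 mg',
    #   'freq': 'once daily', 'reason': 'hypertension'}]

Real reports phrase the same facts in dozens of ways, so the list of patterns keeps growing, which is exactly the grunt work described above.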

Because we didn't feel we were adding anything to the task, we switched to working with the ClueWeb09 dataset -- basically a 25,000-gigabyte copy of a sizable chunk of the Internet, in a box. I spent two weeks exploring the data, writing scripts and building indexes to make it easier to work with; these will be used in the Information Retrieval class.
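As a minimal sketch of that kind of indexing script (ClueWeb09 ships as gzipped WARC files whose records carry a WARC-TREC-ID header; the function name and pickled-dict format here are illustrative, not the actual code):

    import gzip
    import os
    import pickle

    def index_warc_ids(directory, index_path="clueweb_index.pkl"):
        """Map each ClueWeb09 record ID to the .warc.gz file containing it.

        Scanning for the WARC-TREC-ID header line is far cheaper than
        parsing full records, and the resulting dict lets later code
        jump straight to the right file for a given record.
        """
        index = {}
        for name in sorted(os.listdir(directory)):
            if not name.endswith(".warc.gz"):
                continue
            path = os.path.join(directory, name)
            with gzip.open(path, "rt", encoding="utf-8",
                           errors="replace") as f:
                for line in f:
                    if line.startswith("WARC-TREC-ID:"):
                        record_id = line.split(":", 1)[1].strip()
                        index[record_id] = name
        with open(index_path, "wb") as out:
            pickle.dump(index, out)
        return index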

Finally, I worked on a project that I hope to apply to a sizable subset of this dataset. This is a program that uses the concept of lexical distributional similarity to build a thesaurus of sorts. (The idea is that if we see both the relations "dog"-subject-of-"growl" and "cat"-subject-of-"growl" frequently, "cat" and "dog" are likely to be related.) A thesaurus output by this kind of system is useful for a variety of natural language processing tasks, including context-sensitive spelling correction, entity set expansion, speech recognition software, topic classification and clustering systems, and so on. Once polished, this program will be submitted to an open-source natural languge processing toolkit.