CS65: Fall 2008

Introduction

This course will introduce you to a broad range of topics in natural language processing, including language modeling, part-of-speech tagging, spelling correction, morphology, syntactic parsing, semantics and machine translation. If time permits, we may also cover speech recognition, natural language generation or discourse systems.

Class information

Professor: Richard Wicentowski
Office: Science Center 251
Phone: (610) 690-5643
Office hours: Wednesday 1:00-3:00 pm or by appointment

Room: Robot Lab
Time: Tuesday, Thursday 9:55am–11:10am
Text: Jurafsky and Martin, Speech and Language Processing, 2nd edition

Schedule

(exceedingly likely to change)

WEEK	DAY	ANNOUNCEMENTS	TOPIC & READING	HOMEWORK
1	Sep 02		* Jurafsky and Martin, Chapters 1-2: Introduction, Regular Expressions * Lee, L., 2004. "I'm sorry Dave, I'm afraid I can't do that": Linguistics, Statistics, and Natural Language Processing circa 2001 (2up). Computer Science: Reflections on the Field, Reflections from the Field, pp. 111-118. * (Reference) Mertz, D., 2003. Text Processing in Python, Chapter 3.	Lab 1
1	Sep 04			Lab 1
2	Sep 09		* Jurafsky and Martin, Chapter 4: Maximum Likelihood Estimation (MLE), N-gram models for generation and prediction, smoothing, Good-Turing, Kneser-Ney
2	Sep 11	Drop/Add ends (Sep 12)
3	Sep 16		* Klein, S. and Simmons, R., 1963. A computational approach to grammatical coding of English words (2up). Journal of the Association for Computing Machinery 10, pp. 334-347. * Stolz, W. et al., 1965. A Stochastic Approach to the Grammatical Coding of English (2up). Communications of the ACM 8:6, pp. 399-405.	Lab 2
3	Sep 18		* Jurafsky and Martin, Chapter 5 sections 5.1-5.4, 5.6 * Brill, E., 1995. Transformation-Based Error-Driven Learning and Natural Language Processing: A Case Study in Part of Speech Tagging (2up). Computational Linguistics 21:4, pp. 1-37 * Brants, T., 2000. TnT - a statistical part-of-speech tagger (2up). Proceedings of the 6th Applied Natural Language Processing Conference.	Lab 2
4	Sep 23		* Jurafsky and Martin, Section 2.2 (FSA), Section 3.4 (FST), Section 5.5 (HMM POS Tagging), Chapter 6 (HMMs, up to and including 6.5), Minimum Edit Distance	Lab 3
4	Sep 25			Lab 3
5	Sep 30
5	Oct 02		* Jurafsky and Martin, skim Sections 7.1-7.3; read Chapter 11 * (Reference) Class slides * (Reference) Harris, Z., 1955. From phoneme to morpheme (2up). Language, 31:190-222. * (Reference) Harris, Z., 1967. Morpheme boundaries within words: Report on a computer test (2up). In Transformations and Discourse Analysis Papers. Department of Linguistics, University of Pennsylvania. * (Reference) Hafer, M. and Weiss, S., 1974. Word segmentation by letter successor varieties (2up). Information Storage and Retrieval, 10:371-385.
6	Oct 07		Morphology Induction * Goldsmith, J., 2001. Unsupervised Learning of the Morphology of a Natural Language (2up). In Computational Linguistics, 27(2):153-198.
6	Oct 09	Exam #1
	Oct 14	October Holiday
	Oct 16	October Holiday
7	Oct 21		* Goldsmith, J., 2001 (continued) * Exam discussion * Final project discussion: NLTK Projects (Python); SenseClusters TODO (Perl); NSP TODO (Perl)
7	Oct 23		* Schone, P. and Jurafsky, D., 2000. Knowledge-Free Induction of Morphology Using Latent Semantic Analysis (2up) * Yarowsky, D. and Wicentowski, R., 2000. Minimally Supervised Morphological Analysis by Multimodal Alignment (2up) * Schone, P. and Jurafsky, D., 2001. Knowledge-Free Induction of Inflectional Morphologies (2up)
8	Oct 28
8	Oct 30
9	Nov 04		* Computational Lexical Semantics: Jurafsky and Martin, Chapter 19.1-19.3, 20.1-20.6
9	Nov 06	Last day to declare CR/NC or withdraw with a W (Nov 07)	* McCarthy, D. et al., 2004. Finding Predominant Word Senses in Untagged Text
10	Nov 11		* Purandare, A. and Pedersen, T., 2004. Word Sense Discrimination by Clustering Contexts in Vector and Similarity Spaces
10	Nov 13		* Knight, K., 1997. Automating Knowledge Acquisition for Machine Translation. AI Magazine, Volume 18, No. 4, 1997. (SECTIONS 3 AND 4: OPTIONAL)
11	Nov 18		* Knight, K., 1999. A Statistical MT Tutorial Workbook. Prepared for the 1999 JHU Summer Workshop.
11	Nov 20		* Rachel: Gale, W. and Church, K., A program for aligning sentences in bilingual corpora. Proceedings of ACL. 1991. * Rio and Ryan: Tintarev, N. and Masthoff, J., Similarity for news recommender systems. Adaptive Hypermedia and Adaptive Web-Based Systems Workshop on Recommender Systems and Intelligent User Interfaces. 2006. * Brian and Dougal (G&C sec 1&2, R&K sec 1): Gorman, J. and Curran, J., Scaling Distributional Similarity to Large Corpora. Proceedings of ACL. 2006; Rychlý, P. and Kilgarriff, A., An efficient algorithm for building a distributional thesaurus. Proceedings of ACL. 2007.
12	Nov 25		* Meggie and Malcolm (read sec 2, 3, 5.3): Hirst, G. and St-Onge, D., Lexical chains as representations of context for the detection and correction of malapropisms. In WordNet: An Electronic Lexical Database. 1998. * Matt T.: Barzilay, R. and Elhadad, M., Using Lexical Chains for Text Summarization. In ACL 1997 Workshop on Intelligent Scalable Text Summarization. 1997. * Phyo: Lin, D., An Information-Theoretic Definition of Similarity. Proceedings of ICML. 1998.
12	Nov 27	Thanksgiving
13	Dec 02		* Colin and Matt B.: Lin, W-H et al., Which Side are You on? Identifying Perspectives at the Document and Sentence Levels. Proceedings of CoNLL. 2006. * Joon and Trilok (read sec 1&3, skim sec 2): Creutz, M. and Lagus, K., Unsupervised Models for Morpheme Segmentation and Morphology Learning. ACM Transactions on Speech and Language Processing. Volume 4, Number 1. 2007. * Jake and Derek: Keller, F. et al., Using the web to overcome data sparseness. Proceedings of EMNLP. 2002.
13	Dec 04		* Max (read sec 1 & 2, skim sec 4): Kay, M. and Röscheisen, M., Text-Translation Alignment. Computational Linguistics, Volume 19, Number 1, March 1993, Special Issue on Using Large Corpora. * 2nd midterm exam topics
14	Dec 09	Exam #2; Final project due (Dec 16)

Grading

Your overall grade in the course will be determined as follows:

50%	Labs, assignments, and projects
15%	Midterm exams
25%	Final project
10%	Class participation and attendance
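
As a worked illustration of the weights above, here is a minimal Python sketch of how an overall grade could be combined. It assumes each category is scored on a 0-100 scale and that the categories are combined as a simple weighted average; the category names and the example scores are hypothetical and are not taken from the course.

```python
# Category weights from the grading table, in percent.
WEIGHTS = {
    "labs": 50,
    "midterms": 15,
    "final_project": 25,
    "participation": 10,
}


def overall_grade(scores):
    """Return the weighted average of category scores (0-100 scale).

    Assumes a simple linear combination of the four categories.
    """
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return total / sum(WEIGHTS.values())


# Hypothetical scores, for illustration only.
example_scores = {
    "labs": 90,
    "midterms": 80,
    "final_project": 85,
    "participation": 100,
}
print(overall_grade(example_scores))
```

Keeping the weights as whole percentages makes it easy to check that they sum to 100.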

Policy on Programming Assignments

You will submit your assignments electronically using the handin65 program. You may submit your assignment multiple times, but each submission overwrites the previous one and only the final submission will be graded. Normally, late assignments will not be accepted; however, special exceptions can be made if you contact me in advance of the deadline. Even if you do not fully complete an assignment, you may submit what you have done to receive partial credit.

Some assignments may take a considerable amount of time, so you are strongly encouraged to begin working on assignments well before the due date.

Programming Language

Though there is no "required" programming language for the course, some assignments will presuppose knowledge of Python. If you do not know Python, please let me know so I can point you to a good online tutorial. You will almost certainly want to learn some Perl, as well as bash programming, but you are not expected to know this yet.

Please make sure that each program you turn in has:

A comment at the top of the program that includes
- Program authors
- A brief description of what the program does
Concise comments that summarize major sections of your code
Meaningful variable and function names
Well organized code
White space to improve legibility
Lines less than 80 characters wide (whenever possible)
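
As an illustration of the checklist above, here is a minimal sketch of a small Python program laid out in that style. The author names, the function names and the sample sentence are placeholders invented for this example; it is not taken from any course assignment.

```python
# Authors: A. Student and B. Partner (placeholder names)
# Description: Counts how often each word occurs in a piece of text
#              and reports the most frequent words.

from collections import Counter


# ---- Text processing helpers ----

def tokenize(text):
    """Split text into lowercase, whitespace-delimited tokens."""
    return text.lower().split()


def most_frequent_words(text, how_many):
    """Return the how_many most frequent (word, count) pairs."""
    word_counts = Counter(tokenize(text))
    return word_counts.most_common(how_many)


# ---- Main program: report the top words in a sample sentence ----

if __name__ == "__main__":
    sample_text = "the cat saw the dog and the dog saw the cat"
    for word, count in most_frequent_words(sample_text, 3):
        print(word, count)
```

The header comment names the authors and describes the program, each section has a one-line summary, and every line stays under 80 characters.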

Academic Integrity

Academic honesty is required in all work you submit to be graded. With the exception of your partner on assignments, you may not submit work done with (or by) someone else, or examine or use work done by others to complete your own work.

You may discuss assignment specifications and requirements with others in the class to be sure you understand the problem. In addition, you are allowed to work with others to help learn the course material. However, with the exception of your lab partner, you may not work with others on your assignments in any capacity.

All code you submit must be your own, with the following permissible exceptions: code distributed by me as part of the class, code found in the course textbook, and code worked on with your assignment partner. You should always include detailed comments that indicate which parts of the assignment you received help on, and what your sources were.

Please see me if there are any questions about what is permissible.

External Links

Jurafsky and Martin, Speech and Language Processing (2/e), 2008
Manning and Schütze, Foundations of Statistical Natural Language Processing, 1999
Mertz, Text Processing in Python, 2003
NLTK: Natural Language Toolkit
The ACL Anthology
Python Documentation
How To Think Like a Computer Scientist: Learning with Python

CS65: Natural Language Processing