CS75: Project Part 1

CS75 Project: Part 1: Lexical Analyzer for C--

Checkpoint (10%): Parts 1-3, Due Tuesday Feb. 15 at the beginning of class
(you may not use a late day on the checkpoint)

Complete Project (90%): Parts 1-4, Due: Tuesday Feb. 22 BEFORE 9:30 am
Parts 1-3 may be submitted on-line with part 4 (if you do so, please submit parts 1-3 in ACSII, postscript or pdf form), or you may submit a hard copy of parts 1-3 separately in class on the same day that you handin part 4 on-line (this may be a re-submission of your graded checkpoint if you have made no changes to parts 1-3 since the checkpoint).

Problem Introduction
Getting Started
What to Hand in
A Note on Code Style and Grading
C-- Programming Language Specification
On-line Unix and C help

Introduction

Design and implement a lexical analyzer for C-- as follows (a CFG for C-- is given at the bottom of this page):

List the set of token types to be returned by your lexical analyzer.
Define regular expressions for this set of token types.
Derive a single DFA from your regular expressions.
Implement the DFA in a C program. You should implement the lexical analyzer program as a implementation of the DFA where each state in your DFA is a separate function.

Specifications:

The lexical analyzer should provide a procedure called lexan that returns the next token read from a file stream called source.
Keyword tokens should be distinct from identifier tokens, and different identifier tokens should be distinct from one another. These distinctions are made with the help of a symbol table.
The lexical analyzer should convert integer literals, including characters, into numeric values. For example, 'a' should be read as an integer token with value 97, the ASCII code of the character.

The lexical analyzer should skip over comments which can be of the following forms:

/* 
  this is a comment
  that can go for several
  lines
*/

// this is a single line comment, where everything to right is ignored

For now, the lexical analyzer should report any error it finds (i.e. output a meaningful error message including the source code line number in which it occurs) and quit.
On end-of-file, the lexical analyzer should return a special DONE token.
Design your lexer so that it can read input characters from any file (not just from stdin).

Getting Started

The first three parts of this project should be done before you begin writing any code.

First find the tokens from the CFG specification of the language. You may either have one token for punctuation characters (like semicolon) or a single token for all punctuation characters, and then the attribute will specify which token it is. Same with operators and keywords. You should not have tokens for white-space or comments; both should be striped-out by your lexical analyzer.
Once you find the set of tokens for C--, create regular expressions for them, and then create a single DFA that accepts all tokens in C-- in two steps:
1. Convert the regular expressions to a single NFA using algorithm on page 122
2. Convert the NFA to a DFA using algorithm on page 118
Remember that char literals that appear in the source code are represented as integer literals inside the compiler (char literals are represented as their ASCII values).

Once you have the DFA, then simply translate it to your lexical analyzer program, where each DFA state corresponds to a function.

I suggest you use the simplecompiler as a starting point for your solution: ~newhall/public/cs75/simplecompiler/

This contains a Makefile and a reasonable modular design for your solution, both of which you should feel free to change. If you use this as a starting point, you will need to pretty much start over with the code in the lexer.c file, and you will need to change main to call just your lexan function and the emit function (main should not call parse). You can use the error and symbol table functions as they are implemented, or use them as a guide for implementing your own version of the symbol table. Also, you should change the emit function so that it outputs "TOKEN.attr" for every token. After your lexer returns the DONE token, your main function should printout the contents of the symbol table.

For example, if the input is:

if (x1 >= 'a') 

The output of your program should look something like:

IF
LPAREN
ID.x1
GE
NUM.97
RPAREN
DONE

SYMBOL TABLE CONTENTS: 		# if you added keywords to your symbol table, 
----------------------          # they would be printed here too
ID      x1

Also, before submitting, you should change main so that it takes the C-- source code file as a command line argument:

% ./lexan foo.c--  	# assuming lexan is the name of my LA executable

To do this use argc and argv parameters to main (if you don't know how to do this, look in some of the C references in the lab, or ask me or someone else in the class for help):

int main(int argc, char *argv[]) {

Make sure to test your lexical analyzer on both valid and invalid C-- programs (think about the types of errors a lexical analyzer can and cannot detect).

What to Hand in

For Parts 1-3

A report containing the following should be submitted to me in class:

A list of token types from part 1
The regular expressions form part 2
A drawing of your DFA from part 3

You are welcome to turn in a hand written solution or you can use some word processing software. In Unix, you can use xfig to draw figures, and then export them as postscript files that can be imported into a latex document. Some example information about latex is available here.

For Part 4

Submit a tar file with the following contents using cs75handin:

A README file with:
1. Your name and your partner's name
2. The number of late days you have used so far
3. Indicate whether or not your parts 1-3 have changed since the checkpoint
4. A brief explanation of how to run your program with an example command line.
5. A file containing a small sample of output from your program (you can use script command to capture all of a shell's stdin, stdout, stderr to a file, and dos2unix to clean-up the resulting typescript file)
6. Sample test input file(s) that you used to collect the sample output you are submitting
All the source and header files and the makefile needed to compile, run and test your code Do not submit object or executable files.

A Note on Code Style and Grading:

Correctness counts for only part of your grade. To receive an A, your solution must be correct, robust, and efficient, have good modular design, and be well commented. In particular, your comments should include:

A comment at the top of each file with a high-level overview of the code contained in that module (what is this module and what does it do?).

Each function should have a comment containing:

a brief description of what the function does
what the function's input values are
What values the function return(s) (include what is returned on all error conditions as well as regular return values).

For example, something like this is good:

/*
 * Function: Area(float w, float h)
 * Computes the area of a rectangle given its height and width
 *
 * param w: the witdth of the rectangle in inches
 * param h: the height of the rectangle in inches
 * returns: the area of the rectangle 
 *          or -1 if the area cannot be computed due to a bad input value 
 */

The rest of the code should have a few (not overly many) comments to describe important data structures and to describe any complex sequences of code. If a function implements a complicated algorithm, it is good to enumerate the steps of the algorithm in the function's comment, and then have in-line comments of the same steps preceding the code implementing the step: this is a good way to ensure that you remember to implement all steps as you are writing code.
My advice is to write the comment first, then write the code. This way your comment will help guide your code writing. Also, by writing the comment at the same time as the code, your comments and code are more likely to match, and it is more efficient (you do not have to go back later and try to figure out what a function does so that you can write its comment).

In addition, you code should be easily extensible to allow for larger problem sizes. For example, if you use a fixed-sized array for the symbol table then use a constant when referring to its max size in your program (e.g. x = SYMTABSIZE; where SYMTABSIZE is defined as a constant) rather than the numeric value (x = 512;). This way, your program can be easily modified to handle a larger symbol table size by making just one change to the constant's definition.

See the links to Unix and C help at the bottom of the cs75 homepage if you need help using some unix commands or debugging your C program.

C-- Programming Language Specification

Here is a one-page printable version of the grammar given below (to print: lpr grammar.ps):

C-- Grammar

C-- Grammar:
Program ------> VarDeclList FunDeclList 

VarDeclList --> epsilon 
		VarDecl  VarDeclList

VarDecl ------> Type id ;
                Type id [ num] ; 

FunDeclList --> FunDecl 
 		FunDecl  FunDeclList

FunDecl ------> Type id ( ParamDecList ) Block

ParamDeclList --> epsilon 
 		  ParamDeclListTail 

ParamDeclListTail -->  ParamDecl 
 		       ParamDecl,  ParamDeclListTail 

ParamDecl ----> Type id
		Type id[]

Block --------> { VarDeclList StmtList } 

Type ---------> int
                char

StmtList -----> Stmt 
 		Stmt  StmtList 

Stmt ---------> ;
                Expr ; 
                return Expr ;
                read id ;
                write Expr ;
                writeln ;
                break ;
                if ( Expr ) Stmt else Stmt
                while ( Expr ) Stmt 
                Block

Expr ---------> Primary 
                UnaryOp Expr
                Expr BinOp Expr
                id = Expr 
                id [ Expr ] = Expr 

Primary ------> id
                num 
                ( Expr )
                id ( ExprList )
                id [ Expr ] 

ExprList -----> epsilon 
                ExprListTail 

ExprListTail --> Expr 
                 Expr , ExprListTail
 
UnaryOp ------> - | !

BinOp -------->  + | - | * | / | == | != | < | <= | > | >=  | && | ||