CS75 Lab 1: A Lexical Analyzer for C--

Due Dates
Checkpoint (10%): Parts 1-3, Due Tuesday Feb. 9 at the beginning of class
(you may not use a late day on the checkpoint)

Complete Project (90%): Parts 1-4, Due: Tuesday Feb. 17 BEFORE 2 am (late Monday night)

You should submit a hard-copy of the checkpoint to me (stapled and with your names on it).

You should submit parts 1-3 when you submit part 4. You can do so electronically (in ASCII, postscript or pdf) with the part 4 files you tar up and submit via cs75handin, or you can give me a hard copy of your solution to parts 1-3 (if you changed it since the checkpoint, you should submit your new version of parts 1-3).

Index
Problem Introduction
Getting Started
What to Hand in
A Note on Code Style and Grading
C-- Programming Language Specification
On-line Unix and C help
Project Part 1 FAQ
I'll post answers to questions I get about this assignment here.
Problem Introduction
You will design and implement a lexical analyzer for C-- in 4 parts (a CFG for C-- is given at the bottom of this page).

Project Parts:

  1. List the set of token types to be returned by your lexical analyzer.
  2. Define regular expressions for this set of token types.
  3. Derive a single DFA from your regular expressions.
  4. Implement the DFA in a C program. Your lexical analyzer program will be an implementation of the DFA from part 3, where each state in your DFA is a separate function.


Getting Started

Starting Point Code

Take a look at last week's lab (lab 0) where we set up subversion, checked out the starting point code for the lab assignments, did a walk through the C code layout and the makefiles, and went through some examples of using gdb and valgrind.

Off the lab 0 page are links to information about using svn, make, gdb, valgrind.

For this assignment, you will be adding code to the lexer and includes subdirectories in the starting point.

Parts 1-3

The first three parts of this project should be done before you begin writing any code.
  1. First, find the tokens from the CFG specification of the language. You may have either one token type per punctuation character (like semicolon) or a single token type for all punctuation characters, with an attribute specifying which character it is. The same goes for operators and keywords. You should not have tokens for whitespace or comments; both should be stripped out by your lexical analyzer.
  2. Once you find the set of tokens for C--, create regular expressions for them, and then create a single DFA that accepts all tokens in C-- in two steps:
    1. Convert the regular expressions to a single NFA using the algorithm on page 159
    2. Convert the NFA to a DFA using algorithm on page 153
    Remember that char literals that appear in the source code are represented as integer literals inside the compiler (char literals are represented as their ASCII values).

    whitespace and comments

    Your DFA should include transitions to whitespace and comment DFAs where appropriate.

    Whitespace characters in C are: the space character (' '), tab ('\t'), form-feed ('\f'), newline ('\n'), carriage return ('\r'), and vertical tab ('\v'). (See the man page for isspace.)

    There are two forms of C style comments that you need to support:

    // everything after slash slash to the end of the line ('\n') is a comment
    
    /*  
      everything between matching slash-star and star-slash is a comment
    	(this style can span multiple lines)
    */
    
    You do not need to handle comments where */ appears inside double quotes within the comment (i.e., /* "*/" */ is not a valid comment in C--, even though it was described as a valid comment in problem 3.3.5 from hw2). You may assume that as soon as you read a */ (after reading /*), you are at the end of the comment. gcc does not recognize the special case from problem 3.3.5 as a valid C comment (thanks to Meggie and Adam, who tried it out), and even if it did, you would not need to handle this crazy case in your comment DFA.

Part 4

Once you have the DFA, simply translate it into your lexical analyzer program, where each DFA state corresponds to a function.

I suggest you look at the code in simplecompiler as a hint at how to structure your lexical analyzer code: ~newhall/public/cs75/simplecompiler/

Your lexical analyzer does not need to create a symbol table (we will add this in a later part of the compiler).

Your emit function should output TOKEN.attr for every token. Some tokens may not have attributes, but for those that do, your lexical analyzer should output their values.

For example, if the input is:

if (x1 >= 'a') 

The output of your program should look something like:

IF
LPAREN
ID.x1
GE
NUM.97
RPAREN
DONE
Your lexical analyzer should take a C-- source code file as a command line argument:
% ./lexan foo.c--  	# assuming lexan is the name of my LA executable
To do this, use the argc and argv parameters to main (main.c in the starting point code you grabbed in lab 0 has an example of how to do this):
int main(int argc, char *argv[]) {

You can use fgetc or getc to read one character at a time from the input file, and ungetc to put back one character onto the input stream. See the man pages for more details about using these.

Make sure to test your lexical analyzer on both valid and invalid C-- programs (think about the types of errors a lexical analyzer can and cannot detect).

What to Hand in

For Parts 1-3

A hardcopy of a report containing the following should be submitted to me in class:
  1. A list of token types from part 1
  2. The regular expressions from part 2
  3. A drawing of your DFA from part 3
You are welcome to turn in a hand written solution or you can use some word processing software. In Unix, you can use xfig to draw figures, and then export them as postscript files that can be imported into a LaTeX document. Some example information about LaTeX is available here.

For Part 4

Submit a tar file with the following contents using cs75handin:
  1. A README.proj1 file with:
    1. Your name and your partner's name
    2. The number of late days you have used so far
    3. Indicate whether or not your parts 1-3 have changed since the checkpoint
    4. A brief explanation of how to run your program with an example command line.
    5. A file containing a small sample of output from your program (you can use the script command to capture all of a shell's stdin, stdout, and stderr to a file, and dos2unix to clean up the resulting typescript file)
    6. Sample test input file(s) that you used to collect the sample output you are submitting

  2. All the source and header files and the makefile needed to compile, run, and test your code. Do not submit object or executable files.

    The easiest thing to do is to create a tar file out of your top-level project subdirectory of your svn repository (you can either check out a clean version to tar up, or tar up a cleaned-up copy of your checked-out working version):

    $ cd ~/cs75/CS75_inits/project
    $ vi README.proj1 	# add your README file
    $ make clean
    $ cd ..
    $ tar cvf proj1.tar project
    
A Note on Code Style and Grading:
Correctness counts for only part of your grade. To receive an A, your solution must be correct, robust, and efficient, have good modular design, and be well commented. I am picky about well-commented code, and I suggest you read the section of my C code style guide about commenting C code. My general advice about commenting is to write the comment first, then write the code. This way your comment will help guide your code writing. Also, by writing the comment at the same time as the code, your comments and code are more likely to match, and it is more efficient (you do not have to go back later and figure out what a function does so that you can write its comment).

In addition, code you write should be easily extensible to allow for larger problem sizes. For example, if you use a fixed-sized char array to store the lexeme of the current token, then use a constant to define the size of this array (e.g. #define MAX_LEXEME 512) and use the constant rather than its value in your code (e.g. x = MAX_LEXEME; rather than x = 512;). This way, your program can be easily modified to handle a larger max lexeme size by making just one change to the constant's definition.

C-- Programming Language Specification:
Here is a one-page printable version of the C-- grammar that is listed below: grammar.pdf

C-- Grammar:

Program ------> VarDeclList FunDeclList 

VarDeclList --> epsilon 
		VarDecl  VarDeclList

VarDecl ------> Type id ;
                Type id [ num ] ; 

FunDeclList --> FunDecl 
 		FunDecl  FunDeclList

FunDecl ------> Type id ( ParamDecList ) Block

ParamDeclList --> epsilon 
 		  ParamDeclListTail 

ParamDeclListTail -->  ParamDecl 
 		       ParamDecl,  ParamDeclListTail 

ParamDecl ----> Type id
		Type id[]

Block --------> { VarDeclList StmtList } 

Type ---------> int
                char

StmtList -----> Stmt 
 		Stmt  StmtList 

Stmt ---------> ;
                Expr ; 
                return Expr ;
                read id ;
                write Expr ;
                writeln ;
                break ;
                if ( Expr ) Stmt else Stmt
                while ( Expr ) Stmt 
                Block

Expr ---------> Primary 
                UnaryOp Expr
                Expr BinOp Expr
                id = Expr 
                id [ Expr ] = Expr 

Primary ------> id
                num 
                ( Expr )
                id ( ExprList )
                id [ Expr ] 

ExprList -----> epsilon 
                ExprListTail 

ExprListTail --> Expr 
                 Expr , ExprListTail
 
UnaryOp ------> - | !

BinOp -------->  + | - | * | / | == | != | < | <= | > | >=  | && | ||