CS75 Lab 1: A Lexical Analyzer for C--

Due Dates
Checkpoint (10%): Parts 1-3, Due Tuesday Feb. 9 at the beginning of class
(you may not use a late day on the checkpoint)

Complete Project (90%): Parts 1-4, Due: Tuesday Feb. 17 BEFORE 2 am (late Monday night)

You should submit a hard-copy of the checkpoint to me (stapled and with your names on it).

You should submit parts 1-3 when you submit part 4. You can do so electronically (in ASCII, postscript or pdf) with the part 4 files you tar up and submit via cs75handin, or you can give me a hard copy of your solution to parts 1-3 (if you changed it since the checkpoint, you should submit your new version of parts 1-3).

Index
Problem Introduction
Getting Started
What to Hand in
A Note on Code Style and Grading
C-- Programming Language Specification
On-line Unix and C help
Project Part 1 FAQ
I'll post answers to questions I get about this assignment here.
Problem Introduction
You will design and implement a lexical analyzer for C-- in 4 parts (a CFG for C-- is given at the bottom of this page).

Project Parts:

  1. List the set of token types to be returned by your lexical analyzer.
  2. Define regular expressions for this set of token types.
  3. Derive a single DFA from your regular expressions.
  4. Implement the DFA in a C program. Your lexical analyzer program will be an implementation of the DFA from part 3, where each state in your DFA is a separate function.


Getting Started

Starting Point Code

Take a look at last week's lab (lab 0) where we set up subversion, checked out the starting point code for the lab assignments, did a walk through the C code layout and the makefiles, and went through some examples of using gdb and valgrind.

Off the lab 0 page are links to information about using svn, make, gdb, valgrind.

For this assignment, you will be adding code to the lexer and includes subdirectories in the starting point.

Parts 1-3

The first three parts of this project should be done before you begin writing any code.
  1. First, find the tokens from the CFG specification of the language. You may have either one token type per punctuation character (like semicolon) or a single token type for all punctuation characters, with an attribute specifying which character it is. The same goes for operators and keywords. You should not have tokens for whitespace or comments; both should be stripped out by your lexical analyzer.
  2. Once you find the set of tokens for C--, create regular expressions for them, and then create a single DFA that accepts all tokens in C-- in two steps:
    1. Convert the regular expressions to a single NFA using the algorithm on page 159
    2. Convert the NFA to a DFA using algorithm on page 153
    Remember that char literals that appear in the source code are represented as integer literals inside the compiler (char literals are represented as their ASCII values).

    whitespace and comments

    Your DFA should include transitions to whitespace and comment DFAs where appropriate.

    Whitespace characters in C are: the space character (' '), tab ('\t'), form-feed ('\f'), newline ('\n'), carriage return ('\r'), and vertical tab ('\v'). (See the man page for isspace.)

    There are two forms of C style comments that you need to support:

    // everything after slash slash to the end of the line ('\n') is a comment
    
    /*  
      everything between matching slash-star and star-slash is a comment
    	(this style can span multiple lines)
    */
    
    You do not need to handle comments where */ appears inside double quotes within the comment (i.e., /* "*/" */ is not a valid comment in C--, even though it was described as a valid comment in problem 3.3.5 from hw2). You may assume that as soon as you read a */ (after reading /*), you are at the end of the comment. gcc does not recognize the special case from problem 3.3.5 as a valid C comment (thanks to Meggie and Adam, who tried it out), and even if it did, you would not need to handle this crazy case in your comment DFA.

Part 4

Once you have the DFA, simply translate it into your lexical analyzer program, where each DFA state corresponds to a function.

I suggest you look at the code in simplecompiler as a hint at how to structure your lexical analyzer code: ~newhall/public/cs75/simplecompiler/

Your lexical analyzer does not need to create a symbol table (we will add this in a later part of the compiler).

Your emit function should output TOKEN.attr for every token. Some tokens may not have attributes, but for those that do, your lexical analyzer should output their values.

For example, if the input is:

if (x1 >= 'a') 

The output of your program should look something like:

IF
LPAREN
ID.x1
GE
NUM.97
RPAREN
DONE
Your lexical analyzer should take a C-- source code file as a command line argument:
% ./lexan foo.c--  	# assuming lexan is the name of my LA executable
To do this, use the argc and argv parameters to main (main.c in the starting point code you grabbed in lab 0 has an example of how to do this):
int main(int argc, char *argv[]) {

You can use fgetc or getc to read one character at a time from the input file, and ungetc to put back one character onto the input stream. See the man pages for more details about using these.

Make sure to test your lexical analyzer on both valid and invalid C-- programs (think about the types of errors a lexical analyzer can and cannot detect).

What to Hand in

For Parts 1-3

A hardcopy of a report containing the following should be submitted to me in class:
  1. A list of token types from part 1
  2. The regular expressions from part 2
  3. A drawing of your DFA from part 3
You are welcome to turn in a hand written solution or you can use some word processing software. In Unix, you can use xfig to draw figures, and then export them as postscript files that can be imported into a LaTeX document. Some example information about LaTeX is available here.

For Part 4

Submit a tar file with the following contents using cs75handin:
  1. A README.proj1 file with:
    1. Your name and your partner's name
    2. The number of late days you have used so far
    3. Indicate whether or not your parts 1-3 have changed since the checkpoint
    4. A brief explanation of how to run your program with an example command line.
    5. A file containing a small sample of output from your program (you can use the script command to capture all of a shell's stdin, stdout, and stderr to a file, and dos2unix to clean up the resulting typescript file)
    6. Sample test input file(s) that you used to collect the sample output you are submitting

  2. All the source and header files and the makefile needed to compile, run, and test your code. Do not submit object or executable files.

    The easiest thing to do is to create a tar file out of your top-level project subdirectory of your svn repository (you can either check out a clean version to tar up, or tar up a cleaned-up copy of your checked-out working version):

    $ cd ~/cs75/CS75_inits/project
    $ vi README.proj1 	# add your README file
    $ make clean
    $ cd ..
    $ tar cvf proj1.tar project
    
A Note on Code Style and Grading:
Correctness counts for only part of your grade. To receive an A, your solution must be correct, robust, and efficient, have good modular design, and be well commented. I am picky about well-commented code, and I suggest you read the section of my C code style guide about commenting C code. My general advice about commenting is to write the comment first, then write the code. This way your comment will help guide your code writing. Also, by writing the comment at the same time as the code, your comments and code are more likely to match, and it is more efficient (you do not have to go back later and figure out what a function does so that you can write its comment).

In addition, code you write should be easily extensible to allow for larger problem sizes. For example, if you use a fixed-sized char array to store the lexeme of the current token, then use a constant to define the size of this array (e.g. #define MAX_LEXEME 512) and use the constant rather than its value in your code (e.g. x = MAX_LEXEME; rather than x = 512;). This way, your program can be easily modified to handle a larger max lexeme size by making just one change to the constant's definition.

C-- Programming Language Specification:
Here is a one-page printable version of the C-- grammar that is listed below: grammar.pdf

C-- Grammar:

Program ------> VarDeclList FunDeclList 

VarDeclList --> epsilon 
		VarDecl  VarDeclList

VarDecl ------> Type id ;
                Type id [ num ] ; 

FunDeclList --> FunDecl 
 		FunDecl  FunDeclList

FunDecl ------> Type id ( ParamDecList ) Block

ParamDeclList --> epsilon 
 		  ParamDeclListTail 

ParamDeclListTail -->  ParamDecl 
 		       ParamDecl,  ParamDeclListTail 

ParamDecl ----> Type id
		Type id[]

Block --------> { VarDeclList StmtList } 

Type ---------> int
                char

StmtList -----> Stmt 
 		Stmt  StmtList 

Stmt ---------> ;
                Expr ; 
                return Expr ;
                read id ;
                write Expr ;
                writeln ;
                break ;
                if ( Expr ) Stmt else Stmt
                while ( Expr ) Stmt 
                Block

Expr ---------> Primary 
                UnaryOp Expr
                Expr BinOp Expr
                id = Expr 
                id [ Expr ] = Expr 

Primary ------> id
                num 
                ( Expr )
                id ( ExprList )
                id [ Expr ] 

ExprList -----> epsilon 
                ExprListTail 

ExprListTail --> Expr 
                 Expr , ExprListTail
 
UnaryOp ------> - | !

BinOp -------->  + | - | * | / | == | != | < | <= | > | >=  | && | ||