Complete Project (90%): Parts 1-4, Due: Tuesday Feb. 17 BEFORE 2 am (late Monday night)
You should submit a hard-copy of the checkpoint to me (stapled and with your names on it).
You should submit parts 1-3 when you submit part 4. You can do so electronically (in ASCII, postscript or pdf) with the part 4 files you tar up and submit via cs75handin, or you can give me a hard copy of your solution to parts 1-3 (if you changed it since the checkpoint, you should submit your new version of parts 1-3).
int x;
x = atoi("1234"); // x gets the integer value 1234
For character values, if you just assign a character to a variable, the variable's value is the ASCII value of that character:
x = 'c'; // x gets the ASCII value of c, which is 99
x = '\n'; // x gets the ASCII value of '\n', which is 10
Since your lexer reads one character at a time, for the first case you need to build up a string of digit characters (remember that in C you must explicitly null-terminate any string you build up one character at a time), and for the second case you need to grab the appropriate character. The special characters written with a backslash (e.g. '\n') have to be handled differently, since "\n" and '\n' are two very different things in C. Remember that your lexer will read the character ', then the character \, then the character n, and then the character ' for the character literal '\n'.
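The digit-string case described above might be sketched as follows. This is only an illustration, not the required design: the function name scan_int and the constant MAX_DIGITS are hypothetical, and the sketch assumes the next character in the stream is already known to be a digit.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>

#define MAX_DIGITS 32  /* hypothetical limit on the length of a NUM lexeme */

/* Reads a run of decimal digits from fp and returns its integer value.
 * Assumes the next character in fp is a digit. */
int scan_int(FILE *fp) {
    char buf[MAX_DIGITS + 1];
    int len = 0;
    int c;

    assert(fp != NULL);
    while ((c = fgetc(fp)) != EOF && isdigit(c)) {
        if (len < MAX_DIGITS) {
            buf[len++] = (char)c;  /* digits beyond the limit are dropped */
        }
    }
    if (c != EOF) {
        ungetc(c, fp);  /* the non-digit belongs to the next token */
    }
    buf[len] = '\0';    /* atoi needs a null-terminated string */
    return atoi(buf);
}
```

A real lexer would probably report an error for an over-long number rather than silently dropping digits; that policy is left open here.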
1234 0 23 'p' '\n'
You may assume that all int values appear in decimal (i.e. 10 as opposed to 0xa); you do not need to recognize int values expressed in hexadecimal (e.g. 0xf13e) as a NUM, although it should not be too much extra work to add support for hexadecimal, and you can certainly add it to your compiler if you'd like.
Because the only two types supported by C-- are int and char, you do not need to worry about numbers with decimal points or numbers that use the exponent format (i.e. 23.5 is not a valid NUM in C--).
In C, characters are represented internally as their ASCII values, and this is what you will do in C-- as well. See the man page for ascii if you are interested in seeing the ASCII values. In C, if you do the following, you can see that x stores the value 98 (the ASCII value of the character 'b'); it does not somehow store a ' and a b and a ':
x = 'b'; printf("%d\n", x);
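Turning a character literal in the input into its ASCII value could look roughly like the sketch below. The function name scan_char_literal is hypothetical, and the set of escape sequences it accepts (\n, \t, \0, \\, \') is an assumption made for illustration; check the C-- specification for the set you actually need.

```c
#include <assert.h>
#include <stdio.h>

/* Reads a character literal such as 'p' or '\n' from fp, assuming the
 * opening quote is the next character in the stream. Returns the
 * character's ASCII value, or -1 if the literal is malformed. */
int scan_char_literal(FILE *fp) {
    int c, value;

    assert(fp != NULL);
    if (fgetc(fp) != '\'') {
        return -1;
    }
    c = fgetc(fp);
    if (c == '\\') {
        /* Escape sequence: map the character after the backslash.
         * The accepted set is an assumption of this sketch. */
        switch (fgetc(fp)) {
        case 'n':  value = '\n'; break;
        case 't':  value = '\t'; break;
        case '0':  value = '\0'; break;
        case '\\': value = '\\'; break;
        case '\'': value = '\''; break;
        default:   return -1;
        }
    } else if (c == EOF || c == '\'' || c == '\n') {
        return -1;  /* empty or unterminated literal */
    } else {
        value = c;
    }
    /* The literal must end with a closing quote. */
    return (fgetc(fp) == '\'') ? value : -1;
}
```

The returned value can then be stored as the attribute of a NUM token, in the same way as the value of an integer literal.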
/* "*/" */ is an invalid comment in C--.
This form was from problem 3.3.5 in hw2, and it is not a comment
form you need to support (gcc doesn't). I've added a note about
whitespace and comments to the "Getting Started" section to help
clarify this.
Project Parts:
lexan that returns the next token read from a file stream.
'a' should be read as an integer token with value 97, the ASCII code of the character.
/* this is a comment that can go for several lines */
// this is a single line comment, where everything to the right of // is ignored
Do not worry about handling comments with a */ between matching double quotes (like the problem in hw2).
Off the lab 0 page are links to information about using svn, make, gdb, valgrind.
For this assignment, you will be adding code to the lexer and includes subdirectories in the starting point.
Whitespace characters in C are: the space character (' '), tab ('\t'), form-feed ('\f'), newline ('\n'), carriage return ('\r'), and vertical tab ('\v'). (See the man page for isspace.)
There are two forms of C style comments that you need to support:
// everything after slash slash to the end of the line ('\n') is a comment
/* everything between matching slash-star and star-slash is a comment (this style can span multiple lines) */
You do not need to handle comments where */ appears inside double quotes within the comment (i.e. /* "*/" */ is not a valid comment in C--, even though it was described as a valid comment in problem 3.3.5 from hw2). You may assume that as soon as you read a */ (after reading a /*), you are at the end of the comment. gcc doesn't recognize the special case from problem 3.3.5 as a valid C comment (thanks to Meggie and Adam, who tried it out), and even if it did, you do not need to handle this crazy case in your comment DFA.
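One possible way to skip whitespace and both comment forms before scanning a token is sketched below. The function name and its return codes are hypothetical. Because only one character of push-back is guaranteed for ungetc, the sketch reports a lone '/' to the caller through the return value instead of pushing two characters back.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

/* Skips whitespace and comments in fp. Return values (hypothetical):
 *   0  fp is positioned at the first character of the next token or at EOF
 *   1  a lone '/' (the division operator) was consumed
 *  -1  a block comment was still open at end of file */
int skip_blanks_and_comments(FILE *fp) {
    int c, next, prev;

    assert(fp != NULL);
    for (;;) {
        c = fgetc(fp);
        if (c == EOF) {
            return 0;
        }
        if (isspace(c)) {
            continue;
        }
        if (c != '/') {
            ungetc(c, fp);  /* first character of the next token */
            return 0;
        }
        next = fgetc(fp);
        if (next == '/') {
            /* line comment: discard through the end of the line */
            while ((c = fgetc(fp)) != EOF && c != '\n') {
                /* nothing to do */
            }
        } else if (next == '*') {
            /* block comment: the first star-slash ends it */
            prev = 0;
            for (;;) {
                c = fgetc(fp);
                if (c == EOF) {
                    return -1;
                }
                if (prev == '*' && c == '/') {
                    break;
                }
                prev = c;
            }
        } else {
            if (next != EOF) {
                ungetc(next, fp);  /* belongs to the token after '/' */
            }
            return 1;
        }
    }
}
```

With this design, quotes inside a comment get no special treatment, so /* "*/ already closes the comment, which matches the simplification described above.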
I suggest you look at the code in simplecompiler as a hint at how to
structure your lexical analyzer code:
~newhall/public/cs75/simplecompiler/
Your lexical analyzer does not need to create a symbol table (we will add this in a later part of the compiler).
Your emit function should output TOKEN.attr for every token. Some tokens may not have attributes, but for those that do, your lexical analyzer should output their values.
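A minimal sketch of such an emit function is shown here. The token codes, the function signatures, and the choice to print one token per line are all assumptions made for illustration; the starting-point code may define tokens and attributes differently.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical token codes; the starting-point code may define its own. */
typedef enum { IF, LPAREN, RPAREN, GE, ID, NUM, DONE } token_t;

static const char *token_names[] = {
    "IF", "LPAREN", "RPAREN", "GE", "ID", "NUM", "DONE"
};

/* Writes the printable form of one token into out (at most size bytes).
 * lexeme is used only for ID tokens and value only for NUM tokens;
 * all other tokens carry no attribute. */
void format_token(char *out, size_t size, token_t tok,
                  const char *lexeme, int value) {
    assert(out != NULL && size > 0);
    switch (tok) {
    case ID:
        snprintf(out, size, "%s.%s", token_names[tok], lexeme);
        break;
    case NUM:
        snprintf(out, size, "%s.%d", token_names[tok], value);
        break;
    default:
        snprintf(out, size, "%s", token_names[tok]);
        break;
    }
}

/* Prints one token per line on stdout (the layout is an assumption). */
void emit(token_t tok, const char *lexeme, int value) {
    char buf[600];
    format_token(buf, sizeof buf, tok, lexeme, value);
    puts(buf);
}
```

Separating the formatting from the printing keeps the output format easy to check in isolation.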
For example, if the input is:
if (x1 >= 'a')
The output of your program should look something like:
IF LPAREN ID.x1 GE NUM.97 RPAREN DONE
Your lexical analyzer should take a C-- source code file as a command line argument:
% ./lexan foo.c-- # assuming lexan is the name of my LA executable
To do this, use the argc and argv parameters to main (main.c in the starting point code you grabbed in lab 0 has an example of how to do this):
int main(int argc, char *argv[]) {
You can use fgetc or getc to read one character at a time from the input file, and ungetc to put back one character into the input stream. See the man pages for more details about using these.
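As an illustration of reading ahead with fgetc and putting a character back with ungetc, the sketch below tells > apart from >= using one character of lookahead. The token codes and the function name scan_greater are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical token codes for this sketch only. */
enum { TOK_GT, TOK_GE, TOK_OTHER };

/* Distinguishes > from >= with one character of lookahead. */
int scan_greater(FILE *fp) {
    int c;

    assert(fp != NULL);
    c = fgetc(fp);
    if (c != '>') {
        if (c != EOF) {
            ungetc(c, fp);  /* not ours: leave it for another scanner */
        }
        return TOK_OTHER;
    }
    c = fgetc(fp);
    if (c == '=') {
        return TOK_GE;
    }
    if (c != EOF) {
        ungetc(c, fp);      /* not part of this token: put it back */
    }
    return TOK_GT;
}
```

Only one character of push-back is guaranteed, so the sketch never calls ungetc twice in a row without an intervening read.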
Make sure to test your lexical analyzer on both valid and invalid C-- programs (think about the types of errors a lexical analyzer can and cannot detect).
script command to capture all of a shell's stdin, stdout, stderr to a file, and dos2unix to clean-up the resulting typescript file)
The easiest thing to do is to create a tar file out of the top-level project subdirectory of your svn repository (you could check out a clean version to tar up, or tar up a cleaned-up version of one of your checked-out versions):
$ cd ~/cs75/CS75_inits/project
$ vi README.proj1 # add your README file
$ make clean
$ cd ..
$ tar cvf proj1.tar project
In addition, code you write should be easily extensible to allow for larger
problem sizes. For example, if you use a fixed-sized char array to store
the lexeme of the current token, then use a constant to define the size
of this array (e.g. #define MAX_LEXEME 512) and use the
constant rather than its value in your code (e.g. x = MAX_LEXEME;
rather than x = 512;).
This way, your program can be easily modified
to handle a larger max lexeme size by making just one
change to the constant's definition.
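The sketch below shows the constant used for both the buffer declaration and the bounds check, so that changing the #define is the only edit needed to support longer lexemes. The function name scan_lexeme and the set of characters it accepts (letters, digits, underscores) are assumptions made for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stdio.h>

#define MAX_LEXEME 512

/* Reads an identifier-like lexeme into lexeme, which must hold
 * MAX_LEXEME bytes. Returns the number of characters stored, or -1
 * if the lexeme does not fit in the buffer. */
int scan_lexeme(FILE *fp, char lexeme[MAX_LEXEME]) {
    int len = 0;
    int c;

    assert(fp != NULL && lexeme != NULL);
    while ((c = fgetc(fp)) != EOF && (isalnum(c) || c == '_')) {
        if (len >= MAX_LEXEME - 1) {  /* keep one byte for '\0' */
            lexeme[len] = '\0';
            return -1;
        }
        lexeme[len++] = (char)c;
    }
    if (c != EOF) {
        ungetc(c, fp);  /* first character after the lexeme */
    }
    lexeme[len] = '\0';
    return len;
}
```

Note that the literal 512 appears only once, in the definition of MAX_LEXEME.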
Program ------> VarDeclList FunDeclList
VarDeclList --> epsilon
                VarDecl VarDeclList
VarDecl ------> Type id ;
                Type id [ num ] ;
FunDeclList --> FunDecl
                FunDecl FunDeclList
FunDecl ------> Type id ( ParamDeclList ) Block
ParamDeclList --> epsilon
                  ParamDeclListTail
ParamDeclListTail --> ParamDecl
                      ParamDecl , ParamDeclListTail
ParamDecl ----> Type id
                Type id [ ]
Block --------> { VarDeclList StmtList }
Type ---------> int
                char
StmtList -----> Stmt
                Stmt StmtList
Stmt ---------> ;
                Expr ;
                return Expr ;
                read id ;
                write Expr ;
                writeln ;
                break ;
                if ( Expr ) Stmt else Stmt
                while ( Expr ) Stmt
                Block
Expr ---------> Primary
                UnaryOp Expr
                Expr BinOp Expr
                id = Expr
                id [ Expr ] = Expr
Primary ------> id
                num
                ( Expr )
                id ( ExprList )
                id [ Expr ]
ExprList -----> epsilon
                ExprListTail
ExprListTail --> Expr
                 Expr , ExprListTail
UnaryOp ------> - | !
BinOp --------> + | - | * | / | == | != | < | <= | > | >= | && | ||