CS75 Project 2: Parser for C--

Checkpoint (10%): Due Thursday March 17 at the beginning of class
(you may not use a late day on the checkpoint)

Complete Project (90%): Due Tuesday March 29, BEFORE 9:30 am
Your project report may be submitted on-line as part of the tar file in ASCII, pdf, or postscript format, or may be submitted in class on the same day as you submit your tar file.


Problem Introduction and Specifications
Sample input and parser output
What to Hand in

Introduction

Design and implement a Recursive Decent parser for C-- as follows:
  1. Convert the C-- grammar from project 1 to an LL(1) grammar
    If you want to use latex to write up your solution, you can grab a copy of the grammar.tex file and a makefile for building .dvi, .ps, .pdf versions here:
    /home/newhall/public/cs75/grammar_latex/
    
  2. Implement a recursive decent parser that recognizes the LL(1) grammar. Use your project 1 solution as a starting point.
  3. Add simple error recovery.

Specifications:

As you convert the C-- CFG to LL(1), be careful that you are not changing the language that is accepted by the grammar. There is one part of the C-- CFG that cannot be converted to LL(1). The problem occurs in some forms of expressions involving the assignment operator. The difficulty is in determining if the expression on the left-hand-side is a valid lvalue at parse time. For example, these are all valid C-- expressions:
  
        a[i];	          // a[i] is not an lvalue here
        a[i] = x;         // a[i] is an lvalue here
	x = 8 + (y = 6);  // y is an lvalue here and (y = 6) is an expression with a value of 6
but these are invalid C-- expressions:
  
        a[i] + b[i] = x;    // (a[i] + b[i]) is not a valid lvalue
	x = 8 + y = 6;      // (8 + y) is not a valid lvalue 
                            // ( + has higher precedence than = ) 
Correctly handing these cases requires more than one token look-ahead or one or more token look-behind, neither which is allowed in a predictive parser. For now, the solution is to go ahead and translate the CFG to an almost equivalent LL(1) grammar. The resulting LL(1) grammar will be equivalent to the CFG for C-- with the exception that it will incorrectly accept strings like a[i] + b[j] = expr;. Later, when you implement the code generator part of the project, you will take care of ensuring that an expression appearing on the left-hand-side of an assignment statement is a valid lvalue; at the code generation phase, your compiler will reject strings like "a[i] + b[j] = expr;" that your parser accepts.

The precedence of C-- operators from lowest to highest is (your LL(1) grammar must handle precedence):

  1. assignment (=)
  2. or (||)
  3. and (&&)
  4. equal, not equal (==, !=)
  5. less than, less than or equal, greater than, greater than or equal (<, <=,>, >=)
  6. addition, subtraction (+, -)
  7. multiplication, division (*, /)
  8. not, unary minus (!, -)
  9. parentheses ()

Converting the CFG for C-- to an LL(1) grammar will be the most difficult part of the assignment. I strongly suggest first testing out your LL(1) grammar by hand to verify that its rules can correctly parse some very small C-- programs. Trying some very small examples with with zero, one, and two global variable declarations and one, and two very small functions. For example:

int x;
char y;

int main(int x, char y) {
  x = y;
}
In addition, I encourage you to compute FIRST and FOLLOW sets for all or part of your LL(1) grammar to prove that it is LL(1).

Once you have verified that your LL(1) grammar is in fact LL(1), and you have convinced yourself that it still parses C-- (you didn't modify the productions in such a way as to change the language it recognizes), then implement a recursive decent parser from your LL(1) grammar.

You should make sure to test that your parser can correctly parse all valid C-- language constructs and combinations of C-- language constructs before you start adding error recovery code.

Once your parser works, add some simple error recovery to your parser. You should implement simple panic and phrase-level recovery schemes (for a single missing or incorrect token); you do not have to implement complex error recovery, instead find simple cases of recovering from a single missing token during parsing. In addition, you should add error recovery to your lexer in cases where a missing or incorrect character(s) in the input can be easily handled (e.g. a missing '&' in the AND operator). When your compiler detects an error it will print out an error message, then either recover from the error and continue parsing or exit if it cannot recover. For example, after detecting the following error, the parser will print out an error message and continue parsing as if the missing RPAREN was there:

ID  LPAREN  ID      SEMI 
               ^^^^
                missing RPAREN in function call 

You are welcome to try more complicated error recovery, but first try implementing some relatively easy cases.


Sample input and output

Your parser will not actually construct a parse tree. Instead it should display EVERY non-terminal in your LL(1) grammar as it is being processed and EVERY terminal when it is matched. For example, when given the following program:
int main() {
  int x;
  x = 5 - 8 / 4;
  write x;
}
The parser should output something like the following:
# 
# Note: I've removed parts of my output that correspond to non-terminals in my LL(1)
# that are not from the original CFG.  I'm doing this because I want you to 
# convert the CFG to LL(1) without trying to see if your LL(1) is the same as 
# mine based on my sample output.
# 
# In your solution, you should print out EVERY non-terminal in your LL(1)
# grammar as it is parsed by your parser.
# 
program			      
MATCH: INT	      
MATCH: ID.main	 	      
MATCH: LPAREN		      
ParamDeclList		      # when your parser matches a token you should
MATCH: RPAREN		      # print out:  "MATCH: TOKEN.attribute" 
MATCH: LBRACE		      # in my grammar I actually have terminals for
Block			      # each keyword (ex. IF, WHILE, ...), however
VarDecList		      # I'm printing them out here as KEYWORD.attr
MATCH: INT     		      # you are welcome to do either
MATCH: ID.x
MATCH: SEMI
VarDecList
StmtList
Stmt
Expr				
MATCH: ID.x
MATCH: ASSIGN
Expr
MATCH: NUM.5
MATCH: MINUS
MATCH: NUM.8
MATCH: DIV
MATCH: NUM.4
MATCH: SEMI
StmtList
Stmt
MATCH: KEYWORD.write
Expr
MATCH: ID.x
MATCH: SEMI
MATCH: RBRACE
FuncDecList
MATCH: DONE

What to Hand in

Checkpoint

In class, submit a hard copy of your LL(1) grammar. If there is a part of your grammar you want me to specifically check, please indicate which part(s) and include the steps you took in converting the CFG to LL(1).

You do not need to include with your checkpoint a copy of the steps you took to convert G to LL(1). However, if you do, it will be more likely that I will be able to find any errors in your conversion for you (and actually, you will be less likely to make them). I suggest that you compute FIRST and FOLLOW sets for parts of the grammar (like arithmetic expressions, and the rules with lists to see if your grammar is LL(1)).

Project

  1. A tar file to be submitted using cs75handin containing:
    1. All the source header files and makefile I need to build and run your parser.
    2. A README file with:
      1. Your name and your partner's name
      2. The number of late days you have used on this assignment
      3. The total number of late days you have used so far
    3. A sample of your parser's output on a good (and short) test C-- program that you devised. Annotate the output listing the parts (statement(s) of the C-- program) are being parsed at different points in the output. See my Unix documentation about script and dos2unix for recording terminal I/O to a file.
    4. A copy of the test C-- program for which you submitted sample output.

  2. A Project Report (may be submitted on-line as part of your tar file or as a hardcopy in class)
    Your project report should contain:
    1. A listing of your LL(1) grammar
    2. A description of your error recovery
    3. A description of any parts of the language that you were unable to parse