CS75: Project 1

Introduction

Do an update75 to get the starting point files for this project. This will create the directories cs75/projects/1a and cs75/projects/1b.

Implementing the code for the lexical analyzer is straightforward once you have developed a correct DFA that accepts all of the tokens in the language. Therefore, you will complete parts 1-3 (given below) and turn them in before beginning the implementation in part 4.

1. List the set of tokens to be returned by your lexical analyzer in the file:
cs75/projects/1a/tokens

2. Define regular expressions for this set of tokens in the file:
cs75/projects/1a/regular-expressions

3. Derive a single DFA from your regular expressions. Turn in a hand-written drawing of your DFA in the slot outside my office door or provide a computer-generated drawing in the 1a directory.

4. Implement the DFA in a Python program similar in structure to the scanner example provided for the infix to postfix translator we discussed in class. Specifically, each state of your DFA should be implemented as a method in your LexicalAnalyzer class. Your implementation should be in the file: cs75/projects/1b/scanner.py
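
As a rough illustration of the "one method per state" structure, here is a minimal sketch of a toy DFA that recognizes only identifiers and unsigned integers. The class name TinyScanner, the method names, and the token names are all hypothetical, and the sketch scans a string rather than a file; it is not the scanner example from class.

```python
class TinyScanner:
    """Toy DFA: each state is a method that returns the next state or a token."""

    def __init__(self, text):
        self.text = text
        self.pos = 0
        self.lexeme = ''

    def _peek(self):
        # Current character, or '' once the input is exhausted.
        return self.text[self.pos] if self.pos < len(self.text) else ''

    def get_token(self):
        # Run the DFA from the start state until a state returns a token tuple.
        self.lexeme = ''
        state = self.state_start
        while callable(state):
            state = state()
        return state

    def state_start(self):
        ch = self._peek()
        if ch == '':
            return ('done', None)
        if ch.isspace():
            self.pos += 1
            return self.state_start
        if ch.isalpha() or ch == '_':
            return self.state_id
        if ch.isdigit():
            return self.state_num
        self.pos += 1
        return ('err', 'unexpected character ' + ch)

    def state_id(self):
        ch = self._peek()
        if ch.isalnum() or ch == '_':
            self.lexeme += ch
            self.pos += 1
            return self.state_id
        return ('id', self.lexeme)

    def state_num(self):
        ch = self._peek()
        if ch.isdigit():
            self.lexeme += ch
            self.pos += 1
            return self.state_num
        return ('num', int(self.lexeme))
```

Each state method either consumes a character and names the next state, or stops and returns a (token, value) pair. A full scanner would add states for operators, comments, and character literals in the same style.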

Specifications

The lexical analyzer should be implemented as a Python class. The constructor should take the name of the file to be scanned and an instantiated symbol table.

The lexical analyzer should provide a method called getToken that returns the next token (and any associated value) read from a file.

Keyword tokens should be distinct from identifier tokens, and different identifier tokens should be distinct from one another. These distinctions should be made with the help of the symbol table.
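
One possible way to get this behavior is to preload the symbol table with the keywords, so that a lookup separates keywords from identifiers. The sketch below assumes a dictionary-backed table; the SymbolTable interface, the classify helper, and the token names are hypothetical, and the symbol table you are given may look different.

```python
class SymbolTable:
    """Hypothetical dictionary-backed symbol table."""

    def __init__(self, keywords):
        # Preload each keyword so that it maps to its own token name.
        self.entries = {word: word for word in keywords}

    def lookup(self, lexeme):
        # Lexemes not seen before are entered as identifiers.
        return self.entries.setdefault(lexeme, 'id')


def classify(table, lexeme):
    """Return a (token, value) pair for an identifier-shaped lexeme."""
    token = table.lookup(lexeme)
    # Identifiers carry their lexeme as the value; keywords carry no value.
    return (token, lexeme) if token == 'id' else (token, None)


table = SymbolTable(['int', 'if', 'else', 'return', 'write'])
print(classify(table, 'if'))
print(classify(table, 'main'))
```

Because the identifier's lexeme travels with the token, two different identifiers such as x and _abc stay distinct even though both have the token name id.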

The lexical analyzer should convert integer tokens, including character literals, into numeric values. For example, 'a' should be read as an integer token with value 97, the ASCII code of the character. Use ord in Python to determine the ASCII value of a character. The lexical analyzer should also handle the common escape characters, such as \n (newline) and \t (tab).
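
The conversion of a character literal might look like the following sketch. The function name char_literal_value and the ESCAPES table are hypothetical, and the escapes beyond \n and \t are an assumption about what "common" covers.

```python
# Assumed escape set: only \n and \t are named in the specification.
ESCAPES = {'n': '\n', 't': '\t', '\\': '\\', "'": "'", '0': '\0'}


def char_literal_value(lexeme):
    # Convert a quoted character literal, quotes included, to its integer code.
    if len(lexeme) < 3 or lexeme[0] != "'" or lexeme[-1] != "'":
        raise ValueError('bad character literal: ' + lexeme)
    body = lexeme[1:-1]  # strip the surrounding single quotes
    if len(body) == 1 and body != '\\':
        return ord(body)
    if len(body) == 2 and body[0] == '\\' and body[1] in ESCAPES:
        return ord(ESCAPES[body[1]])
    raise ValueError('bad character literal: ' + lexeme)


print(char_literal_value("'a'"))    # 97
print(char_literal_value("'\\n'"))  # 10
```

In a real scanner the ValueError cases would become err tokens rather than exceptions.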

The lexical analyzer should ignore comments, which can take either of the following forms:
```
/* 
  a comment
  that can go for several lines
*/
// a single line comment, where everything to the right is ignored
```
When a comment is encountered, the lexical analyzer should return the next valid token following the comment. Comments should NOT produce tokens.
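
To make the comment rule concrete, here is a small sketch that finds where a comment ends. It uses string searching on an in-memory string rather than explicit DFA states, so it illustrates the behavior, not the required implementation; the name skip_comment is hypothetical.

```python
def skip_comment(text, pos):
    """Return the index just past a comment starting at pos, or pos if none.

    Raises ValueError for a block comment that is never closed.
    """
    if text.startswith('//', pos):
        end = text.find('\n', pos)
        # Stop at the newline so the caller can still count lines.
        return len(text) if end == -1 else end
    if text.startswith('/*', pos):
        # Search from pos + 2 so that "/*/" is not treated as closed.
        end = text.find('*/', pos + 2)
        if end == -1:
            raise ValueError('unterminated comment')
        return end + 2
    return pos


src = "/* a\n b */x // tail\ny"
print(src[skip_comment(src, 0)])  # x
```

After skipping, the scanner simply continues from the returned position and produces the next real token, so no token is ever emitted for the comment itself. Newlines inside a block comment still need to be counted for error reporting.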

At the end of the file, the lexical analyzer should return a special 'done' token.

The lexical analyzer should report any error it finds by line number and continue trying to scan the file.

Testing your scanner

Below is an example of C-- code that can be used to test your scanner. Note that your actual token names do not need to be the same as the ones shown in this example.

```
// this will be ignored
int main() {
  x = 0;
  _abc = 8;

  /* this too
     will be
     ignored */

  if (x & 10)
    write x;
  else
    ;

  return 0000042
}
```

Notice that this code contains a number of errors: some can be caught by the scanner, some by the parser, and some only during code generation. For example, the last statement is missing a semicolon. This error can be recognized by the parser, but not by the scanner. Also, variables must be declared at the top of a block in C--, but the variables x and _abc have not been declared. This error can be caught in code generation, but not by the scanner. However, the scanner can recognize that the && operator must consist of two ampersands. In addition, the scanner can recognize that an integer may not begin with leading zeros. In both of these cases, the scanner returns an err token with an associated message.
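
The two scanner-level errors could be detected along the lines of this sketch. The helper names check_number and check_ampersand and the token name 'and' are hypothetical; the message formats are copied from the sample output on this page.

```python
def check_number(lexeme, line):
    # A lone '0' is fine; a longer run of digits must not start with '0'.
    if len(lexeme) > 1 and lexeme[0] == '0':
        msg = 'Line %d: integers cannot have leading zeros %s' % (line, lexeme)
        return ('err', msg)
    return ('num', int(lexeme))


def check_ampersand(text, pos, line):
    # Assumes text[pos] is '&'. Returns (token, position after the token).
    if text.startswith('&&', pos):
        return ('and', None), pos + 2
    # Consume the single '&' so that scanning continues after the error.
    return ('err', 'Line %d: missing ampersand in and' % line), pos + 1


print(check_number('0000042', 15))
print(check_ampersand('x & 10', 2, 10))
```

In both cases the offending characters are consumed before the err token is returned, which is what lets the scanner keep going and report later errors in the same file.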

Here is the type of output that should be produced by the scanner for the above test file.

```
int
id main
lparen
rparen
lbrace
id x
assign
num 0
semi
id _abc
assign
num 8
semi
if
lparen
id x
err Line 10: missing ampersand in and
num 10
rparen
write
id x
semi
else
semi
return
err Line 15: integers cannot have leading zeros 0000042
rbrace
done
```

Submit

Run handin75 by the end of Monday, Jan. 31 to turn in parts 1-3. Run it again by the end of Monday, Feb. 7 to turn in part 4, the implemented scanner.
