Compilers

Egg-Eater

Source: Wikimedia Commons: by Dawson [CC BY-SA 2.5]

Due on Monday, March 20th at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the Piazza forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

In this lab, we will extend the Diamondback compiler to a new language called Egg-eater. Egg-eater includes all of the functionality of Diamondback as well as heap-allocated tuples.

Transitioning to Egg-Eater

When you git pull, you’ll discover two new items:

src/languages/eggEater/, the directory containing our lexer, parser, and ASTs.
resources/printer.c, which has been adjusted to accommodate tuples.

As usual, we will begin by changing the LANGUAGE variable in language.mk from diamondback to eggEater. Immediately after doing so, you should run make clean, make tests, and ./tests.byte. Your tests from Diamondback should run correctly using the Egg-Eater parser, although you’ll get a lot of warnings about not handling the cases for tuples.

The Egg-Eater Language

As with previous languages, we will give the syntax and semantics for Egg-Eater. The only difference is the addition of tuples and operations performed on them, although these new features have broad implications for our compiler which are discussed below.

Concrete Syntax

Rather than show the full Egg-Eater syntax, it is clearer to highlight the new expressions for tuples:

<expression> ::=
  | ...
  | (<expression>, <expression_list>)
  | <expression>[<expression>]
  | istuple(<expression>)

<expression_list> ::=
  | <expression> , <expression_list>
  | <expression>

That is, our new grammar allows:

A comma-separated list of at least two expressions surrounded by parentheses (e.g. (4,5,6))
A bracketed indexing notation similar to Python (e.g. my_tuple[1])
A built-in function to determine if a value is a tuple at runtime

The remainder of the concrete syntax is as in Diamondback. Operational precedence is the same as in Diamondback but favors tuple indexing; that is, x + y[1] parses as x + (y[1]) and not as (x + y)[1] (the latter of which is never what we want).

Abstract Syntax

As usual, you can see the definitions of expressions for Egg-Eater in src/language/eggEater/expressions.ml. Make sure to read over those definitions.

Semantics

Tuple expressions in Egg-Eater construct new tuples. Tuple indexing expressions retrieve a value from the tuple where the leftmost element has index 0. For instance,

let t = (1,3,6,10) in
t[2]

evaluates to 6 because the tuple t has a 6 at index 2 (the third position). The istuple operator determines whether a value is a tuple or not, similar to isbool and isint. That is,

let t = (1,false,4) in
let n = 5 in
(istuple(t), istuple(n))

evaluates to the tuple (true, false). Note that tuples can contain any other kind of value (including other tuples!). Sub-expressions in a tuple evaluate from left to right; for instance,

let t = (print(1),print(2)) in
t[0]

must print

1
2
1

where the last 1 is produced by the driver showing the result of the program.

Runtime Errors

There are two new forms of run-time error which can occur in Egg-Eater:

The user could attempt to use access a non-tuple value by index (e.g. (1+2)[1]). We will use exit code 4 in this case.
The user could attempt to access a tuple using an invalid index (either negative or too large, e.g. (1,2,3)[8]). We will use exit code 5 in this case.

Your handling of the errors should be in the following order:

If the first expression is not a tuple, fail with exit code 4.
Otherwise, if the second expression is not an integer, fail with exit code 1.
Otherwise, if the second expression is negative or too large, fail with exit code 5.

These runtime errors should be reported in addition to the runtime errors from the Diamondback lab. Note that tuples are a new kind of data type which is neither an integer nor a boolean, so every operator that expects an integer or a boolean should produce the correct error if passed a tuple. For instance, (1,2) + 3 should produce the same error as true + 3.

The Heap

With the introduction of tuples, we are no longer able to store every value in 32 bits: tuples like (500, 1067) will require a different approach. We will store all tuples in Egg-Eater on the heap rather than the stack; we’ll then store a 32-bit pointer to those heap values in normal stack variables. To do this, of course, we’ll need to have some heap memory.

Creating the Heap

As with printing and runtime errors, memory allocation will require us to interact with the system. Rather than getting into system-specific memory allocation details, we will simply allocate a large block of memory in our C driver and pass it as an argument to snake_main.

To allocate the memory, use the C function malloc from the stdlib.h library. Make sure that you allocate enough memory for at least a million 32-bit values (e.g. sizeof(int)*1000000). In your driver’s main function, create an int* with malloc and pass it to snake_main as the only argument. You’ll need to update the declaration of snake_main in the same file.

Global Variables

We need to have access to this pointer throughout the execution of our snake_main function, so we need a global variable to store it. A global variable may be defined in the .data section of an x86 assembly file like so:

section .data
align 4
heap_cursor:
  dd 0

Line by line, the above code states:

section .data: The following lines are part of the data section of your program. We didn’t use a data section before, but we are introducing one now to allow the creation of global variables for this and future labs.
align 4: Everything in this section should be aligned to a 4-byte boundary.
heap_cursor:: A label identifying this location in memory. This works just like a label in the .text section of your code.
dd 0: Allocates a single 4-byte unit of memory containing the value 0.

Your code will require a single global variable to store the current heap pointer. You can access the content of this global variable by using the same indirection notation as with register addresses:

mov eax, [heap_cursor]

To support this, you may wish to add the following to your assemblyLanguage.ml file:

An instruction constructor XSection. This is a pseudo-instruction: it doesn’t wind up in your object file directly, but it gives the assembler some information about how to write the object file’s contents.
An instruction constructor XAlign, which is a pseudo-instruction like XSection.
An instruction constructor XDd. You can have any type of argument you like: a string list suffices, or you may want a list of your own memory constant types (e.g. raw_data list).
An address constructor XAddressByLabel. This is the constructor which would produce the [heap_cursor] text above.

Initializing the Heap Pointer

At the start of snake_main, the heap pointer is provided to you by the driver as a function argument. You should retrieve that argument and store it in heap_cursor. This should be enough to get started.

Heap Layout

Every value in an Egg-Eater program’s heap memory is a tuple. Each tuple of \(n\) elements requires \(n+1\) words of memory to store; the first word of memory is a machine integer (not an Egg-Eater integer) describing the number of elements in the tuple. For instance, the 3-tuple

(7,11,13)

will allocate 16 bytes on the heap which will be laid out as follows:

(32 bits)	(32 bits)	(32 bits)	(32 bits)
`0x00000003`	`0x000000e`	`0x00000016`	`0x0000001a`

The first value, 0x00000003, indicates the number of elements in the tuple. The remaining three values are the Egg-Eater values stored in each location (using the same representation of values that we use in local variables and on the stack).

Representing Tuples

Although the above gives us a plan for using the heap memory, we also need to figure out how to refer to it. To do this, we will have to modify the binary representation we used in Diamondback. Previously, we only had integers (ending in 0) and booleans (ending in 1). We’ll now use the following representations:

0xZZZZZZZ[zzz0]: the integer 0xZZZZZZZzzz, as in previous labs.
0xFFFFFFF[1111]: the boolean true, as in previous labs.
0x7FFFFFF[1111]: the boolean false, as in previous labs.
0xZZZZZZZ[zz01]: a pointer to the heap memory 0xZZZZZZZzz00.

For instance, 0xabeefbed (which is 0xabeefbe[1101]) is a pointer to the memory location 0xabeefbec (which is 0xabeefbe[1100]). This allows us to point to any memory location on a 4-byte boundary, which is acceptable since all of our values are 4 bytes in size.

Note that this representation gives us the same guarantees about integers but not about booleans. It is no longer sufficient to check the last bit of a word to determine if it is a boolean value; we must now check that the last two bits are 11 to decide if a value is a boolean. You may use a jump to do this; you might also choose to use xor, which sets the zero flag (used by jz and jnz) if the result of the xor is zero. In either case, this means you need to reconsider thigns like isbool and your boolean runtime checks.

To create a tuple, you need only allocate the appropriate amount of heap memory, set the first word to the size of the tuple, and copy the tuple’s contents to the rest of the allocated memory. The variable heap_cursor should contain a pointer to the next byte of free heap memory at all times. Don’t forget to set the last two bits of your pointer to 01 so it can be recognized as a heap pointer. Also, don’t worry about running out of memory; Egg-Eater programs that consume too much memory will simply crash, which is why we’re starting with such a large heap.

Accessing Tuple Contents

To access the contents of the tuple, we must perform several operations.

First, check to ensure that the item in question is actually a tuple. As described above, expressions like 4[5] should terminate with an appropriate error.
Verify that the index is an integer. Expressions like (1,2)[true] should terminate with an exit code of 1, since we expected but did not get an integer.
Mask off the last two bits of the pointer. The 01 tag bits will be used to indicate that the value is a pointer, but we don’t actually want to point to that location in memory!
Next, verify that the index is in bounds. Be careful here: the tuple size is in machine form but the index is in the Egg-Eater representation. If the index is negative or is too large, the program should terminate with the appropriate error described above.
Calculate the location of the element you want and copy it into eax.

The ordering of the above is important because it controls which error code you generate in situations where multiple errors exist. For instance, 8[true] should return exit code 4 (because 8 is not a tuple), not exit code 1 (even though true is not an integer).

The `ecx` Register

It’s quite tedious to perform the above calculation with only one register! For this lab, you should add the ecx register to your register data type. This way, you can store the address of a tuple in one register and the index of the tuple in the other.

x86 assembly supports a form of infix address computation like the following:

mov eax, [eax + ecx * 4]

This can be quite handy for e.g. retrieving values from tuples. If you like, you can use this syntax by adding a new constructor to your address data type (named e.g. XAddressByRegisterProductOffset) so that you can use this addressing form in your generated code.

Implementing the Egg-Eater Compiler

There are a few different tasks at hand to implement Egg-Eater, but many of them are not very interdependent. For this reason, the organization of the lab work is left to you. See below for a summary of the work you’ll need to do.

It will be especially helpful to use tools like gdb to debug your compiler. Refer to the Diamondback lab for suggestions.

Summary

To complete this assignment, you’ll need to

Update your driver and snake_main to track free heap memory in the variable heap_cursor
Add assembly code generation for tuples and tuple operations
Update your existing code that interacts with booleans to account for the change in tag bits
Add runtime error checking for when existing operations receive tuples

Submitting

To submit your lab, just commit and push your work. The most recent pushed commit as of the due date will be graded. For more information on when assignments are due, see the Policies page.

If You Have Trouble…

…then please contact your instructor! Piazza is the preferred method, but you can reach out via e-mail as well. Good luck!