Compilers

Bluebird

bluebird
Bluebird

Due on Thursday, February 17th at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the course forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

In this lab, we will extend the Auklet compiler to a new language called Bluebird. The Bluebird language includes boolean values and operations. Importantly, the Bluebird runtime — the general set of conventions and invariants we use when compiling and running Bluebird programs — will use a distinct binary representation in order to differentiate boolean values from integer values. Since we are extending your previous compiler, you will use the same Git repository as you did for your previous assignment.

Note that you need to have finished Auklet before working on this assignment!

Migrating to Bluebird

To migrate from one compiler to the next, we will use a feature of Git called Git branches. A branch names a point in time in your repository; your auklet branch, for instance, points to the last commit that you made when working on the Auklet compiler. Branches are often used to keep track of multiple lines of development in a repository; a project may have a “stable” branch and an “experimental” branch, for example, where most development is done in the latter and distributions of software are done in the former.

We will use branches in this course to keep track of the different versions of your compiler. Your Bluebird starter code is in a branch called bluebird which has been stored in your GitHub repository. First, you need to get the information about that branch by running

git fetch

in your compiler directory. You can then switch to the bluebird branch by running

git checkout bluebird

This causes Git to change all of your files to the most recent commit on the bluebird branch; this branch exists because your instructor created it for you. The bluebird branch currently contains only the starter files; it doesn’t contain any of the work you did to complete the Auklet compiler. To bring that work into the bluebird branch, we can now run

git merge auklet

This will combine all of the changes in auklet (the work you did on the last assignment) with the starter code in bluebird. Once you’re finished, you’ll be on the bluebird branch and you’ll have all of your Auklet code. At this point, you should be able to make clean, make, and run ./tests without seeing any difference in how Auklet worked. (You will, however, see some compiler warnings.)

This merge won’t change a lot. Most notably, it updates src/language/asts.ml file now contains the Bluebird AST constructors and the corresponding parser reads Bluebird syntax. It also adds files resources/printer.h and resources/printer.c which we will discuss later in this writeup.

The Bluebird Language

As in the previous lab, we introduce the language you will compile in three parts: the concrete syntax, the abstract syntax, and the semantics. The Bluebird language extends Auklet to include let bindings and if expressions with similar syntax to OCaml.

Concrete Syntax

The concrete syntax of Bluebird is:

<expression> ::=
  | true
  | false
  | <integer>
  | after(<expression>)
  | before(<expression>)
  | isbool(<expression>)
  | isint(<expression>)
  | <expression> + <expression>
  | <expression> - <expression>
  | <expression> * <expression>
  | <expression> < <expression>
  | <expression> > <expression>
  | <expression> = <expression>
  | <expression> && <expression>
  | <expression> || <expression>
  | <identifier>
  | let <identifier> = <expression> in <expression>
  | if <expression> then <expression> else <expression>
  | (<expression>)

This grammar is a superset of the Auklet grammar; that is, every Auklet program is also a Bluebird program. But Bluebird programs additionally admit the boolean constants true and false, the runtime type operators isbool and isint, a series of additional binary operators, and if expressions.

Syntactic precedence is as in OCaml. For instance, multiplication has higher precedence than addition and subtraction, which have higher precedence than let. Concretely, the code if b then 4 else 2 + 6 parses to the same expression as if b then 4 else (2 + 6); the code (if b then 4 else 2) + 6 has a different meaning and cannot be written without parentheses. Here is the order of operator precedence from highest to lowest:

It’s always okay to add parentheses when writing Bluebird programs if you’re not sure about operator precedence.

Abstract Syntax

The abstract syntax of Bluebird is still defined in src/language/asts.ml. It now includes constructors for the new expressions, unary operators, and binary operators shown above.

Semantics

The semantics of many of these expressions are similar to those of OCaml; we discuss the semantics of novel forms below. Consider the following examples:

Concrete syntax Abstract syntax Evaluated result
1 < 2 EBinaryOp(OpLessThan,EInt(1),EInt(2)) true
1 = 2 EBinaryOp(OpEqualTo,EInt(1),EInt(2)) false
true && false EBinaryOp(OpAnd,EBool(true),EBool(false)) false
isbool(false) EUnaryOp(OpIsBool, EBool(false)) true
isint(true) EUnaryOp(OpIsInt, EBool(true)) false
true || 1 EBinaryOp(OpOr,EBool(true),EBool(false)) undefined
if true then 2 else 3 EIf(EBool(true),EInt(2),EInt(3)) 2
if 1 = 2 then 5 else true EIf(EBinaryOp(OpEqualTo,EInt(1),EInt(2)),EInt(5),EBool(true)) true

These examples reveal some semantic details of Bluebird; we discuss them in more detail below.

Undefined Behavior

Bluebird has two types of values: integers and booleans. Some operators in Bluebird expect their arguments to be integers while others expect booleans; likewise, some operators produce integers while some produce booleans. For instance, the less-than operator < expects to receive integers but always produces a boolean.

If an operator in Bluebird is given a type of argument it does not expect, then the behavior of Bluebird is undefined; this means that your compiler makes no promises about what will happen. As a result, you don’t have to worry about these cases. Undefined behavior puts a lot of burden on the programmer, though, so our next compiler (Cardinal) will define this behavior more precisely.

Runtime Type Tests

There are two new unary operators in this grammar: isint and isbool. These operators can take any value and always return a boolean. isint, for instance, will return true if its argument is an integer and false otherwise.

Conditionals

Bluebird supports if expressions. The first argument of the if expression is expected to be a boolean. If this boolean is true, then the second expression (following the keyword then) is evaluated; if this boolean is false, then the third expression (following the keyword else) is evaluated. Crucially, the second expression is not evaluated if the boolean is false and the third expression is not evaluated if the boolean is true. Make sure to use the approach we discussed in lecture to avoid evaluating both branches!

Binary Representation

In Auklet, all values were integers stored in 64-bit signed form. Bluebird has two types – integers and booleans – and, to exhibit the behavior described above, our Bluebird programs must be able to distinguish between integers and booleans at runtime. We will do this by choosing a memory layout that makes it easy to distinguish between and operate on those values:

Note that, in this representation, the rightmost bit of every integer is 0 and the rightmost bit of both booleans is 1. We will discuss below a strategy for adjusting the compiler to use this representation.

Implementing the Bluebird Compiler

Once you have migrated your repository from Auklet to Bluebird, the following steps are recommending when implementing your Bluebird compiler.

1. Update your assembly AST

You’ll need several x86-64 assembly instructions to implement these features. The Guide to x86 Assembly from the University of Virginia may be helpful here. You’re welcome to add anything you like to your src/compiler/assembly.ml file, but the following constructors are suggested:

Once you have added these instructions, you’ll have to propagate those changes forward into your utility functions like code_of_instruction.

The r10 register

Not every instruction in the x86-64 instruction set can be applied to every combination of arguments. In fact, most instructions cannot operate on 64-bit constant values. We can use add rax, 1 to add one to the rax register, but the constant value we provide cannot exceed the range of 32 bits due to hardware limitations.

The instructions we care about can operate on registers, though. If we needed to add a large value like 10,000,000,000 (which is outside of the range of 32 bits) to rax, we can’t just add it directly. However, we can move the value into another register and then use the register in the addition, like so:

mov r10, 10000000000
add rax, r10

Here, the r10 register is being used to store a large constant so that it can be used in addition. This will be especially important when you’re dealing with Bluebird values like true, which is represented as 0xFFFFFFFFFFFFFFFF: you’ll need to store it in a register before you can use it.

To support this, you should add an R10 constructor to your register variant and then make the appropriate consequential changes in src/compiler/assembly.ml.

2. Migrate to the new binary representation

Next, change your compiler to generate programs that use the Bluebird binary representation. First, change driver.c to use printValue instead of printf. The printValue function is in the printer.c file, so you’ll need to do three things:

  1. Add #include "printer.h" to driver.c, as is usual for C programs.
  2. Replace the printf call with a printValue call.
  3. Open src/compiler/builder.ml and add printer.c to the list runtime_code. This list contains the names of all of the C source files that we use in our compiled programs.

Once you’ve finished making these changes, run your tests again. They will fail because your programs are using a signed 64-bit representation of integers; meanwhile, the printValue function expects to print a Bluebird integer. You’ll know that things are set up properly, though, when the failures report even numbers as being half of what they’re expected to be.

Now that you have the resources linked to your compiler, make the necessary changes to produce programs using this representation. The program 4, for instance, should now compile to mov rax, 8. Note that we’ve mentioned said that our OCaml compiler only has 63-bit integers, but you may need to compute a 64-bit integer here if the user provides a sufficiently large number. Here’s a piece of code that will use the OCaml Int64 module to solve this problem, turning n into a string representation of twice its value:

let string_of_twice_int (n : int) : string =
  Int64.to_string (Int64.mul (Int64.of_int n) (Int64.of_int 2))

The Int64 module is a bit awkward, but it’s very helpful in situations like this.

Don’t worry about conditions or the new operators yet; we’ll deal with those next. Once you’ve finished these changes, your old Auklet tests should pass again.

3. Implement the new operators

You should now be in a position to implement the new unary and binary operators. Make sure to add and test these features one at a time: add in true and false and then test that they work properly; then add in isbool and make sure it works on different types of inputs; then add in isint; and so on.

Label Names

Each time you compile a conditional expression, you’ll need to pick unique label names. While this is possible in a pure functional style, it’s quite tedious and error-prone. Instead, your support code for the Bluebird compiler includes a file freshening.ml with a function fresh_name that can provide you with unique names. For instance, you might write

open Freshening;;

...
    let name = fresh_name "isbool_end" in
    ...

which might assign the name variable a value like "isbool_end_4" or "isbool_end_200" or something similar. This name is guaranteed to be distinct from any name you’ve gotten from a previous call to fresh_name; the string you provide is simply for convenience, as it might help you debug the output.

4. Implement conditionals

Finally, implement if expressions. Recall that, in x86-64 assembly, we use cmp to compare two arguments and then decide what to do about it as a separate instruction. For instance, the code if true then 5 else 3 might be written in x86 assembly like so:

  mov rax, 0xFFFFFFFFFFFFFFFF
  mov r10, 0xFFFFFFFFFFFFFFFF
  cmp rax, r10
  jne label_else
  mov rax, 10
  jmp label_endif
label_else:
  mov rax, 6
label_endif:

Here, the jne will jump when the comparison fails (meaning that the conditional expression did not evaluate to true); that would lead 6 to be stored in rax. Here, though, the comparison succeeds, so the jne instruction has no effect. The value 0 is stored in rax; then, we jmp over the else instructions to the end of the snippet to avoid overwriting the value of rax.

Testing the Bluebird Compiler

You are not required to write tests for this lab, but you definitely should! You still have tests from your first lab, so you can use those to test the Auklet language features. You’ll want to write some more tests for Bluebird to make sure that things are working correctly. A good way to deal with this is as follows: every time you feel like you’ve finished a new feature, write a test for it. Every time you find a bug, write a test that fails because of the bug and then fix the bug afterward. If you do this, you’ll (1) have a good way of running your code as you fix your bugs and (2) quickly build up a library of tests so you don’t have to worry about breaking things in the future.

Debugging with gdb

You can use gdb to step-debug your compiled programs! You can start by running gdb from the command line just as you would for a C or C++ program:

$ gdb output/program.run

When you are greeted with the gdb prompt, you might want to enter the following commands:

To step through your program, use si or stepi (not s or step); this will step one instruction at a time. You can likewise use ni or nexti to step past function calls and finish will allow you to leave a function call you are currently in.

Note that you can create a file in your home directory – ~/.gdbinit – that contains one gdb command per line. These commands will be run when gdb starts, which prevents you from having to reconfigure things every time you run a program. For instance, the following is a .gdbinit file used by your instructor:

set disassembly-flavor intel
layout regs

If you are using gdb in other classes, be aware that you can comment out the lines in .gdbinit with a prefixed #.

Summary

To complete this assignment, you’ll need to

Unit tests are not listed on the bullet points above, but you should think of them as a natural part of the process.

Submitting

It is not enough just to push your code. Due dates are flexible, so it is necessary for you to take action to make it clear which commit you would like to have graded. We will use Git tags for this. A tag is a way of giving a human-readable name to a particular commit. Unlike a branch, however, tags are expected not to change once they are created.

When you are finished working on this compiler assignment, commit and push your work. Then, once you are sure that there are no additional changes you need to make, run the following commands:

$ git tag bluebird-submission
$ git push --tags

This will create a tag named bluebird-submission and push that tag to the Swarthmore GitHub Enterprise server. Your work on that tag will be graded.

In addition to pushing and tagging your work, you will need to fill out a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.

If You Have Trouble…

…then please contact your instructor! The course forum is the preferred method, but you can reach out via e-mail as well. Good luck!