Compilers

Falcon

New Zealand falcon
Falcon

Due on Thursday, March 31st at 11:59 PM. This is a compiler lab. If you have a partner, the two of you will complete this lab as a team. If you do not have a partner, you are effectively in a team by yourself.

If you are working with a partner, please be familiar with the Partner Etiquette guidelines. You and your partner share a single repository and, barring unusual circumstances, will receive the same grade. If you experience any difficulties in your partnership, please alert your instructor as soon as possible.

If you are working alone, please don’t hesitate to seek help if you get stuck. Since there are no ninjas for this course and you don’t have a partner, your instructor is the only interactive resource available to you and is happy to help you through any problems you might encounter.

In either case, be familiar with the academic integrity policy! You are permitted to discuss high-level details of your compiler project with other students, but you should never see or share anything that will be turned in for a grade. If you need help, please post on the course forum or, if necessary, contact the instructor directly. Make sure to post privately if you are sharing code. If you have any doubts about what is okay and what is not, it’s much safer to ask than to risk violating the Academic Integrity Policy.

Overview

In this lab, we will extend the Eagle compiler to a new language called Falcon. Falcon introduces first-class functions and partial application. Optionally, you may also add anonymous functions.

Transitioning to Falcon

As usual, you’ll run some Git commands to prepare for Falcon:

git fetch
git checkout falcon
git merge eagle

The typical changes appear in src/languages/. You’ll also have an updated resources/printer.c to accommodate function closures.

Once you’ve merged the eagle branch into falcon, your code will not compile! This is expected. Adding higher-order functions to a language is a big change, so it’s going to take some work to get back to a functioning compiler. We will explain this below.

The Falcon Language

The syntax of Falcon function declarations and function calls is incompatible with the Dove syntax and the Falcon abstract syntax reflects this. We first present the syntax and semantics of Falcon and then discuss how to update the Eagle compiler incrementally to accommodate these changes.

Concrete Syntax

Falcon’s concrete syntax includes all of the features of Eagle, but function declarations and calls are written differently. For clarity, we present the full Falcon syntax here:

<program> ::=
  | <declaration_list> <expression>
  | <expression>

<declaration_list> ::=
  | <declaration> <declaration_list>
  | <declaration>

<declaration> ::=
  | def <identifier> <parameter_list> = <expression> end

<parameter_list> ::=
  | <identifier> <parameter_list>
  | <identifier>

<expression> ::=
  | <integer>
  | true
  | false
  | <identifier>
  | let <identifier> = <expression> in <expression>
  | if <expression> then <expression> else <expression>
  | after(<expression>)
  | before(<expression>)
  | print(<expression>)
  | isbool(<expression>)
  | isint(<expression>)
  | istuple(<expression>)
  | <expression> + <expression>
  | <expression> - <expression>
  | <expression> * <expression>
  | <expression> < <expression>
  | <expression> > <expression>
  | <expression> = <expression>
  | <expression> && <expression>
  | <expression> || <expression>
  | <expression>[<expression>]
  | <expression> <expression>
  | (<expression>)
  | (<expression>, <expression_list>)

<expression_list> ::=
  | <expression> , <expression_list>
  | <expression>

The notable changes from Eagle are:

Abstract Syntax

As usual, you can see the definitions of expressions for Falcon in src/language/asts.ml. Make sure to read over those definitions. In particular, note that ECall is no longer present. Instead, we have EAppl (for function “application”) which only carries two expressions. Multiple arguments are passed by chaining EAppl constructors (e.g. EAppl(EAppl(EVar f, EVar x), EVar y) for the code f x y).

Semantics

Unlike our previous languages, function declarations in Falcon create new variables named after those functions. These variables are bound to closures which define the function and the arguments which have been passed to that function so far. For instance,

def f x y =
  x + y
end

let g = f in
g 2 4

evaluates to 6. The variable f is bound to the closure of function f with no arguments. These closures will be values in Falcon and, as such, can be assigned to other variables (such as g). Arguments can then be applied to the closure to call the function.

Falcon supports partial application, so the code

def f x y =
  x + y
end

let increment = f 1 in
(increment 3, increment 7)

evaluates to the tuple (4, 8); the increment closure carries the argument 1 for parameter x and this argument is used each time increment is called in the tuple.

While we will represent closures in a fashion similar to tuples, closures are not tuples; they are a new kind of value. That is,

let identity = fun x -> x in
istuple(identity)

will evaluate to false.

Errors

Since we have now introduced higher-order functions, some of the errors which were easy to recognize in Dove have become extremely difficult to recognize in Falcon. You are no longer required to recognize the following compile-time errors:

You are also now responsible for an additional type of runtime error:

For instance, exit code 5 would result from the code (1 + 2) (3 + 4) because the evaluation of the first expression, 3, is not a function. This should be reported after evaluating both the function and the argument. For instance, the code 0 (1 + true) would return exit code 1 (“integer expected”) and not exit code 5 (“closure expected”) because, as in previous labs (e.g. Eagle), we will evaluate both subexpressions before checking their results. This runtime error should be reported in addition to the runtime errors from previous labs.

Implementing the Falcon Compiler

To begin your work on Falcon, you should get your code to a point where it compiles, even if this means losing some functionality. The following is recommended:

Once you have done this, all of your unit tests which don’t use function calls should be able to pass. You will eventually need to update the syntax of your test files to match the new syntax of Falcon.

Step 1: Closures

Once you’ve gotten your code to compile, you’re ready to add closures to your compiler. A closure must be able to store:

We will lay closures out in memory in that order. As with tuples, we will always store closures in the heap (since they require several bytes of memory). To distinguish closures from tuples, we will set the first bit of the first word of the closure to 1. Consider the following example:

(64 bits) (64 bits) (64 bits) (64 bits) (64 bits)
0x8000000000000002 0x0000000000000004 0x00000000080484f8 0x7FFFFFFFFFFFFFFF 0x000000000000001e

This sequence of words represents a closure in the Falcon heap. They are:

Creating closures for declared functions

The first step toward implementing closures in Falcon is to create a set of global variables containing the zero-argument closures for the functions declared by the program. We are creating global variables here for two reasons:

In the Eagle lab, you created a single global label heap_cursor to track heap memory. You should now create an additional label for the closure of each declared function in the program. For instance, for the program

def f x =
  x + 1
end

def g x y =
  x * y
end

f (g 2 3)

the data section of your assembly might look something like this:

section .data
align 8
heap_cursor:
  dq 0
align 8
closure_of_f:
  dq 0x8000000000000000, 1, fun_f
align 8
closure_of_g:
  dq 0x8000000000000000, 2, fun_g

Note that these initial closures always carry zero arguments. The second word (e.g. 1 or 2) indicates the number of parameters the function declares while the third word (e.g. fun_f or fun_g) is the label given to the user-declared function. The label, named by itself, refers to the memory address of the labeled code.

Checkpoint

Once you’ve completed the above steps, you should be able to write a program which declares a function but does not use it. Your compiler should work correctly on this program and produce a data section similar to the above.

Step 2: Binding Variables to Closures

Once you’ve declared the existence of these closures, your next step should be to have some means to refer to them. When a Falcon program refers to a function by its name (e.g. f), you should be able to retrieve that closure. To accomplish this, we’ll want to add bindings to the environments we use in compilation that relate the names of functions to the memory location in which their closures are stored. You’ll likely want to add code to environment.ml to support this at some point; use your own judgment about how to create the environments.

We must ensure that each function declaration results in a single variable binding. So far, we’ve written our compiler to start with an empty environment in the main expression and an environment containing parameters for each function body expression. We will add to both of these environments a binding for each declared function. For instance, in the program

def prod x y =
  x + y
end

prod

the main expression should be compiled with an environment which binds prod to its closure. The assembly code for evaluating prod in the above program might be

mov rax, closure_of_prod + 1

Note the use of + 1. The label closure_of_prod should be aligned to an eight byte boundary, so the last three bits of its address will be 000. If we add 1 to this, we get a value which conforms to the Falcon binary representation for a memory pointer.

To handle the above, you probably want to add a new assembly argument to assemblyLanguage.ml. The name ArgLabelOffset is recommended as it is consistent with the previous names we have chosen.

In addition to updating compiler.ml with these new bindings, you will need to update wellformedness.ml so that it knows that these variable names are legal. Make sure to update both files before proceeding.

Checkpoint

Although you cannot yet call these functions, it should now be possible for you to produce the closure as a value. Write a program which declares a function; then, allow the main expression for that program to be just the name of the function (as above). Using the new version of printer.c you received as part of the starter code for Falcon, you should get a result that looks something like this:

<closure@00000000080484f8>[0/2](?,?)

The above is the printer’s way of displaying a function closure whose function is at memory address 0x00000000080484f8. That function expects 2 arguments but the closure has 0 arguments so far. The ?s represents the values which have not yet been provided to the closure.

Step 3: Growing Closures

Once we have established closures in the compiler’s environment, we should work toward growing them: creating new closures which carry additional arguments. The compilation of EAppl should generate assembly code that follows approximately this algorithm:

The above algorithm handles the case of growing closures as well as the case of calling functions (which is step 4 below). For now, focus on the first part. For the second part – the case in which you would call the function – you can simply generate an appropriate label and no instructions.

Copying memory with rep movsq

Writing a loop to copy memory in x86 assembly could be a bit tricky and rather error-prone. Thankfully, the processor comes with an instruction which makes it easy to copy several words of memory from one location to another.

To use this instruction:

The rep movsq instruction will proceed by copying one word from rsi to rdi, moving rsi and rdi one 8-byte word forward in memory, decrementing rcx, and repeating this process until rcx is 0. The following is an illustration of an example of how rep movsq works but is not necessarily an example of how you want to use it in your compiler. Learn how this instruction works and then think carefully about how you might use it in your compiler to do what you need to do.

Our example code is as follows:

mov rsi, rbp
add rsi, 16
mov rdi, [heap_cursor]
mov rcx, 3
rep movsq

At the start of rep movsq:

After rep movsq is complete:

To use this instruction, you can simply add an appropriate AsmRepMovsq constructor to your instruction type in assemblyLanguage.ml. You are not required to use rep movsq, but it will likely make your code much cleaner and easier to debug. It’s also considerably more efficient for the CPU.

Checkpoint

Once you’ve finished the code which grows closures, you should be able to see the closures growing for functions with several parameters. Try writing a program which declares a function f of six parameters. Then, call f with an argument and see if the result is an appropriate closure. Call that closure with another argument to see if it gets copied in correctly as well. You still won’t be able to execute functions, but you can grow the closures of multi-argument functions. We’ll dealing executing function code next.

Step 4: Calling Functions

Once you’ve managed to make your closures grow appropriately, move on to implementing the function-calling part of the algorithm above. Probably the trickiest part of this is to copy the arguments from the closure onto the stack in the right order, although this is easier than it seems.

Remember: arguments are pushed onto the stack in reverse order only because stack memory grows downwards; the first argument to a function is lowest in memory. The closures also keep their arguments in ascending memory order, so it is enough simply to copy those arguments from the closure onto the right place in the stack and then copy the last argument after them.

Checkpoint

At this point, your unit tests from Eagle should all pass! You’ll need to update the syntax of function declarations and function calls if you haven’t already done so. You’ll also need to change any tests that relied on zero-parameter functions. In those cases, just add a parameter that you don’t use and then call the function with a dummy value, like 0.

What’s more: you should now be able to pass functions as arguments to other functions! Try writing some programs which make use of your new compiler features. For instance, you could write an ntimes function that takes parameters f, n, and x and calls f on x a total of n times.

Miscellaneous details

Make sure to update the code for istuple as well as the code that validates that an argument is a tuple for e.g. tuple indexing. The Eagle version of this code can simply check the tag bits of the argument, but Falcon uses the same tag bits for tuple pointers and closure pointers. You’ll need to check both the tag bits and the first bit of the heap value to determine if an argument is a tuple or not.

Completion!

At this point, we have added the two key features of Falcon: first-class functions and partial application. Write some unit tests to be confident of your implementation and then congratulate yourself! The rest of the lab is optional; only the Falcon compiler is required for a grade. If you have the time and resources, however, you may find the optional part of this assignment interesting and/or informative.

Summary

To complete this assignment, you’ll need to

See below for an optional exercise in which you can implement anonymous functions like the fun x -> x syntax found in OCaml!

Submitting

It is not enough just to push your code. Due dates are flexible, so it is necessary for you to take action to make it clear which commit you would like to have graded. We will use Git tags for this. A tag is a way of giving a human-readable name to a particular commit. Unlike a branch, however, tags are expected not to change once they are created.

When you are finished working on this compiler assignment, commit and push your work. Then, once you are sure that there are no additional changes you need to make, run the following commands:

$ git tag falcon-submission
$ git push --tags

This will create a tag named falcon-submission and push that tag to the Swarthmore GitHub Enterprise server. Your work on that tag will be graded.

In addition to pushing and tagging your work, you will need to fill out a brief questionnaire found here. In most cases, the questionnaire should take less than a minute. This questionnaire is required and will be used as part of your participation grade.

If You Have Trouble…

…then please contact your instructor! The course forum is the preferred method, but you can reach out via e-mail as well. Good luck!

Finch: Anonymous Functions (optional!)

finch
Finch

The Falcon compiler includes two key features found in most functional languages: first-class functions and partial application. However, it’s missing a third feature: anonymous functions. You are not required to implement anonymous functions for this lab (and no course credit will be provided for doing so), but this language feature is extremely useful and a compiler which supports it is more satisfying. :) If you decide to implement this compiler, first commit and push your work on Falcon.

Let’s develop a small extension to Falcon called Finch. You can begin by running the following Git commands. Note that these commands are a bit different from your usual lab starters. In the usual lab starters, your instructor has already created the branch for you. In this case, you are creating a new Git branch yourself.

git branch finch
git checkout finch
git push -u origin finch

If, while developing Finch, you find a bug in your Falcon work, there is a section below to help you make your change to the Falcon compiler.

Syntax and Semantics

Anonymous functions can be supported simply by adding a new form of expression to the Finch grammar:

<expression> ::= ...
  | fun <identifier> -> <expression>

This syntax has the same meaning as in OCaml. For instance, the code

let f = (fun x -> fun y -> x + y) in
let increment = f 1 in
(increment 3, increment 7)

evaluates to (4, 8). Again, note that the expression form of let here is different from the declaration form of let above. We won’t allow expressions such as let f x y = x + y in .... We could – this expression is just shorthand for the anonymous function syntax in the above example – but it’s not the interesting part of implementing anonymous functions. (If you’d like to add that syntax as well, feel free to do so. Your instructor would be happy to advise!)

Parsing and the AST

To allow anonymous functions to appear in our programs, we must modify the parser and the AST. First, open src/language/asts.ml and add a constructor to the expr data type:

| ELambda of string * expr

This “lambda” constructor will represent our anonymous functions: the string contains the name of the variable while the expression is the function’s body. (Anonymous functions are traditionally denoted in programming language theory using the Greek letter λ, transliterated as “lambda”: hence the name.)

Once you have added the ELambda constructor, you will need to allow the parser to recognize anonymous function syntax and construct new ASTs for them. First, open src/language/lexer.mll and add the following cases somewhere in the middle of your lexer:

| "fun" { FUN }
| "->" { ARROW }

These cases must appear above the definition of the identifier case; don’t place them at the bottom of the file. Next, open src/language/parser.mly; we will add two lines. The first appears toward the top of the file. Between the last %token declaration but before the first precedence (e.g. %left or %right) declaration, there is a blank line. Replace it with

%token FUN ARROW

%nonassoc fake_FUNCTION_BODY

The first line tells the parser that FUN and ARROW are tokens we will use; these match the tokens that the lexer is now producing. The second line tells the parser that we are creating a new non-associative parsing precedence which is less important that every other precedence (because it’s first in the list). Then, we can add the rule for parsing anonymous functions by adding the following to the top of the expr cases:

| FUN IDENTIFIER ARROW expr { ELambda($2,$4) } %prec fake_FUNCTION_BODY

This should be enough to add anonymous function syntax to your lexer, parser, and AST.

Checkpoint

Once you’ve made these changes, you should be able to make clean, make tests, and run ./tests. Everything that worked in Falcon should work in Finch, but you’ll get warnings about the unhandled ELambda case.

Implementing Anonymous Functions via Closure Conversion

The only new feature of Finch is represented by the ELambda constructor. Here, we can use a reductive approach: since we know how to handle ASTs without ELambda expressions, could we just get rid of them? We will apply a technique called closure conversion to rewrite the Finch AST into an AST without any ELambda constructors before we compile.

Closure conversion will be a preprocessing step to the Finch compiler. That is, the steps of the compilation pipeline are:

  1. Parse the source code into an AST
  2. Perform well-formedness checking on the AST
  3. Closure-convert the AST to eliminate ELambda expressions
  4. Compile the resulting AST (with no ELambda expressions) as per Falcon

Our compile_expression function will operate with the invariant that it is never given an AST with an ELambda expression in it. You can add an ELambda case to its primary match expression that just runs failwith "ELambda in compilation" or something similar. If closure conversion works correctly, we should never be able to reach that case.

Examples of Closure Conversion

We will convert anonymous functions like fun x -> x into named functions (of the form DFunction) by searching the AST for them, picking a fresh name to use, creating a declaration with that name, and replacing the anonymous function with a reference to that new name as a variable. For instance, we might transform

def twice f x =
  f (f x)
end

twice (fun a -> a * 2) 5

into

def twice f x =
  f (f x)
end

def $lambda$1 a =
  a * 2
end

twice $lambda$1 5

Note here that, while $ is not a legal character for the Finch parser, you can put any string you want into the AST (provided that your output assembles). By using a character like $ which is illegal in Finch but legal in nasm, we can be sure that the user never wrote a variable with that name.

Non-Local Variables

When we closure-convert anonymous functions, we must be careful to handle non-local variables correctly. A non-local variable is one which is used within the function but not defined by it. For instance, in the program

let x = 2 in
(fun y -> x + y) 4

the variable x is non-local to the anonymous function. When we closure-convert this code, we will add the non-locals as parameters to the resulting declaration:

def $lambda$1 x y =
  x + y
end

let x = 2 in
$lambda$1 x 4

To do this correctly, you will want to implement a function for calculating the free variables in an expression AST. (See below for a brief explanation.) The non-local variables of an anonymous function are the free variables of its body minus the parameters it declares.

The Closure Conversion Algorithm

You are recommended to develop your closure conversion algorithm in a separate module (e.g. closureConversion.ml). The implementation of closure conversion will likely include the following subroutines:

  1. A recursive function free_vars to calculate which variables are “free” in an expression. A free variable is one which is used but not bound. Even if an expression has no unbound variables, its subexpressions may. For instance, the expression let x = 4 in x + x has no unbound variables but the subexpression x + x does. This function should probably take an expr and return a set of variable names.

  2. A recursive function closure_convert_expression which takes an expr and returns an expr * declaration list. This function will find each ELambda in the provided expression, create a new declaration to replace it, and then replace the ELambda with a reference to that new declaration. You probably want to use fresh_name to generate names for these new functions.

  3. A function closure_convert_declaration of type declaration -> declaration * declaration list. This function is not complex; it simply applies closure conversion to the body of a declared function.

  4. A function closure_convert_program of type program -> program. This function is also not complex. It calls closure_convert_declaration on each function declaration and closure_convert_expression on the main body of the program. Rather than returning the new declarations produced by closure conversion, it incorporates them into the program definition.

Once you have implemented this algorithm and called it from your compiler.ml file, your Finch compiler should be complete!

Git and Branches

It’s possible that, during Finch development, you’ll realize that you’ve made a mistake in Falcon. This kind of problem is not uncommon in industry projects, so there’s a fairly straightforward way to handle this using Git.

First: do not fix the bug yet. If you’ve found a bug in your Falcon implementation, it should be fixed on the falcon branch. We need to set your changes to the finch branch aside and return to the point in time that the falcon compiler was last changed. You may either commit your current changes to finch or run git stash to store them for later.

Once git status tells you that your working directory is clean, you can run git checkout falcon to return to the latest commit of the Falcon branch. Run make clean to be sure that none of the leftover OCaml binaries are left in your workspace. Then, fix the bug as you would’ve done if you hadn’t yet started work on Finch.

Once you’ve finished updating Falcon, commit and push your work. Then, run git checkout finch followed by git merge falcon. This will bring the fix that you’ve made into the finch branch so you don’t have to fix it twice. If you ran git stash earlier, run git stash pop now. Note that this may result in a “merge conflict”: a case in which the tool cannot figure out how to automatically merge the bug fix with your Finch code. In that case, you will have to help Git by editing the conflicted files so that everything compiles again and then create a new commit to finish the merge. In either case, you should now have