On this page:
1 The Setup
1.1 Trying out Examples
1.2 More Examples
2 The Idea
3 Wrapping Up
3.1 The Let Case
3.2 Variable Generation
3.3 Too Many Variables?
ANF

We discussed in lecture on Tuesday how it can be useful to restrict the language the compiler sees, so that things like binary operators only have to work with values, rather than figuring out how to store the results of intermediate sub-expressions. We introduced a restricted form of the AST, and concluded that it would be useful to have an automatic conversion from the user-facing AST to the restricted AST the compiler sees.

All that’s left is to implement it! That’s the topic of this note.

1 The Setup

We have two languages, the user-facing one, and the compiler-facing one:

Input language:

type expr =

  | Num of int

  | Id of string

  | Plus of expr * expr

  | Minus of expr * expr

  | Let of string * expr * expr

Restricted language:

type immexpr =

  | ImmNum of int

  | ImmId of string

 

type cexpr =

  | CPlus of immexpr * immexpr

 

type aexpr =

  | ALet of string * cexpr * aexpr

Our goal is to write a function that takes the former and produces the latter:

let anf (e : expr) : aexpr =

  ...

1.1 Trying out Examples

Before we dive in, let’s think of some examples. Let’s first think in terms of concrete syntax, and then translate the examples to uses of the datatypes expr and aexpr above.

If we have an expression like

(5 + 4) - 2

That doesn’t satisfy our restrictions, because (5 + 4) appears nested inside the subtraction expression. So we’d want to convert it to something like:

let v = 5 + 4 in

v - 2

It’s clear that these two programs have the same answer, and the second one has the restriction that each nested expression is bound in a let. Let’s try and translate these two programs to the datums we would need for a test case. The input expression is

Minus(Plus(Num(5), Num(4)), Num(2))

and the output expression is:

ALet("v", CPlus(ImmNum(5), ImmNum(4)),

  CMinus(ImmId("v"), ImmNum(2)))

Look at that last example, then look at the types for aexpr. Do you see the bug? It’s a type error. See if you can find it before moving on.

There’s a type error in the last argument to ALet. The body of ALet needs to be an aexpr according to the type. But CMinus is a cexpr, so there’s a type error here. Oops. What’s happening here is our type doesn’t quite express all the programs we want. It needs to be just a little bit bigger, so we can talk about aexprs that actually consist of a compound expression, not just a let. So we can add that:

type aexpr =

  | ALet of string * cexpr * aexpr

  | ACExpr of cexpr

Now we can express the output program by using the ACExpr constructor:

ALet("v", CPlus(ImmNum(5), ImmNum(4)),

  ACExpr(CMinus(ImmId("v"), ImmNum(2))))

This actually brings to mind an even simpler example that could cause trouble. What about the program

37

How can we express that as an aexpr? And what about

let x = 5 in x

We also need to be able to express the case where an immediate value appears in a cexpr or aexpr position. We need one more extension to handle this:

type cexpr =

  | CPlus of immexpr * immexpr

  | CImmExpr of immexpr

So now 37 can be written as:

ACExpr(CImmExpr(ImmNum(37)))

Take a moment to figure out yourself how to express let x = 5 in x with these new extended definitions of aexpr and cexpr.

1.2 More Examples

With this definition in hand, we can think about what we want for some more interesting examples. Let’s try one with two levels of nesting, first:

(5 + (4 - 3)) + 2

What should this produce? It’s useful to think about the order in which we would evaluate the expressions, and note that 4 - 3 should happen first. So that ought to be the first let-binding we generate:

let v1 = 4 - 3 in

let v2 = 5 + v1 in

v2 + 2

As a test, we should have that

anf (Plus(Plus(Num(5), Minus(Num(4), Num(3))), Num(2)))

 

# equal to

 

(ALet("v1", CMinus(ImmNum(4), ImmNum(3)),

  (ALet("v2", CPlus(ImmNum(5), ImmId("v1")),

    (ACExpr(CPlus(ImmId("v2"), ImmNum(2))))))))

Let’s try nesting on both sides:

(5 + 4) - (3 + 2)

becomes

let v1 = 5 + 4 in

let v2 = 3 + 2 in

v1 - v2

For practice, try translating this example into the data structure representation. This gives us a few examples to use as guides while we work through the implementation.

2 The Idea

Let’s try writing the anf function now that we have these examples in mind. As always, we’ll start with match.

let rec anf (e: expr) : aexpr =

  match e with

    | Num(n) ->

    | Id(x) ->

    | Plus(left, right) ->

    | Minus(left, right) ->

    | Let(x, value, body) ->

It seems at first like the Num and Id cases are easy. We just need to wrap them in ACExpr and CImmExpr, use the Imm version of the constructor, and we’re done:

| Num(n) -> ACExpr(CImmExpr(ImmNum(n)))

| Id(x) -> ACExpr(CImmExpr(ImmId(x)))

Superficially, this seems to work for trivial programs like 5, which generate an aexpr version of themselves. Let’s next tackle Plus, and see if things stay easy. In the Plus case, we have the left and right sub-expressions to consider, which are both exprs. We can start by calling anf on both of them to get converted versions:

| Plus(left, right) ->

  let aleft = anf left in

  let aright = anf right in

    ...

Now we just have to fill in the ..., presumably using aleft and aright. What do we want to get out? Let’s think about one of our examples, which should guide us. One of our simple cases was that we should have:

anf (Minus(Plus(Num(5), Num(4)), Num(2)))

 

# becomes

 

ALet("v", CPlus(ImmNum(5), ImmNum(4)),

  CMinus(ImmId("v"), ImmNum(2)))

Let’s think about what we expect for the two recursive calls, given this setup. Calling (anf (Plus(Num(5), Num(4)))), which is the value of aleft in our example, should give us

ACExpr(CPlus(ImmNum(5), ImmNum(4)))

since that’s the obvious translation. Then the right-hand-side, which is Num(2), should become ACExpr(CImmExpr(ImmNum(2))). To summarize, we have:

aleft = ACExpr(CPlus(ImmNum(5), ImmNum(4)))

aright = ACExpr(CImmExpr(ImmNum(2)))

and we want to combined them to produce:

ALet("v", CPlus(ImmNum(5), ImmNum(4)),

  CMinus(ImmId("v"), ImmNum(2)))

How do we do that? It’s... not really obvious at all. There are a few wide open questions:

Let’s step back a little bit. Before we commit to a particular approach, let’s look at a few of our other examples. Let’s first consider the case of nesting on both sides:

anf (Plus(Plus(Num(5), Num(4)), Plus(Num(3), Num(2))))

 

# equal to

 

ALet("v1", CPlus(ImmNum(5), ImmNum(4)),

  ALet("v2", CPlus(ImmNum(3), ImmNum(2)),

    ACExpr(CPlus(ImmId("v1"), ImmId("v2")))))

There are some interesting features of this example. The innermost body, containing the final CMinus, has both its left and right hand sides replaced by identifier expressions. Since both weren’t immediate values, they needed to be evaluated first, stored in identifiers, and the identifiers used in their place. Maybe the only problem above was that we needed a helper other than anf, which returned cexprs rather than aexprs. So we might try something like the following, imagining that we could write such a helper:

| Plus(left, right) ->

  let aleft = anf_but_as_cexp_not_aexp left in

  let aright = anf_but_as_cexp_not_aexp right in

  ALet("v1", aleft,

    ALet("v2", aright,

      ACExpr(CPlus(ImmId("v1"), ImmId("v2")))))

This breaks down, though, when we consider another earlier example, that of double-nesting on the left:

anf (Plus(Plus(Num(5), Minus(Num(4), Num(3))), Num(2)))

 

# equal to

 

ALet("v1", CMinus(ImmNum(4), ImmNum(3)),

  ALet("v2", CPlus(ImmNum(5), ImmId("v1")),

    ACExpr(CPlus(ImmId("v2"), ImmNum(2)))))

Here, the left-hand side is Plus(Num(5), Minus(Num(4), Num(3))), which cannot become a single cexpr, because its own anf conversion would require introducing a let. This really gets at the crux of the issue, which is that we don’t know, without recursing deep into the structure of nested expressions, how many levels of let-bindings will be needed to fully process an expression. All we know is that we need to turn both sides into a sequence of nested lets, and substitute in the resulting immediate value (an identifier), for the appropriate position.

So our binary operator expressions need to become expressions with holes that we can fill in with the appropriate identifiers. When we go to anf an expression like

Plus(Plus(Num(5), Minus(Num(4), Num(3))), Num(2))

We want to break it into:

Plus(●, ●)

and

ALet(... maybe many intermediate steps for lhs ...,

  ALet("v1", ... final value for lhs ...,

    ALet(... maybe many intermediate steps for rhs ...,

      ALet("v2", ... final value for rhs ...,

        ... here we can fill in the ●'s above with v1, v2 ...))))

If we could express this split, we may be able to store the Plus-expression-with-hole until the anf conversion on its left and right sides is done, and then fill in the missing pieces. Here’s the key insight we’ll use for tackling the implementation of ANF: we can represent expressions with holes in them as OCaml functions. That is, a plus expression that’s “waiting” for its arguments can be represented as:

fun limm rimm -> CPlus(limm, rimm)

The limm and rimm are chosen to mean “left immediate” and “right immediate,” so this function represents a CPlus expression that is waiting for two immediate values to be provided. Then, we can set up anf so that it is able to fill in these values once the appropriate intermediate bindings have been added.

To do this, we’re going to change the signature of anf a bit. It’s going to take an additional argument, which will be a function representing an expression with a immexpr-shaped hole in it:

let rec anf (e : expr) (expr_with_hole : (immexpr -> aexpr)) =

  ...

The job of anf for each case will be to fill in the hole with a immexpr (usually an ImmId) that is the result of the last let-binding needed to evaluate the sub-expressions of e. To fill in the hole, we simply apply the function expr_with_hole. In the base cases of numbers and identifiers, we can simply pass in the immediate value directly, since we have the correct value to put in the hole:

let rec anf (e : expr) (expr_with_hole : (immexpr -> aexpr)) =

  match e with

    | Num(n) -> (expr_with_hole (ImmNum(n)))

    | Id(x) -> (expr_with_hole (ImmId(x)))

    ...

This captures an idea – when performing anf, we don’t just want to convert a single subexpression, like a number or identifier, on its own. We also need to know the context in which its result should be placed. That’s what expr_with_hole is for. Next we can ask what happens in the binary operator case. First, we need to anf the left-hand side. Now that anf takes two arguments, we need to decide what second argument to pass to ANF, which must be a function:

| Plus(left, right) ->

  anf left (fun ...)

We know from the type of the expr_with_hole parameter to anf that the argument to the function we provide will be an immexpr. We know from the discussion above that it will ben an immexpr representing the left-hand-side value. So we want to use that in a CPlus expression, on the left-hand side:

| Plus(left, right) ->

  anf left (fun limm -> CPlus(limm, ...))

But we need the immediate expression for the right-hand-side, too. That we can get from using anf on right. So before we try to construct the CPlus expression, we need to get that immediate expression:

| Plus(left, right) ->

  anf left (fun limm ->

    anf right (fun rimm ->

      CPlus(limm, rimm)))

So we’ll anf the left-hand-side, then the right-hand-side, and construct a CPlus expression with the results, applied one at a time. This lets us construct a CPlus with immexprs for both left and right.

There’s something a little funny about this code, though. First of all, the Plus case doesn’t use the expr_with_hole argument at all. It’s worth being suspicious of code that doesn’t use one of the (presumably important) parameters to a function. Also, there’s a type error again! The expr_with_hole parameter is supposed to have the type (immexpr -> aexpr). But the return we’ve given here has type cexpr, since it’s a CPlus expression.

It’s worth thinking through what will happen if we try this on one of our examples, using our trusty tactic of using substitution to work through it:

    anf (Plus(Plus(Num(5), Num(4)), Plus(Num(3), Num(2))))

 

(* Substitute the large expression in, and match the Plus case: *)

 

=>  anf (Plus(Num(5), Num(4)) (fun limm ->

      anf (Plus(Num(3), Num(2))) (fun rimm ->

        CPlus(limm, rimm)))

 

(* Process the next anf call, another Plus case, with Num(5) and Num(4) as

   left and right.  Did we lose some information here? *)

 

=>  anf (Num(5)) (fun limm ->

      anf (Num(4)) (fun rimm ->

        CPlus(limm, rimm)))

 

(* In the number case, we *apply* the given function to an ImmNum: *)

 

=>  ((fun limm ->

      anf (Num(4)) (fun rimm ->

        CPlus(limm, rimm)))

     (ImmNum(5)))

 

=>  (anf (Num(4)) (fun rimm -> CPlus(ImmNum(5), rimm)))

 

=>  ((fun rimm -> CPlus(ImmNum(5), rimm)) (ImmNum(4)))

 

=>  CPlus(ImmNum(5), ImmNum(4))

This is not the answer we’d want at all – this totally dropped the Plus(Num(3), Num(2)) part of the expression! That part of the process was part of the suspiciously lost information in the second step above. Since we never actually use that expr_with_hole, it is simply lost. So we need to re-examine the Plus case, and see how we should use the expr_with_hole parameter:

| Plus(left, right) ->

  anf left (fun limm ->

    anf right (fun rimm ->

      (* need to use expr_with_hole here... *)

      CPlus(limm, rimm)))

Based on our earlier discussion, we know we want to apply expr_with_hole to an immediate expression representing the value of the current expression. Said another way, we need an identifier that will hold the value of the current expression, where the current expression is the Plus we’re working on. We need to create a let-binding to store the result of the CPlus expression! Then we can create a immexpr—an ImmId that holds the identifier bound in the let—and pass that on to expr_with_hole:

| Plus(left, right) ->

  anf left (fun limm ->

    anf right (fun rimm ->

      ALet("result_of_plus", CPlus(limm, rimm)

        (expr_with_hole (ImmId("result_of_plus"))))))

One thing we need to consider before testing this for real is that now that we have anf as a two-argument function, it’s a little trickier to know how to call it the first time. If all we have is an expr, what do we pass in for the initial value of expr_with_hole? In fact, we cheated on this in the substitution-evaluation example above, and just skipped the second argument in the initial call!

The goal of the program is to produce the answer for the original expression. So if we use an expr_with_hole that simply returns the returned immexpr unchanged, that should do the trick. That means a call like:

anf original_expr (fun ie -> ACExpr(CImmExpr(ie)))

will be needed. With this in mind, work through on your own this call to anf:

anf (Plus(Plus(Num(5), Num(4)), Plus(Num(3), Num(2))))

    (fun ie -> ACExpr(CImmExpr(ie)))

Is the result what you expected? Does it exactly match the test case we wrote?

3 Wrapping Up

There are a few remaining things to tackle.

3.1 The Let Case

One is the Let case:

| Let(x, value, body) ->

  ...

Based on the discussion above, we need to make sure we anf the value and the body, and pass an immexpr representing the result of the body into expr_with_hole. We also can’t forget to do the work of the Let itself, which is to bind the variable!

| Let(x, value, body) ->

  (anf value (fun immval ->

    ALet(x, CImmExpr(immval),

      (anf body (fun immbody ->

        (expr_with_hole immbody))))))

There’s a tiny bit of cleanup we can do to this code. Any time we have a single-parameter function that just applies another function to its argument, we can simply use that other function instead:

| Let(x, value, body) ->

  (anf value (fun immval ->

    ALet(x, CImmExpr(immval),

      (anf body expr_with_hole))))

Here we just need to pass along expr_with_hole when we anf the body.

3.2 Variable Generation

Another thing to clean up is the names we use in the generated ALets we create. We can’t really just name them all result_of_plus as we did above – we could end up with generated ANF’d code like:

(5 + 4) + (3 + 2)

 

# becomes

 

let result_of_plus = 5 + 4 in

let result_of_plus = 3 + 2 in

let result_of_plus = result_of_plus + result_of_plus in

result_of_plus

This gives a different answer than intended. So we need a method for generating new ids on the fly that don’t conflict with others. A simple way to do this is to create a function that uses a mutable counter, and appends the value of the counter to a string after incrementing it. Then each call creates a fresh string that we can use:

let count = ref 0

let gen_temp base =

  count := !count + 1;

  sprintf "temp_%s_%d" base !count

Then the Plus step above would become:

| Plus(left, right) ->

  let varname = gen_temp "result_of_plus" in

  anf left (fun limm ->

    anf right (fun rimm ->

      ALet(varname, CPlus(limm, rimm)

        (expr_with_hole (ImmId(varname))))))

This would produce output like:

(5 + 4) + (3 + 2)

 

# becomes

 

let result_of_plus1 = 5 + 4 in

let result_of_plus2 = 3 + 2 in

let result_of_plus3 = result_of_plus1 + result_of_plus2 in

result_of_plus3

Which has the correct value if we compile and run it. If we use "v" rather than "result_of_plus", we can get something resembling our initial examples.

3.3 Too Many Variables?

Let’s take a closer look at this example:

(5 + 4) + (3 + 2)

 

# becomes

 

let v1 = 5 + 4 in

let v2 = 3 + 2 in

let v3 = v1 + v2 in

v3

It doesn’t actually match what we wanted in the beginning, which skipped the third variable binding:

let v1 = 5 + 4 in

let v2 = 3 + 2 in

v1 + v2

It seems like we’re doing one step more of work than we need to. The ANF transformation in this writeup is actually just the simplest of a family of possible ANF transformations. It’s the one we’ll study and implement in detail, but there are tweaks to it that add fewer intermediate variables. One thing you can try on your own is to define two mutually recursive anf functions, with these signatures:

let rec anf_c (e : expr) (expr_with_c_hole : cexpr -> aexpr) : aexpr = ...

 

let rec anf_imm (e : expr) (expr_with_imm_hole : immexpr -> aexpr) : aexpr = ...

Call the first when the result is needed in a context that can take a cexpr, and the second when the result is needed in a context that can take a immexpr.