Compilers

Neonate

baby snake
A Baby Snake

This is a lab activity and will not be submitted for a grade. The purpose of this activity is to familiarize you with the process of building a compiler. We will do this by building an extremely simplistic (and not very useful) compilation tool. Let’s begin by cloning a Git repository:

git clone git@github.swarthmore.edu:cs75-s17/neonate-<username>.git

If you inspect the contents of that repository, you’ll discover… nothing. There’s no code in it; just a .gitignore and a .merlin. This is intentional: we’re going to build a compiler from scratch. It is nice, though, to have a repository to help you keep track of your history.

The Big Picture

In this class, we have the task before us of building source code into a binary executable. The picture below illustrates the strategy that we will use.

The program that the user writes is the “program source”, which we translate into a data structure of some kind. We may perform a variety of transformations on that data structure before translating it into a series of assembly instructions, which we can then build into an object file (.o) similar to those produced by clang or gcc.

We will rely upon existing tools to assemble our code into the .o file as well as to translate the .o into an executable. We will also rely upon existing tools to compile a “driver” for the program: the operating-system-specific part of the executable that gets us from the start of our process to the code that we generated. These steps are heavily system-specific and not as critical to this course.

This Activity

In this activity, the first two steps of the above diagram are much simplified:

With these assumptions, we can construct a compilation toolchain in the space of a single lab!

The Driver

In Computer Science, the term driver has taken on several different meanings. In this course, we use a somewhat less common use of the term: a driver is the topmost part of your program that gets everything set up so your code can run. Our driver will be a standard C program with a main function; this allows us to rely on an existing compiler to do the system-specific operations necessary to get our process running. For this, the following driver.c should be sufficient:

#include <stdio.h>

int neonate_main() asm("neonate_main");

int main(int argc, char** argv) {
  int result = neonate_main();
  printf("%d\n", result);
  return 0;
}

The neonate_main declaration tells the C compiler that it will eventually be linked with an assembly-level function named neonate_main. The asm keyword ensures that the compiler will use the name neonate_main for this function at the assembler level.

We can compile this driver into a binary object file:

$ clang -g -c -o driver.o driver.c

Here, the -c flag tells clang to build a .o file instead of an executable. We can think of the .o file as a fragment of a program; it hasn’t yet been turned into a real program, but some amount of source code has been turned into machine code for us. The -g flag instructs the compiler to keep debugging information, which is usually a handy thing to have around when you’re writing software.

If we try to compile the driver by itself, we get an error:

$ clang -g driver.o
driver.o: In function `main':
driver.c:6: undefined reference to `neonate_main'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

This is because neonate_main has not yet been defined. We need to develop a process for taking the “source code” (a file containing an integer) and translating it into machine code for neonate_main. We’ll do that next.

Assembly Language

Next, we are going to

Note that we are writing the assembly program ourselves; this is just to get a trial run together. Later, we will write our compiler in OCaml to generate this assembly code.

For this course, we will generate 32-bit x86 assembly using the so-called “Intel syntax”; you can use this guide by David Evans at the University of Virginia as an x86 assembly language reference. Here is a simple assembly program that defines a neonate_main function to return the value 37:

section .text
global neonate_main
neonate_main:
  mov eax, 37
  ret

Let’s review this code one line at a time:

We can put this file in example.s to serve as an example assembly program for now; we’ll talk about generating it later. Let’s build this example.s into a program object and link it into the wrapper to create an executable.

Building Our Program

There are a variety of assemblers – tools that translate assembly into machine-readable formats – that we could use. For this course, we will use nasm, which is already installed on the department machines. We will use nasm to turn our assembly code (.s) into binary object files (.o) so that we can link our assembly with our wrapper. We can use the following command to build our example above:

$ nasm -f elf32 -o example.o example.s

The -o flag, like with gcc or clang, tells nasm what output file to create. The -f flag tells nasm to build our file in a particular format; here, elf32 stands for the 32-bit Executable and Linkable Format (ELF) which is commonly used by Linux systems. (On Mac OS X, you can instead use -f macho to build a binary object file that Mac OS X can use.) The particular format we pick is based on our operating system and we won’t go into too much detail about it here.

Once we have built example.o, all that remains is to link it with the driver.o we created from our wrapper. If we try to do that in a way that resembles compilation from CS35, we’ll get an error:

$ clang -g -o example example.o driver.o
/usr/bin/ld: i386 architecture of input file `example.o' is incompatible with i386:x86-64 output
clang: error: linker command failed with exit code 1 (use -v to see invocation)

This message is occurring because example.o is in 32-bit executable format (from the elf32 format above) but driver.o is in 64-bit executable format (because this is the default on a 64-bit machine with clang or gcc). We can confirm this by using the program file to figure out the format of the driver.o we are using:

$ file driver.o
driver.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped

We need to build driver.o in 32-bit format to make it compatible with example.o:

$ clang -g -m32 -c -o driver.o driver.c

The -m32 flag here tells the compiler to build a 32-bit .o file. Next, we can link our two 32-bit .o files together:

$ clang -g -m32 -o example.run driver.o example.o

Finally, we can run our program!

$ ./example.run
37

We can confirm our understand of the files we’ve just created by examining them with the file program:

$ file driver.o
driver.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
$ file example.o
example.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped
$ file example.run
example.run: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), dynamically linked, interpreter /lib/ld-linux.so.2, for GNU/Linux 2.6.32, BuildID[sha1]=6eb54bf88ff34626c050f02f9240b5fc1b119357, not stripped

We’ve just used our wrapper and our example assembly code to build a binary executable! The only thing missing is that we had to write our assembly code by hand, but we want a compiler to write the code for us. We’ll do that next.

Neonate: Our First Compiler

Rather than writing example.s by hand, we want a compiler to generate it for us. For this activity, we’ll have the compiler read the file on standard input and write the file to standard output; that way, we don’t need to learn about OCaml’s I/O API yet. We can then instruct the command line to feed the source code into the compiler and capture what it prints.

We can start by writing this skeleton of compiler.ml:

let compile (program : int) : string =
  failwith "TODO"
;;

(* The following is essentially a way of writing a "main" in OCaml. *)
let () =
  (* Get a line from standard input and assume that it is our "program". *)
  let program_str = read_line () in
  (* Convert the input to an integer. *)
  let program = int_of_string program_str in
  (* "Compile" the program into assembly. *)
  let assembly = compile program in
  (* Print the resulting assembly code. *)
  print_endline assembly
;;

A more sophisticated compiler might read from one or more files instead of using standard input to get its program. Likewise, it might write to one or more files instead of printing the resulting assembly. It might even call the assembler directly instead of writing a file to speed up the process. The above code has the advantage of being much simpler.

In the above, the compile function has been left unimplemented. Use what you know about OCaml strings to replace the failwith "TODO" with an expression that generates the right assembly code for your program.

Once you’ve implemented your compile function, we can test it. Create a file 19.neonate containing just the number 19:

$ echo "19" > "19.neonate"

Then, we can run our compiler. We use the < operator in bash to replace the standard keyboard input with the contents of a file (as if you’d copied the file and pasted it into your terminal):

$ ocaml compiler.ml < 19.neonate
section .text
global neonate_main
neonate_main:
  mov eax, 19
  ret

We can capture that output to a file by using the > operator and then assemble it using nasm. Then, we can link the resulting program with our wrapper as we did before:

$ ocaml compiler.ml < 19.neonate > 19.s
$ nasm -f elf32 -o 19.o 19.s
$ clang -g -m32 -c -o driver.o driver.c
$ clang -g -m32 -o 19.run driver.o 19.o
$ ./19.run
19

We can automate the above compilation process using a Makefile. Note that Makefiles use tabs and not spaces; if your editor is configured correctly, it should deal with this for you.

# The following line tells "make" that it should keep our intermediate files.
.SECONDARY:

# To make a .run file, we need the .o file.  We link it with the wrapper.
# The symbol $@ is "the thing we're making", e.g. the .run file
# The symbol $< is "the first thing we need", e.g. the .o file
%.run: %.o driver.o
    clang -g -m32 -o $@ driver.o $<

# To make a .o file, we need the .s file.  We build it using the assembler.
%.o: %.s
    nasm -f elf32 -o $@ $<

# To make a .s file, we need the .neonate file.  We generate it with the
# compiler.
%.s: %.neonate
    ocaml compiler.ml < $< > $@

# An exception to the .o rule above: we build driver.o from driver.c.
driver.o: driver.c
    clang -g -m32 -c -o driver.o driver.c

# It's always good to know how to clean up.
.PHONY: clean
clean:
    rm -f *.run
    rm -f *.o
    rm -f *.s

If your Makefile is correct, you should be able to use it to run all of the necessary commands.

$ make clean
rm -f *.run
rm -f *.o
rm -f *.s
$ make 19.run
ocaml compiler.ml < 19.neonate > 19.s
nasm -f elf32 -o 19.o 19.s
clang -g -m32 -c -o driver.o driver.c
clang -g -m32 -o 19.run driver.o 19.o
$ ./19.run
19

That’s a Compiler!

That’s it! We’ve just built a tool that will take an extraordinarily simple “program” (a document containing an integer) and transform it into something the computer can run. This is a compiler.

There are a few caveats, of course:

In the meantime, you can try making changes to your compiler to see how it behaves. Can you change the input language in a useful way? Can you find a case in which this compiler does not work correctly? Remember to git commit before trying to make changes!

Acknowledgements

Thanks to Abdulaziz Ghuloum for the structure of this activity and to Joe Politz for much of the specific content.