Compilers

Hatchling

A Baby Bird

Source: Wikimedia Commons: by Fir0002/Flagstaffotos (Own work) [CC BY-NC 3.0]

This is a lab activity. It will not be evaluated for a grade, but work on this activity is required for participation. The purpose of this activity is to familiarize you with the process of building a compiler. We will do this by building an extremely simplistic (and not very useful) compilation tool. Let’s begin by cloning a Git repository:

git clone git@github.swarthmore.edu:cs75-s22/hatchling-<username>.git

If you inspect the contents of that repository, you’ll discover… nothing. There’s no code in it; just a .gitignore file. This is intentional: we’re going to build a compiler from scratch. It is nice, though, to have a repository to help you keep track of your history.

The Big Picture

In this class, we have the task before us of building source code into a binary executable. The picture below illustrates the strategy that we will use.

$A diagram of the compiler pipleline described in the following paragraph.$

The program that the user writes is the “program source”, which we translate into a data structure of some kind. We may perform a variety of transformations on that data structure before translating it into a series of assembly instructions. Those assembly instructions are then transformed into a binary object file (.o) or similar to those produced by clang or gcc.

We will rely upon existing tools to assemble our code into the .o file as well as to translate the .o into an executable. We will also rely upon existing tools to compile a “driver” for the program: the operating-system-specific part of the executable that gets us from the start of our process to the code that we generated. We then link that driver object to our program object to produce the final executable. These steps are heavily system-specific and not as critical to this course.

This Activity

In this activity, the first two steps of the above diagram are much simplified:

The “language” that we will compile contains only integers. Each source file will contain a single integer value and the compiled program will simply print that integer before terminating. “Parsing” is therefore just reading an integer from a file.
We will not transform this integer in any way during the compilation process; the step named “transformations” above is a no-op.

With these assumptions, we can construct a compilation toolchain in the space of a single lab!

The Driver

In Computer Science, the term driver has taken on several different meanings. In this course, we use a somewhat less common form of the term: a driver is the topmost part of your program that gets everything set up so your code can run. Our driver will be a standard C program with a main function; this allows us to rely on an existing compiler to do the system-specific operations necessary to get our process running. For this, the following driver.c should be sufficient:

#include <stdio.h>

int hatchling_main() asm("hatchling_main");

int main(int argc, char** argv) {
  int result = hatchling_main();
  printf("%d\n", result);
  return 0;
}

The hatchling_main declaration tells the C compiler that it will eventually be linked with an assembly-level function named hatchling_main. The asm keyword ensures that the compiler will use the name hatchling_main for this function at the assembler level.

We can compile this driver into a binary object file:

$ clang -g -c -o driver.o driver.c

Here, the -c flag tells clang to build a .o file instead of an executable. We can think of the .o file as a fragment of a program; it hasn’t yet been turned into a real program, but some amount of source code has been turned into machine code for us. The -g flag instructs the compiler to keep debugging information, which is usually a handy thing to have around when you’re writing software.

If we try to compile the driver by itself, we get an error:

$ clang -g driver.o
driver.o: In function `main':
driver.c:6: undefined reference to `hatchling_main'
clang: error: linker command failed with exit code 1 (use -v to see invocation)

This is because hatchling_main has not yet been defined. We need to develop a process for taking the “source code” (a file containing an integer) and translating it into machine code for hatchling_main. We’ll do that next.

Assembly Language

Next, we are going to

Write an assembly program to define hatchling_main
Link hatchling_main with driver.c to create an executable

Note that we are writing the assembly program ourselves; this is just to get a trial run together. Later, we will write our compiler in OCaml to generate this assembly code.

For this course, we will generate 64-bit x86 assembly using the so-called “Intel syntax”; you can use this guide by David Evans at the University of Virginia as an x86 assembly language reference. Here is a simple assembly program that defines a hatchling_main function to return the value 37:

section .text
global hatchling_main
hatchling_main:
  mov rax, 37
  ret

Let’s review this code one line at a time:

section .text: The following content (up until the next section) is code.
global hatchling_main: The label hatchling_main should be visible to the rest of the world. This way, the linker can see that symbol in the resulting .o file.
hatchling_main:: This defines the label hatchling_main. When other code jumps to the symbol hatchling_main, that code is jumping here.
mov rax, 37: Put 37 into rax, the register conventionally used for return values in C. Our driver.c call to hatchling_main will look in rax after we return to find the result; this way, it finds our 37.
ret: Return to the caller by performing standard stack operations. The C compiler sets up the stack in a way that ret causes us to return to the point right after hatchling_main was called.

We can put this file in example.s to serve as an example assembly program for now; we’ll talk about generating it later. Let’s build this example.s into a program object and link it into the driver to create an executable.

Building Our Program

There are a variety of assemblers – tools that translate assembly into machine-readable formats – that we could use. For this course, we will use nasm, which is already installed on the department machines. We will use nasm to turn our assembly code (.s) into binary object files (.o) so that we can link our assembly with our driver. We can use the following command to build our example above:

$ nasm -f elf64 -o example.o example.s

The -o flag, like with gcc or clang, tells nasm what output file to create. The -f flag tells nasm to build our file in a particular format; here, elf64 stands for the 64-bit Executable and Linkable Format (ELF) which is commonly used by Linux systems. (On Mac OS X, you can instead use -f macho64 to build a binary object file that Mac OS X can use.) The particular format we pick is based on our operating system and we won’t go into too much detail about it here.

Once we have built example.o, all that remains is to link it with the driver.o we created from our driver. clang has a linker we can use for this purpose just as we have done in the past in courses like CS35:

$ clang -g -o example.run example.o driver.o

Finally, we can run our program!

$ ./example.run
37

We can confirm our understand of the files we’ve just created by examining them with the file program:

$ file driver.o
driver.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), with debug_info, not stripped
$ file example.o
example.o: ELF 64-bit LSB relocatable, x86-64, version 1 (SYSV), not stripped
$ file example.run
example.run: ELF 64-bit LSB executable, x86-64, version 1 (SYSV), dynamically linked, interpreter /lib64/ld-linux-x86-64.so.2, for GNU/Linux 3.2.0, BuildID[sha1]=abcb55b94dd0458b234305e2cb1e725dbf0b2a8b, with debug_info, not stripped

We’ve just used our driver and our example assembly code to build a binary executable! The only thing missing is that we had to write our assembly code by hand, but we want a compiler to write the code for us. We’ll do that next.

Hatchling: Our First Compiler

Rather than writing example.s by hand, we want a compiler to generate it for us. For this activity, we’ll have the compiler read the file on standard input and write the file to standard output; that way, we don’t need to learn about OCaml’s I/O API yet. We can then instruct the command line to feed the source code into the compiler and capture what it prints.

We can start by writing this skeleton of compiler.ml:

let compile (program : int) : string =
  failwith "TODO"
;;

(* The following is essentially a way of writing a "main" in OCaml. *)
let () =
  (* Get a line from standard input and assume that it is our "program". *)
  let program_str = read_line () in
  (* Convert the input to an integer. *)
  let program = int_of_string program_str in
  (* "Compile" the program into assembly. *)
  let assembly = compile program in
  (* Print the resulting assembly code. *)
  print_endline assembly
;;

A more sophisticated compiler might read from one or more files instead of using standard input to get its program. Likewise, it might write to one or more files instead of printing the resulting assembly. It might even call the assembler directly instead of writing a file to speed up the process. The above code has the advantage of being much simpler.

In the above, the compile function has been left unimplemented. Use what you know about OCaml strings to replace the failwith "TODO" with an expression that generates the right assembly code for your program.

Once you’ve implemented your compile function, we can test it. Create a file 19.hatchling containing just the number 19:

$ echo "19" > "19.hatchling"

Then, compile your compiler! Your compiler is just an OCaml program. For such a simplistic program, we can use ocamlc to compile it:

ocamlc compiler.ml -o compiler.run

Next, you can run the compiler. We use the < operator in bash to replace the standard keyboard input with the contents of a file (as if you’d copied the file and pasted it into your terminal):

$ ./compiler.run < 19.hatchling
section .text
global hatchling_main
hatchling_main:
  mov rax, 19
  ret

We can capture that output to a file by using the > operator and then assemble it using nasm. Then, we can link the resulting program with our driver as we did before:

$ ./compiler.run < 19.hatchling > 19.s
$ nasm -f elf64 -o 19.o 19.s
$ clang -g -c -o driver.o driver.c
$ clang -g -o 19.run driver.o 19.o
$ ./19.run
19

We can automate this process using a bash script. A bash script is just a text file that we mark executable which contains terminal commands and, ideally, a preamble to indicate the language it’s written in. Here’s a simple script to accomplish this:

#!/bin/bash -e
# The line above says to run this script using /bin/bash.  The -e causes the
# script to give up if anything goes wrong.

# Get the first argument from the command line.  If there isn't one, complain.
filename="$1"
if [ -z "$filename" ]; then
  echo "You must provide the name of the hatchling file to compile."
  exit 1
fi

# Get rid of the ".hatchling" part of the filename.  This is sloppy but works.
progname="${filename//.hatchling/}"

# Run the commands using these variables to dictate filenames.
./compiler.run < "$filename" > "$progname.s"
nasm -f elf64 -o "$progname.o" "$progname.s"
clang -g -c -o driver.o driver.c
clang -g -o "$progname.run" "driver.o" "$progname.o"

Put the contents of that script in a filenamed compile.sh. Then run

$ chmod +x ./compile.sh

to mark the file executable. Once you’ve done this, you can compile your program just by running the script:

$ ./compile.sh 19.hatchling
$ ./19.run
19

That’s a Compiler!

That’s it! We’ve just built a tool that will take an extraordinarily simple “program” (a document containing an integer) and transform it into something the computer can run. This is a compiler.

There are a few caveats, of course:

This compiler handles an extremely simple language. We will discuss how to generate assembly from an increasingly larger set of features throughout the course.
This compiler does not handle operating-system-specific details like the internals of the object files; we’ve left that to our assembler and linker. You’re welcome to read more about the Executable and Linkable Format, but we won’t talk a lot about it in this course. If you’re curious to see what is being generated, try using objdump on the .o files (e.g. objdump -t 19.o or objdump -t driver.o).
This compiler is using standard input and standard output to handle files. For our future compiler work in this course, we’ll read and write files directly. Since I/O is fairly uninteresting, however, most of this code will be provided for you so you can focus on the compilation part. Doing more complex things here is simply a matter of engineering.

In the meantime, you can try making changes to your compiler to see how it behaves. Can you change the input language in a useful way? Can you find a case in which this compiler does not work correctly? Remember to git commit before trying to make changes!

Acknowledgements

Thanks to Abdulaziz Ghuloum and Joe Politz for some of the structure and content of this activity.