Tools for examining different phases of compiling and running a C program

This page provides information about the different phases of compiling and running a C (or C++) program and tools that can be used to examine the results of these different phases.

The different phases examined below are:

  1. the preprocessor (expands # directives such as #include and #define)
  2. the compiler (produces .s or .o files)
  3. the assembler (produces .o files from .s files)
  4. the link editor (produces a.out files)
  5. the runtime linker (loads and links shared libraries used by a.out)

The tools used below to examine compiler output include file, cpp, objdump, gdb, nm, and ldd.

More information and examples using some of these tools to examine .o and a.out files (hexdump, strings, objdump, gdb).


The following program is used as an example below (it is also available in ~newhall/public/cs75/compilecycle/ with a Makefile for building .o and executable files):
// simple.c:
#include <stdio.h>    // sprintf
#include <string.h>   // strlen
#include <unistd.h>   // write

#define MAX  10

int foo(int y);

int main() {

  int x, i;
  char buf[10];

  for(i=0; i < MAX; i++) {
    x = foo(i);
    // a crazy way to print to stdout
    sprintf(buf, "%d", x);
    write(1, buf, strlen(buf));   // file descriptor 1 is stdout
    buf[0] = '\n';
    write(1, buf, 1);
  }
  return 0;
}

int foo(int y) {
  return y*y;
}

The Unix file command can be used to find out information about the type of a file. For example:
# the C source file:
#
$ file simple.c
  simple.c: ASCII C program text

# the object file: contains relocatable machine code
#   ELF: stands for Executable and Linking Format, and is the format for
#        .o, a.out, and .so files produced by gcc.  The format is necessary
#        so that programs that process these files, and the OS, know how
#        to find different parts of the code and data in this file
#   Intel 80386: is the target architecture 
#   not stripped: means that this .o file includes a symbol table
#
$ file simple.o
  simple.o: ELF 32-bit LSB relocatable, Intel 80386, version 1 (SYSV), not stripped

# the executable file: 
#
$ file simple
  simple: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for 
  GNU/Linux 2.6.8, dynamically linked (uses shared libs), not stripped

# a shared object file (dynamically linked library): 
#  
$ file /lib/libc-2.7.so 
  /lib/libc-2.7.so: ELF 32-bit LSB shared object, Intel 80386, version 1 (SYSV),
  for GNU/Linux 2.6.8, stripped

  1. The preprocessor

    The C preprocessor (cpp) is the first part of the compiler to run. You can run cpp directly on simple.c, or you can run gcc with the -E flag to run just the preprocessor part of the compiler. The preprocessor expands #include directives (replacing them with the contents of the .h file), expands #define directives (replacing each use of a macro or constant with its definition), and evaluates #if directives (which determine conditional inclusion):
    # run cpp: 
    $ cpp simple.c | less
    
    # run just the preprocessor part of gcc:
    $ gcc -E simple.c | less
    
    # look at the output to see what happens to #includes and #defines from simple.c
    
    Here is a very detailed reference about the preprocessor.
  2. The core compiler

    The core of the C compiler translates the preprocessor's C source code output to machine-specific assembly code (an .s file). Some core compilers may directly translate to relocatable binary machine code (a .o file) instead of to assembly code (assembly code is approximately a human-readable form of machine code).

    This phase does the bulk of the compilation work, translating a program written in the C high-level programming language to low-level instructions for a specific instruction set architecture (ISA). A processor microarchitecture that implements this ISA can execute these instructions. For example, both Intel and AMD have processors that can execute the IA32 ISA.

    Use the -S option to gcc to produce a .s file:

    $ gcc -S simple.c
    
    This creates a text file, simple.s, of the C to assembly code translation. The simple.s file can be viewed using a text editor:
    $ vim simple.s
    

    Use the -c option to gcc to produce a .o file:

    $ gcc -c simple.c
    
    You can see the assembly code in simple.o using either objdump or gdb (all addresses are listed in hexadecimal, base 16):
    $ objdump -d simple.o
      ...
      00000000 <main>:
         0: 8d 4c 24 04             lea    0x4(%esp),%ecx
         4: 83 e4 f0                and    $0xfffffff0,%esp
         7: ff 71 fc                pushl  -0x4(%ecx)
         a: 55                      push   %ebp
         b: 89 e5                   mov    %esp,%ebp
         d: 51                      push   %ecx
         e: 83 ec 34                sub    $0x34,%esp
        11: 65 a1 14 00 00 00       mov    %gs:0x14,%eax
        17: 89 45 f8                mov    %eax,-0x8(%ebp)
        1a: 31 c0                   xor    %eax,%eax
        ...
    $ gdb simple.o
     (gdb) disass main
     (gdb) disass foo
     (gdb) quit
    
  3. The Assembler

    If the core compiler produces assembly code (vs. relocatable object code), then the Assembler part of the compiler runs next to translate the assembly code to relocatable object code. This step is a very simple translation of the core compiler's assembly code output to its corresponding binary machine code equivalent. Some ISAs may have a handful of assembly instructions that are implemented by a sequence of two or more machine code instructions, but this is mostly a simple 1-1 mapping of assembly to machine instructions.
  4. The link editor

    The link editor creates an executable file (a.out file) from one or more .o files and .a or .so files (static or dynamic libraries):
    # create an executable file from simple.o and some standard libraries that gcc automatically links in:
    $ gcc -o simple simple.o
    

    Disassembling Executable Code:

    You can use objdump (or the disass command in gdb) to disassemble the code in the executable (simple) and see how it differs from the code in simple.o (look at the call instructions):
    $ objdump -d simple
     ...
    08048434 <main>:
     8048434: 8d 4c 24 04             lea    0x4(%esp),%ecx
     8048438: 83 e4 f0                and    $0xfffffff0,%esp
     804843b: ff 71 fc                pushl  -0x4(%ecx)
     804843e: 55                      push   %ebp
     804843f: 89 e5                   mov    %esp,%ebp
     8048441: 51                      push   %ecx
     8048442: 83 ec 34                sub    $0x34,%esp
     8048445: 65 a1 14 00 00 00       mov    %gs:0x14,%eax
     804844b: 89 45 f8                mov    %eax,-0x8(%ebp)
     804844e: 31 c0                   xor    %eax,%eax
     8048450: c7 45 e4 00 00 00 00    movl   $0x0,-0x1c(%ebp)
     8048457: eb 6d                   jmp    80484c6
     ...

    Viewing the Symbol Table:

    Use nm (or objdump -t) to list the symbol table of an a.out or .so file:
    $ nm --format sysv simple	# system V format is easier to read than bsd format which is the default
    
    Name                  Value   Class        Type         Size     Line  Section
    
    ...
    foo                 |080484e6|   T  |              FUNC|0000000c|     |.text
    frame_dummy         |08048410|   t  |              FUNC|        |     |.text
    main                |08048434|   T  |              FUNC|000000b2|     |.text
    p.5841              |080496dc|   d  |            OBJECT|        |     |.data
    sprintf@@GLIBC_2.0  |        |   U  |              FUNC|00000034|     |*UND*
    strlen@@GLIBC_2.0   |        |   U  |              FUNC|000000af|     |*UND*
    write@@GLIBC_2.0    |        |   U  |              FUNC|00000076|     |*UND*
    
    Section *UND* means that the symbol is undefined in this file: it comes
    from a .so file that will be loaded at runtime. Section .text means that
    the symbol is in the .text section (the code section) of the executable
    file. Class T and t are functions, D and d are data (global variables),
    R is read-only data, and U is an undefined symbol. An uppercase class
    letter marks a global symbol and a lowercase letter a local one. The
    Value column gives the address of the function or data.
    
  5. The runtime linker and dynamically linked libraries:

    The runtime linker loads and links shared object files (dynamically linked library code) used by the a.out at runtime. Calls from the a.out to library functions are bound to functions from library shared object files, loaded at runtime into the process' address space.

    Listing shared object dependencies and the dynamic symbol table:

    ldd lists the shared object dependencies of an a.out or .so file (i.e., which shared objects must be loaded at runtime to run the a.out or to load the .so):
    $ ldd simple
            linux-gate.so.1 =>  (0xb7ef2000)
            libc.so.6 => /lib/tls/i686/cmov/libc.so.6 (0xb7d8a000)
            /lib/ld-linux.so.2 (0xb7ef3000)
    
    Use objdump -T to see dynamic symbol table entries from a .so file (here we are just finding the one for write):
    $ objdump -T /lib/libc.so.6 | grep write     
    
    000b6ab0  w   DF .text  00000076  GLIBC_2.0   write
    

    Runtime linker steps and viewing them at runtime:

    The runtime linker sets entries in the PLT (procedure linkage table) and/or the GOT (global offset table) at runtime to bind variables and functions to their locations in shared objects (dynamically linked libraries) that are loaded at runtime. What exactly is done depends on the format of the a.out file and the underlying OS/Arch.

    If you do objdump -d simple you can see that the call to write in main is a call into the .plt section of the a.out (which contains the PLT):

    08048434 <main>:
     ...
     804849e: e8 c9 fe ff ff          call   804836c <write@plt>

    Disassembly of section .plt:
     ...
    0804836c <write@plt>:
     804836c: ff 25 c4 96 04 08       jmp    *0x80496c4
     8048372: 68 10 00 00 00          push   $0x10
     8048377: e9 c0 ff ff ff          jmp    804833c <_init+0x30>

    The jmp *0x80496c4 instruction jumps to the address stored at 0x80496c4, which is an entry in the Global Offset Table (GOT); the GOT itself starts at address 0x80496b0. The value in this GOT entry is filled in at runtime by the dynamic linker.

    To see what this value is set to at runtime, disassemble instructions in gdb:

    1. set a breakpoint at the call to write in main, run the program, and continue:
      $ gdb simple
      (gdb) break *0x0804849e
      (gdb) run
      (gdb) cont
      (gdb) disass main
      ...
      0x0804849e <main+106>:  call   0x804836c <write@plt>
      ...
      
    2. disassemble the PLT entry that is called from main (the call to write in libc.so):
      
      
      (gdb) disass 0x804836c
      Dump of assembler code for function write@plt:
      0x0804836c <write@plt+0>:       jmp    *0x80496c4
      0x08048372 <write@plt+6>:       push   $0x10
      0x08048377 <write@plt+11>:      jmp    0x804833c <_init+48>
      
      
    3. disassemble instructions around 0x80496c4 just to see that the jmp target is stored in a location in the GOT (ignore the disassembled "instructions" in the GOT: the GOT stores jump target addresses not instructions, so the disassembled target addresses have no meaning):
      (gdb) disass 0x80496c4
      Dump of assembler code for function _GLOBAL_OFFSET_TABLE_:
      0x080496b0 <_GLOBAL_OFFSET_TABLE_+0>:   fcoml  0x66680804(%ebp)
      0x080496b6 <_GLOBAL_OFFSET_TABLE_+6>:   icebp  
      0x080496b7 <_GLOBAL_OFFSET_TABLE_+7>:   mov    $0x30,%bh
      0x080496b9 <_GLOBAL_OFFSET_TABLE_+9>:   fdiv   %st,%st(0)
      0x080496bb <_GLOBAL_OFFSET_TABLE_+11>:  mov    $0xb0,%bh
      0x080496bd <_GLOBAL_OFFSET_TABLE_+13>:  xchg   %eax,%ebx
      0x080496be <_GLOBAL_OFFSET_TABLE_+14>:  fnsave 0x8048362(%edi)
      0x080496c4 <_GLOBAL_OFFSET_TABLE_+20>:  rclb   -0x7c8f481b(%edx)
      0x080496ca <_GLOBAL_OFFSET_TABLE_+26>:  fidivl -0x481fbdb0(%edi)
      0x080496d0 <_GLOBAL_OFFSET_TABLE_+32>:  mov    %al,0x80483
      
    4. print out the value stored in the GOT for write (the GOT entry is at address 0x80496c4, and it contains the address of the write function, 0xb7e592d0):
      (gdb) print/x *0x80496c4
      $2 = 0xb7e592d0
      
    5. now try disassembling code around address 0xb7e592d0 to see code from the write function from libc.so:
      (gdb) disass 0xb7e592d0
      Dump of assembler code for function write:
      0xb7e592d0 <write+0>:   cmpl   $0x0,%gs:0xc
      0xb7e592d8 <write+8>:   jne    0xb7e592fc <write+44>
      0xb7e592da <write+10>:  push   %ebx
      0xb7e592db <write+11>:  mov    0x10(%esp),%edx
      0xb7e592df <write+15>:  mov    0xc(%esp),%ecx
      0xb7e592e3 <write+19>:  mov    0x8(%esp),%ebx
      ...
      

    Here is some more information about readelf, objdump, and other tools: