C/C++ Compilation Model

Consider building the following C++ program with g++ or clang++:

// fizzbuzz.cpp
#include <iostream>

#define FIZZBUZZ "FizzBuzz"
#define FIZZ "Fizz"
#define BUZZ "Buzz"

int main(int argc, char *argv[])
{
    for (unsigned int i = 0; i <= 100; ++i) {
        if (i % 15 == 0) {
            std::cout << FIZZBUZZ << " ";
        } else if (i % 3 == 0) {
            std::cout << FIZZ << " ";
        } else if (i % 5 == 0) {
            std::cout << BUZZ << " ";
        } else {
            std::cout << i << " ";
        }
    }

    return 0;
}

When you run the command clang++ fizzbuzz.cpp -o fizzbuzz, you get an executable, fizzbuzz, that you can run. Here clang has abstracted away the steps involved in going from a source (.cpp) file to an executable.

At a high level, there are four main stages involved in going from the source code to the executable: preprocessing, compiling, assembling and linking. They are shown in the figure below:

Fig 1: C++ Compilation Stages

Note that the stages above are a very high-level view; if you need the complete steps in detail, this would be a good place to start.

Now, let us look at each of the steps shown in the figure in more detail.

Preprocessing

This stage takes the source file (.cpp) as input and performs macro expansion, expansion of included files and handling of the other preprocessor directives.

You can run the preprocessor alone using clang like:

clang++ -E fizzbuzz.cpp

This command prints the preprocessed form of the source file, which is what the compilation stage consumes. The preprocessor expands the #include directives, performs textual substitution for the #define macros and processes the other directives. The output is then passed to the compilation stage. For example, in our program above, the macros used in the std::cout statements are replaced with their string values, as shown below:

// Preprocessed file output (tail only; the expanded contents of <iostream> precede it)
int main(int argc, char *argv[])
{
    for (unsigned int i = 0; i <= 100; ++i) {
        if (i % 15 == 0) {
            std::cout << "FizzBuzz" << " ";
        } else if (i % 3 == 0) {
            std::cout << "Fizz" << " ";
        } else if (i % 5 == 0) {
            std::cout << "Buzz" << " ";
        } else {
            std::cout << i << " ";
        }
    }

    return 0;
}

Note that the preprocessing stage increases the size of the input passed to the compiler. This is because the contents of each included file are substituted textually in place of its #include directive, so it is advisable to include only the headers that your program actually uses. Also, the clang command above only prints the output of the preprocessing stage. If you are building your executable directly, you do not need to worry about this, since clang takes care of passing the intermediate results from one stage to the next for you.
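Because macro expansion is purely textual, it helps to read macro uses the way the preprocessor leaves them. The sketch below reuses FIZZ and BUZZ from fizzbuzz.cpp; DOUBLE_UNSAFE, DOUBLE_SAFE and the three functions are hypothetical names added only for this illustration.

```cpp
#include <string>

// Same macros as in fizzbuzz.cpp.
#define FIZZ "Fizz"
#define BUZZ "Buzz"

// Hypothetical macros, introduced only for this illustration.
#define DOUBLE_UNSAFE(x) x + x
#define DOUBLE_SAFE(x) ((x) + (x))

// After preprocessing this reads: return "Fizz" "Buzz";
// Adjacent string literals are then joined into "FizzBuzz".
std::string joined() { return FIZZ BUZZ; }

// After preprocessing this reads: return 3 * 2 + 2;
// The multiplication binds first, so the result is 8, not 12.
int unsafe_result() { return 3 * DOUBLE_UNSAFE(2); }

// After preprocessing this reads: return 3 * ((2) + (2));
// The parentheses in the macro body preserve the intended grouping.
int safe_result() { return 3 * DOUBLE_SAFE(2); }
```

Running clang++ -E on a file containing these lines should show the substituted text directly, which is a quick way to debug a macro that misbehaves.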

Compiling

This stage takes the output of the preprocessing stage and compiles it to assembly. It usually translates the tokens into a parse tree before continuing, checks the declarations and definitions, and verifies that the code is well formed. If all the checks succeed, it creates a file that contains the assembly instructions for the program.

You can run the compiler to generate the assembly using clang like:
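One way to watch these checks happen is to give the compiler something it must verify during compilation. The sketch below is not part of fizzbuzz.cpp; Kind and classify are hypothetical names, and the approach (constexpr plus static_assert) is just one illustration of compile-time checking.

```cpp
// Hypothetical classification of a number, mirroring the FizzBuzz rules.
enum class Kind { Number, Fizz, Buzz, FizzBuzz };

// constexpr allows the compiler to evaluate this during compilation.
// A single return expression keeps it valid even under C++11.
constexpr Kind classify(unsigned int i) {
    return i % 15 == 0 ? Kind::FizzBuzz
         : i % 3 == 0  ? Kind::Fizz
         : i % 5 == 0  ? Kind::Buzz
                       : Kind::Number;
}

// These are checked by the compiler; a false condition stops the build.
static_assert(classify(30) == Kind::FizzBuzz, "30 is divisible by 15");
static_assert(classify(9) == Kind::Fizz, "9 is divisible by 3 only");
static_assert(classify(10) == Kind::Buzz, "10 is divisible by 5 only");
static_assert(classify(7) == Kind::Number, "7 is divisible by neither");

// Uncommenting the next line should make compilation fail, because a
// scoped enum does not convert implicitly from an integer:
// Kind k = 3;
```

If any of these checks fails, the compiler reports an error and no assembly file is produced.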

clang++ -S fizzbuzz.cpp

Usually, this stage is also responsible for optimizing the generated code. The output of the compilation stage is an assembly file with a .s extension (fizzbuzz.s here). You can inspect it; it looks something like the following:

	.text
	.file	"fizzbuzz.cpp"
	.section	.text.startup,"ax",@progbits
	.p2align	4, 0x90         # -- Begin function __cxx_global_var_init
	.type	__cxx_global_var_init,@function
__cxx_global_var_init:                  # @__cxx_global_var_init
	.cfi_startproc

    ......
    ......

main:                                   # @main
	.cfi_startproc
# %bb.0:
	pushq	%rbp
	.cfi_def_cfa_offset 16
	.cfi_offset %rbp, -16
	movq	%rsp, %rbp
	.cfi_def_cfa_register %rbp
	subq	$80, %rsp
	movl	$0, -4(%rbp)
	movl	%edi, -8(%rbp)
	movq	%rsi, -16(%rbp)
	movl	$0, -20(%rbp)
.LBB1_1:                                # =>This Inner Loop Header: Depth=1
	cmpl	$100, -20(%rbp)
	ja	.LBB1_13
# %bb.2:                                #   in Loop: Header=BB1_1 Depth=1

As seen above, the file contains comments that help in understanding what the assembly is doing. The assembly instructions provide a human-readable representation of the machine code that will be generated. Note that you can also edit the generated instructions in the .s file and have clang assemble it into object code that matches your changes.
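To connect the listing back to the source, the loop header can be read as the C++ sketch below. The mapping in the comments is an interpretation of the listing (in particular, that -20(%rbp) holds i and that .LBB1_13 is the loop exit), and count_iterations is a hypothetical name used only for illustration.

```cpp
// Mirrors the control flow of the loop header using explicit jumps.
// It counts how many times the loop body would run.
unsigned int count_iterations() {
    unsigned int i = 0;          // movl $0, -20(%rbp)
    unsigned int n = 0;          // not in the listing; counts iterations
loop_header:                     // .LBB1_1
    if (i > 100) goto done;      // cmpl $100, -20(%rbp) ; ja .LBB1_13
    ++n;                         // stands in for the loop body
    ++i;
    goto loop_header;
done:
    return n;
}
```

Note that ja (jump if above) is the unsigned form of the comparison, which matches the unsigned int loop variable; the source condition i <= 100 shows up inverted, as "leave the loop when i is above 100".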

Assembling

This stage takes the assembly file (.s) generated in the earlier step and generates the corresponding machine code. The output of this stage is an object file (.o) that can be used by the linker.

You can run the assembler using clang like:

clang++ -c fizzbuzz.s

You cannot directly read the machine code since it is stored in a binary format (ELF on Linux), but you can use tools to make sense of it. For example, on Linux you can use the nm command to list the symbols in the object file (.o) generated in this step. The -C option demangles C++ symbol names into a readable form. The output of nm for our object file would look something like:

$ nm -C fizzbuzz.o
                 U __cxa_atexit
0000000000000000 t __cxx_global_var_init
                 U __dso_handle
0000000000000050 t _GLOBAL__sub_I_fizzbuzz.cpp
0000000000000000 T main
                 U std::ostream::operator<<(unsigned int)
                 U std::ios_base::Init::Init()
                 U std::ios_base::Init::~Init()
                 U std::cout
0000000000000000 b std::__ioinit
                 U std::basic_ostream<char, std::char_traits<char> >& std::operator<< <std::char_traits<char> >(std::basic_ostream<char, std::char_traits<char> >&, char const*)

In the above, symbols marked with U are undefined in this object file; the linker resolves them in the next stage against other object files or libraries.
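The letters in the nm output follow from how each name is declared in the source. The sketch below is a separate, hypothetical translation unit (twice, twice_plus_one and report are made-up names), and the letters mentioned in the comments are what nm typically reports for an unoptimized build, not guaranteed output.

```cpp
#include <cstdio>

// Internal linkage: visible only inside this translation unit.
// nm typically lists such a function with a lowercase t.
static int twice(int x) { return 2 * x; }

// External linkage: other object files may refer to this function.
// nm typically lists it with an uppercase T.
int twice_plus_one(int x) { return twice(x) + 1; }

// std::printf is only declared by <cstdio>; its definition lives in the
// C library. The object file therefore carries an undefined (U)
// reference to it, which the linker resolves later.
void report(int x) { std::printf("%d\n", twice_plus_one(x)); }
```

Compiling this with clang++ -c and running nm -C on the result is a simple way to check which symbols an object file defines and which it expects the linker to supply.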

Linking

This stage takes all the object files and libraries needed by the program and generates the final executable that will be run. It can also be used to generate a shared object (.so) that can be dynamically linked with other programs.

The linking is done by a separate tool, the linker (typically ld on Linux), which produces the final output file. On Linux, the executable is in the ELF format; other platforms use different formats. For a dynamically linked executable, the OS uses the dynamic linker at run time to resolve the symbols that live in shared libraries and to execute the code behind them.

And as always, you can go from the source to the final executable/library using the command:

clang++ fizzbuzz.cpp -g -o fizzbuzz

Also, you can disassemble the binary executable using objdump. With the -S option, the disassembly is interleaved with the source code, provided the executable was compiled with debug symbols (the -g flag), as above:

objdump -S fizzbuzz

That’s it. I hope this post gave a brief overview of the stages involved in going from the source to the final executable/library. For any discussion, tweet here.

[1] https://clang.llvm.org/
[2] https://www.llvm.org/