C++ Compilation Process: From Source Code to Object Files

Deep dive into how C++ compilers transform source code through preprocessing, parsing, optimization, and code generation with interactive visualizations.

Abhik Sarkar
25 min

Introduction

When you run g++ main.cpp, the compiler driver carries your source through a chain of transformations that convert human-readable C++ into machine code (and, unless you pass -c, links the result into an executable). This article walks through each stage up to the object file and shows how to inspect what each one produces.

The Compilation Pipeline

C++ Compilation Pipeline

  1. Source Code: main.cpp
  2. Preprocessing: produces main.i
  3. Parsing: produces the AST
  4. Semantic Analysis: produces the annotated AST
  5. IR Generation: produces IR
  6. Optimization: produces optimized IR
  7. Code Generation: produces main.s
  8. Assembly: produces main.o

Typical File Sizes

  • .cpp: ~1KB
  • .i: ~50KB
  • AST: ~20KB
  • IR: ~15KB
  • .s: ~5KB
  • .o: ~2KB

The compilation process consists of several distinct phases, each transforming the code closer to machine language. Let's explore each phase in detail.

Phase 1: Preprocessing

The preprocessor runs first, rewriting your source text before actual compilation begins.

Preprocessing Pipeline

  1. Trigraph Replacement (??=): Legacy character replacement
  2. Line Splicing (\): Join continued lines
  3. Tokenization: Break into tokens
  4. Macro Expansion (#): Replace macros
  5. Include Processing (<>): Insert header files
  6. Conditional Compilation (#if): Process conditionals

Original Code

#define PI 3.14159
#define SQUARE(x) ((x) * (x))

float area = PI * SQUARE(radius);

After Preprocessing

float area = 3.14159 * ((radius) * (radius));

💡 Explanation: Macros are replaced with their definitions. Note the parentheses in SQUARE to ensure correct precedence.
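
To see why those parentheses matter, here is a minimal sketch comparing SQUARE with a hypothetical BAD_SQUARE macro that omits them (BAD_SQUARE and both helper functions are illustrative names, not part of the original example). Because macro expansion is plain text substitution, operator precedence applies to the expanded text, not to the argument as you wrote it.

```cpp
// Hypothetical macro for illustration: no parentheses around x or the result.
#define BAD_SQUARE(x) x * x
// The parenthesized version from the example above.
#define SQUARE(x) ((x) * (x))

// BAD_SQUARE(a + 1) expands to: a + 1 * a + 1
constexpr int bad_square_of_next(int a) { return BAD_SQUARE(a + 1); }

// SQUARE(a + 1) expands to: ((a + 1) * (a + 1))
constexpr int square_of_next(int a) { return SQUARE(a + 1); }

// Checked at compile time: for a == 2 the two expansions disagree.
static_assert(bad_square_of_next(2) == 5, "2 + 1 * 2 + 1 == 5");
static_assert(square_of_next(2) == 9, "(2 + 1) * (2 + 1) == 9");
```

Running g++ -E on this file shows both expansions side by side in the preprocessed output.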

Preprocessor Commands

  • g++ -E main.cpp -o main.i: Preprocess only
  • g++ -dM -E main.cpp: Show all defined macros
  • g++ -H main.cpp: Show included headers

What the Preprocessor Does

  1. Macro Expansion: Replaces each use of a macro with its definition
  2. File Inclusion: Processes #include directives
  3. Conditional Compilation: Evaluates #ifdef, #ifndef, #if
  4. Line Control: Manages #line directives for debugging

Preprocessor Directives

// Macro definition
#define MAX_SIZE 100
#define SQUARE(x) ((x) * (x))

// Conditional compilation
#ifdef DEBUG
#define LOG(msg) std::cout << msg << std::endl
#else
#define LOG(msg)
#endif

// Include guards
#ifndef MYHEADER_H
#define MYHEADER_H
// Header content
#endif

// Pragma directives
#pragma once
#pragma pack(1)
#pragma GCC optimize("O3")

Viewing Preprocessed Output

# GCC/G++
g++ -E main.cpp -o main.i

# Clang
clang++ -E main.cpp -o main.i

# MSVC
cl /P main.cpp

The preprocessed file (.i) is often 10-100x larger than the original due to expanded headers!

Phase 2: Lexical Analysis (Tokenization)

The compiler breaks the preprocessed code into tokens - the smallest meaningful units.

Token Categories

// Keywords
int, class, return, if, while

// Identifiers
variable_name, functionName, ClassName

// Literals
42, 3.14, "string", 'c', true

// Operators
+, -, *, /, =, ==, !=, <<, >>

// Punctuation
;, {, }, (, ), [, ]

// Comments (usually stripped)
// single-line
/* multi-line */
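
As a rough illustration of what a lexer does, the sketch below splits a string into identifier, number, and punctuator tokens. It is a deliberately simplified assumption of how tokenization works, not how GCC or Clang implement it: the names TokenKind, Token, and tokenize are hypothetical, keywords are reported as identifiers, and string literals, comments, floating-point literals, and multi-character operators such as == are not handled.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical token categories for this sketch only.
enum class TokenKind { Identifier, Number, Punctuator };

struct Token {
    TokenKind kind;
    std::string text;
};

// Splits src into tokens, skipping whitespace.
std::vector<Token> tokenize(const std::string& src) {
    std::vector<Token> tokens;
    std::size_t i = 0;
    while (i < src.size()) {
        const unsigned char c = static_cast<unsigned char>(src[i]);
        if (std::isspace(c)) {
            ++i;
            continue;
        }
        const std::size_t start = i;
        if (std::isalpha(c) || c == '_') {
            // Identifier (or keyword): letters, digits, and underscores.
            while (i < src.size() &&
                   (std::isalnum(static_cast<unsigned char>(src[i])) || src[i] == '_')) {
                ++i;
            }
            tokens.push_back({TokenKind::Identifier, src.substr(start, i - start)});
        } else if (std::isdigit(c)) {
            // Integer literal: a run of decimal digits.
            while (i < src.size() && std::isdigit(static_cast<unsigned char>(src[i]))) {
                ++i;
            }
            tokens.push_back({TokenKind::Number, src.substr(start, i - start)});
        } else {
            // Anything else is treated as a single-character punctuator.
            ++i;
            tokens.push_back({TokenKind::Punctuator, src.substr(start, 1)});
        }
    }
    return tokens;
}
```

Calling tokenize("int x = 42;") yields five tokens: int, x, =, 42, and ;. A real lexer would additionally classify int as a keyword and attach a source location to every token.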

Phase 3: Syntax Analysis (Parsing)

The parser constructs an Abstract Syntax Tree (AST) from the token stream.

AST Structure

TranslationUnit
└── FunctionDecl: fibonacci
    ├── BuiltinType: int
    ├── ParmVarDecl: n (int)
    └── CompoundStmt
        ├── IfStmt
        └── ReturnStmt

Source Code

int fibonacci(int n) {
    if (n <= 1) {
        return n;
    }
    return fibonacci(n - 1) + fibonacci(n - 2);
}

Each node in the AST carries type information, its source location, and its relationships to other nodes.

Understanding the AST

The AST represents the hierarchical structure of your program:

// Source code
int add(int a, int b) {
    return a + b;
}

// Simplified AST representation
FunctionDecl: add
├── ReturnType: int
├── Parameters
│   ├── ParmVarDecl: a (int)
│   └── ParmVarDecl: b (int)
└── CompoundStmt
    └── ReturnStmt
        └── BinaryOperator: +
            ├── DeclRefExpr: a
            └── DeclRefExpr: b
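
To make the tree idea concrete, here is a minimal sketch of how the expression part of that AST could be modelled as a data structure. The class names borrow Clang's node names for readability, but the types themselves (Expr, DeclRefExpr, BinaryOperator, make_add_body) are hypothetical and far simpler than any real compiler's AST; the eval method is only there to show that the tree fully captures the meaning of a + b. It assumes C++14 for std::make_unique.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical base class for expression nodes in this sketch.
struct Expr {
    virtual ~Expr() = default;
    // Evaluates the node given a mapping from variable names to values.
    virtual int eval(const std::map<std::string, int>& env) const = 0;
};

// Leaf node: a reference to a declared variable such as `a`.
struct DeclRefExpr : Expr {
    std::string name;
    explicit DeclRefExpr(std::string n) : name(std::move(n)) {}
    int eval(const std::map<std::string, int>& env) const override {
        return env.at(name);  // throws std::out_of_range for unknown names
    }
};

// Interior node: an operator with two child expressions.
struct BinaryOperator : Expr {
    char op;
    std::unique_ptr<Expr> lhs;
    std::unique_ptr<Expr> rhs;
    BinaryOperator(char o, std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
        : op(o), lhs(std::move(l)), rhs(std::move(r)) {}
    int eval(const std::map<std::string, int>& env) const override {
        const int a = lhs->eval(env);
        const int b = rhs->eval(env);
        return op == '+' ? a + b : a - b;  // only + and - in this sketch
    }
};

// Builds the subtree for the expression `a + b` from the example above.
std::unique_ptr<Expr> make_add_body() {
    return std::make_unique<BinaryOperator>(
        '+', std::make_unique<DeclRefExpr>("a"), std::make_unique<DeclRefExpr>("b"));
}
```

Evaluating the tree with a bound to 2 and b bound to 3 gives 5, the same result the compiled function would return.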

Viewing the AST

# Clang AST dump (-fsyntax-only stops after semantic analysis)
clang++ -Xclang -ast-dump -fsyntax-only main.cpp

# GCC AST (via plugin or -fdump-tree options)
g++ -fdump-tree-original main.cpp

Phase 4: Semantic Analysis

The semantic analyzer performs type checking and resolves symbols.

Type Checking

int x = "hello";   // Error: invalid conversion from 'const char*' to 'int'
void* ptr = &x;    // OK: implicit conversion from int* to void*
auto y = x;        // Type deduction: y is int

Name Resolution

namespace A { int x = 1; }
namespace B { int x = 2; }

using namespace A;
int y = x;  // Resolves to A::x

Template Instantiation

template<typename T>
T max(T a, T b) {
    return a > b ? a : b;
}

// Instantiation for int
int result = max(5, 10);  // Creates max<int>
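
The sketch below extends the same idea to show that each distinct type argument produces its own function, and that an instantiation can also be requested explicitly. The template is renamed largest here (a hypothetical name) only to avoid any clash with std::max; use_int and use_string are likewise illustrative helpers.

```cpp
#include <string>
#include <type_traits>

// Same shape as the max template above, under a hypothetical name.
template <typename T>
T largest(T a, T b) {
    return a > b ? a : b;
}

// Explicit instantiation definition: asks the compiler to emit
// largest<double> in this translation unit even if nothing here calls it.
template double largest<double>(double, double);

// Implicit instantiations: the compiler generates one function per T used.
int use_int() { return largest(5, 10); }  // instantiates largest<int>

std::string use_string() {
    // instantiates largest<std::string>
    return largest(std::string("a"), std::string("b"));
}

// Deduction happens at compile time: both arguments are int, so T is int.
static_assert(std::is_same<decltype(largest(1, 2)), int>::value,
              "T is deduced as int");
```

After compiling with g++ -c, running nm -C on the object file should list a separate symbol for each instantiation.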

Phase 5: Intermediate Representation (IR)

Modern compilers convert the AST to an intermediate representation for optimization.

LLVM IR Example

define i32 @add(i32 %a, i32 %b) {
entry:
  %sum = add i32 %a, %b
  ret i32 %sum
}

GCC GIMPLE

add (int a, int b)
{
  int D.2345;

  D.2345 = a + b;
  return D.2345;
}

Phase 6: Optimization

The optimizer transforms the IR to improve performance and reduce size.

Compiler Optimization Passes

Constant Folding

Evaluate constant expressions at compile time

Before Optimization

int calculate() {
    int a = 2 * 3;
    int b = 10 / 2;
    int c = a + b;
    return c * 4;
}

After Optimization

int calculate() {
    return 44;
    // 2*3=6, 10/2=5
    // 6+5=11, 11*4=44
}

Transformation Steps

  1. int a = 2 * 3; becomes int a = 6; (fold 2 * 3 = 6)
  2. int b = 10 / 2; becomes int b = 5; (fold 10 / 2 = 5)
  3. int c = a + b; becomes int c = 11; (fold 6 + 5 = 11)
  4. return c * 4; becomes return 44; (fold 11 * 4 = 44)

  • Speed Gain: +10-25%
  • Code Size: ~0%
  • Compile Time: +20-40%
  • Best For: Math-heavy code

Common Optimization Techniques

1. Constant Folding

// Before
int x = 2 * 3 + 4;

// After
int x = 10;
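
Constant folding is an optimization, so whether it happens depends on the compiler and the optimization level. If you need the result computed at compile time, one option is constexpr combined with static_assert, as in this minimal sketch. It reuses the arithmetic from the calculate() example and assumes C++14 or later, since the function body declares local variables.

```cpp
// constexpr allows this function to be evaluated at compile time
// when it is called in a context that requires a constant expression.
constexpr int calculate() {
    int a = 2 * 3;   // 6
    int b = 10 / 2;  // 5
    int c = a + b;   // 11
    return c * 4;    // 44
}

// static_assert requires a constant expression, so the compiler must
// evaluate calculate() during compilation regardless of -O level.
static_assert(calculate() == 44, "evaluated at compile time");

// Initializing a constexpr variable also forces compile-time evaluation.
constexpr int kResult = calculate();
```

The difference from plain folding is the guarantee: if the expression could not be evaluated at compile time, this code would fail to compile rather than silently fall back to a runtime computation.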

2. Dead Code Elimination

// Before
if (false) {
    expensive_function();
}

// After
// Code removed entirely

3. Loop Unrolling

// Before
for (int i = 0; i < 4; i++) {
    sum += arr[i];
}

// After
sum += arr[0];
sum += arr[1];
sum += arr[2];
sum += arr[3];

4. Inline Expansion

// Before
inline int square(int x) { return x * x; }
int y = square(5);

// After
int y = 5 * 5;  // Function call eliminated

5. Common Subexpression Elimination

// Before
int a = b * c + 10;
int d = b * c + 20;

// After
int temp = b * c;
int a = temp + 10;
int d = temp + 20;

Optimization Levels

# No optimization (fastest compilation)
g++ -O0 main.cpp

# Basic optimization
g++ -O1 main.cpp

# Moderate optimization (recommended)
g++ -O2 main.cpp

# Aggressive optimization
g++ -O3 main.cpp

# Size optimization
g++ -Os main.cpp

# Debug-friendly optimization
g++ -Og main.cpp

Phase 7: Code Generation

The code generator translates optimized IR to assembly language.

Target-Specific Assembly

; x86-64 assembly for add function
add:
    push rbp
    mov  rbp, rsp
    mov  DWORD PTR [rbp-4], edi
    mov  DWORD PTR [rbp-8], esi
    mov  edx, DWORD PTR [rbp-4]
    mov  eax, DWORD PTR [rbp-8]
    add  eax, edx
    pop  rbp
    ret

; ARM assembly
add:
    add r0, r0, r1
    bx  lr

Viewing Assembly Output

# Generate assembly
g++ -S main.cpp -o main.s

# With optimizations visible
g++ -S -O2 -fverbose-asm main.cpp

# Intel syntax (instead of AT&T)
g++ -S -masm=intel main.cpp

Phase 8: Assembly

The assembler converts assembly code to machine code, producing an object file.

Object File Structure (ELF)

Object File Layout

.text Content

55                   push   %rbp
48 89 e5             mov    %rsp,%rbp
48 83 ec 10          sub    $0x10,%rsp
89 7d fc             mov    %edi,-0x4(%rbp)
8b 45 fc             mov    -0x4(%rbp),%eax
89 c6                mov    %eax,%esi
bf 00 00 00 00       mov    $0x0,%edi
e8 00 00 00 00       callq  printf
b8 00 00 00 00       mov    $0x0,%eax
c9                   leave
c3                   ret

Section Properties

  • Type: PROGBITS
  • Flags: r-x
  • Size: 2048 bytes
  • Address: 0x400000
  • Contains compiled machine code
  • Read and execute permissions
  • Loaded into memory as executable
  • Position-dependent or independent code

🔧 Useful Commands

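  • objdump -h file.o: List section headers
  • readelf -S file.o: List section headers in ELF detail
  • nm file.o: List symbols
  • size file.o: Show section sizes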

Object File Contents

  1. Header: File format, architecture, and file type (the entry point is only filled in once the linker produces an executable)
  2. Text Section: Machine code
  3. Data Section: Initialized global variables
  4. BSS Section: Uninitialized (zero-initialized) global variables
  5. Symbol Table: Function and variable names
  6. Relocation Table: Address fix-up information
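
As a rough guide to how source-level definitions map onto those sections, consider the sketch below. All names are hypothetical, and the section placements in the comments describe what GCC and Clang typically do on ELF targets; the exact layout can vary with compiler, flags, and platform.

```cpp
// Initialized global: typically placed in .data.
int initialized_global = 42;

// Global without an initializer: zero-initialized, typically placed in .bss,
// which takes no space in the object file itself.
int zero_global;

// File-local variable: typically in .data, with LOCAL binding in the
// symbol table because of internal linkage.
static int file_local = 7;

// Declaration only: appears as an undefined (UND) symbol if this file
// references it, and is resolved later by the linker.
int defined_elsewhere(int);

// Function definitions: machine code goes into .text.
int add(int a, int b) { return a + b; }

// Uses file_local so the compiler keeps it.
int next_id() { return ++file_local; }
```

Compiling this with g++ -c and running nm -C on the result is a quick way to check which symbols ended up where (T for text, D for data, B for BSS, lowercase letters for local symbols).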

Creating Object Files

# Direct compilation to object file
g++ -c main.cpp -o main.o

# Via assembly
g++ -S main.cpp -o main.s
as main.s -o main.o

Symbol Tables and Name Mangling

C++ uses name mangling to support function overloading and namespaces.

Symbol Table & Name Mangling

Symbol | Description | Type | Binding | Section | Address | Size
main | Program entry point | FUNC | GLOBAL | .text | 0x400526 | 42
std::vector<int>::push_back(int const&) | Template instantiation | FUNC | WEAK | .text | 0x400680 | 89
Calculator::add(int, int) | Class member function | FUNC | GLOBAL | .text | 0x400710 | 24
global_counter | Global variable | OBJECT | GLOBAL | .data | 0x601040 | 4
vtable for Shape | Virtual function table | OBJECT | WEAK | .rodata | 0x400850 | 48
namespace::Utils::format(std::string const&) | Namespace function | FUNC | GLOBAL | .text | 0x400900 | 156
operator new(unsigned long) | External symbol (libc++) | FUNC | GLOBAL | UND | 0x0 | 0
printf | External symbol (libc) | FUNC | GLOBAL | UND | 0x0 | 0

Type:
  • FUNC: Function
  • OBJECT: Variable

Binding:
  • GLOBAL: Visible externally
  • WEAK: Can be overridden
  • LOCAL: File scope only

Section:
  • .text: Code
  • .data: Initialized data
  • .bss: Uninitialized data
  • UND: Undefined (external)

Visibility:
  • DEFAULT: Normal
  • HIDDEN: Not exported
  • PROTECTED: Limited

Name Mangling Examples

// C++ function
void func(int x);         // Mangled: _Z4funci
void func(double x);      // Mangled: _Z4funcd
void func(int x, int y);  // Mangled: _Z4funcii

// Class methods
class MyClass {
    void method();        // Mangled: _ZN7MyClass6methodEv
};

// Namespace functions
namespace NS {
    void func();          // Mangled: _ZN2NS4funcEv
}
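
Mangled names can also be decoded from inside a program. The sketch below wraps abi::__cxa_demangle, which GCC and Clang provide in <cxxabi.h> for the Itanium C++ ABI; it is not available with MSVC, which uses a different mangling scheme. The demangle helper is a hypothetical convenience function, not a standard API.

```cpp
#include <cxxabi.h>  // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)

#include <cstdlib>
#include <memory>
#include <string>

// Returns the human-readable form of an Itanium-mangled symbol name,
// or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    // __cxa_demangle allocates the result with malloc, so release it with free.
    std::unique_ptr<char, void (*)(void*)> result(
        abi::__cxa_demangle(mangled, nullptr, nullptr, &status), std::free);
    return status == 0 ? std::string(result.get()) : std::string(mangled);
}
```

With this helper, demangle("_Z4funci") returns "func(int)" and demangle("_ZN7MyClass6methodEv") returns "MyClass::method()", matching the examples above. This performs the same kind of decoding as the c++filt tool.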

Examining Symbols

# View symbol table
nm main.o

# Demangle C++ symbols
nm main.o | c++filt

# Detailed symbol information
objdump -t main.o

# Show only undefined symbols
nm -u main.o

Compiler Flags Deep Dive

Understanding compiler flags is crucial for controlling the compilation process.

Essential Compilation Flags

# Warning flags
-Wall            # Enable all common warnings
-Wextra          # Enable extra warnings
-Werror          # Treat warnings as errors
-Wpedantic       # Strict ISO C++ compliance

# Debug flags
-g               # Generate debug information
-ggdb            # Generate GDB-specific debug info
-g3              # Maximum debug information

# Optimization flags
-O0              # No optimization
-O1, -O2, -O3    # Increasing optimization levels
-Os              # Optimize for size
-Ofast           # Aggressive optimization (may break standards)

# Language standards
-std=c++11       # C++11 standard
-std=c++17       # C++17 standard
-std=c++20       # C++20 standard

# Architecture flags
-march=native    # Optimize for current CPU
-m32/-m64        # 32-bit or 64-bit code
-mavx2           # Enable AVX2 instructions

# Preprocessor flags
-D MACRO         # Define macro
-U MACRO         # Undefine macro
-I path          # Add include path

# Linker flags
-L path          # Add library path
-l library       # Link with library
-static          # Static linking
-shared          # Create shared library

# Output flags
-o file          # Output file name
-c               # Compile only (no linking)
-S               # Generate assembly
-E               # Preprocess only

# Analysis flags
-ftime-report    # Show compilation time breakdown
-fmem-report     # Show memory usage
-Q               # Show compiler passes

Compilation Performance

Measuring Compilation Time

# Basic timing
time g++ -O2 main.cpp

# Detailed breakdown (GCC)
g++ -ftime-report main.cpp

# Build system timing
make clean && time make -j8

Speeding Up Compilation

  1. Precompiled Headers

# Create precompiled header
g++ -x c++-header -o header.hpp.gch header.hpp

# Use precompiled header
g++ main.cpp -include header.hpp

  2. Parallel Compilation

# Make with parallel jobs
make -j$(nproc)

# CMake parallel build
cmake --build . --parallel

  3. Incremental Compilation

# Use object files for incremental builds
main: main.o utils.o
	g++ main.o utils.o -o main

%.o: %.cpp
	g++ -c $< -o $@

  4. Unity Builds

// unity.cpp - Include all source files
#include "file1.cpp"
#include "file2.cpp"
#include "file3.cpp"
// Compile as single translation unit

Debugging Compilation Issues

Common Compilation Errors

  1. Syntax Errors

int main() {
    int x = 5  // Missing semicolon
    return 0;
}
// error: expected ';' before 'return'

  2. Type Errors

int* ptr = "string";  // Type mismatch
// error: cannot convert 'const char*' to 'int*'

  3. Template Errors

template<typename T>
void func(T t) {
    t.nonexistent();  // Error only on instantiation
}
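
Because the error above only surfaces when func is instantiated, the diagnostic can point deep inside the template body. One way to get a clearer message is to check the requirement up front, as in this sketch. It assumes C++17 for std::void_t; has_nonexistent and Widget are hypothetical names introduced for illustration.

```cpp
#include <type_traits>
#include <utility>

// Primary template: assume T has no callable nonexistent() member.
template <typename T, typename = void>
struct has_nonexistent : std::false_type {};

// Specialization chosen only when the expression t.nonexistent() is valid.
template <typename T>
struct has_nonexistent<T, std::void_t<decltype(std::declval<T&>().nonexistent())>>
    : std::true_type {};

template <typename T>
void func(T t) {
    // Fails at the point of instantiation with a readable message
    // instead of an error buried in the function body.
    static_assert(has_nonexistent<T>::value,
                  "func<T> requires T to provide a nonexistent() member function");
    t.nonexistent();
}

// Hypothetical type that satisfies the requirement.
struct Widget {
    void nonexistent() {}
};
```

Calling func(Widget{}) compiles, while func(42) stops at the static_assert with the custom message. In C++20 the same requirement could be expressed more directly with a concept.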

Compiler Diagnostics

# Verbose error messages
g++ -fdiagnostics-show-template-tree main.cpp

# Colored output
g++ -fdiagnostics-color=always main.cpp

# Show include stack
g++ -H main.cpp

# Show macro expansions
g++ -E -dD main.cpp

Cross-Compilation

Compiling for different target architectures.

# Cross-compile for ARM
arm-linux-gnueabihf-g++ main.cpp -o main.arm

# Specify target triple
clang++ --target=aarch64-linux-gnu main.cpp

# Windows executable on Linux
x86_64-w64-mingw32-g++ main.cpp -o main.exe

Modern Compilation Features

Link-Time Optimization (LTO)

# Enable LTO
g++ -flto -O2 file1.cpp file2.cpp -o program

# With parallel LTO
g++ -flto=auto -O2 *.cpp -o program

Profile-Guided Optimization (PGO)

# Step 1: Compile with profiling
g++ -fprofile-generate main.cpp -o main

# Step 2: Run program to generate profile
./main < typical_input.txt

# Step 3: Recompile with profile data
g++ -fprofile-use main.cpp -o main_optimized

Sanitizers

# Address Sanitizer (memory errors)
g++ -fsanitize=address -g main.cpp

# Undefined Behavior Sanitizer
g++ -fsanitize=undefined main.cpp

# Thread Sanitizer (race conditions)
g++ -fsanitize=thread main.cpp

Compiler Internals

GCC Architecture

  • Frontend: Language-specific parsing
  • Middle-end: GIMPLE optimization
  • Backend: RTL and machine code generation

LLVM/Clang Architecture

  • Clang Frontend: C/C++ parsing
  • LLVM IR: Intermediate representation
  • Optimization Passes: Transform IR
  • Backend: Target-specific code generation

Best Practices

  1. Always enable warnings: Use -Wall -Wextra
  2. Use appropriate optimization: -O2 for release, -Og for debug
  3. Specify language standard: -std=c++17 or newer
  4. Include debug symbols: -g for development builds
  5. Use static analysis: -fanalyzer (GCC 10+)
  6. Enable sanitizers: During development and testing
  7. Profile before optimizing: Use -pg for gprof
  8. Consider LTO: For release builds
  9. Use precompiled headers: For large projects
  10. Leverage build caching: ccache, distcc

Conclusion

Understanding the compilation process helps you:

  • Write more efficient code
  • Debug compilation errors effectively
  • Optimize build times
  • Use compiler features effectively
  • Understand performance implications

The journey from source code to object file involves sophisticated transformations, optimizations, and target-specific code generation. Master these concepts to become a better C++ developer.

