CPU Pipelines & Branch Prediction: Modern Processor Architecture

Explore CPU pipeline stages, instruction-level parallelism, pipeline hazards, and branch prediction through interactive visualizations.

Best viewed on desktop for optimal interactive experience

Understanding CPU Pipelines

Modern CPUs achieve high performance by executing multiple instructions simultaneously through pipelining. Like an assembly line in a factory, different stages of instruction execution happen in parallel, dramatically increasing throughput.

Without pipelining, a CPU would complete one instruction entirely before starting the next. With pipelining, while one instruction is being executed, another can be decoded, and yet another can be fetched—all simultaneously.
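
A quick count makes the gain concrete. With five 1-cycle stages and five instructions:

Non-pipelined: 5 instructions × 5 cycles = 25 cycles
Pipelined:     5 + (5 − 1) = 9 cycles (4 cycles to fill the pipeline, then 1 instruction completes per cycle)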

Interactive CPU Pipeline Demo

Experience how instructions flow through pipeline stages and see the impact of hazards and branch prediction:

[Interactive demo: instructions flow through the five pipeline stages (Instruction Fetch, Instruction Decode, Execute, Memory Access, Write Back) at an adjustable speed. A small instruction memory (ADD, SUB, MUL, AND) feeds the pipeline, and live counters track branch predictions, total cycles, instructions completed, IPC, and stalls/flushes.]

Current Demo: Simple Sequential

Watch how independent instructions flow through the pipeline stages without conflicts. This achieves maximum throughput of 1 instruction per cycle.

The Five Classic Pipeline Stages

1. Instruction Fetch (IF)

  • Fetch instruction from memory
  • Update program counter (PC)
  • Fill instruction queue

2. Instruction Decode (ID)

  • Decode instruction opcode
  • Read register operands
  • Generate control signals

3. Execute (EX)

  • Perform ALU operations
  • Calculate memory addresses
  • Evaluate branch conditions

4. Memory Access (MEM)

  • Load data from memory
  • Store data to memory
  • No operation for ALU instructions

5. Write Back (WB)

  • Write results to registers
  • Update processor state
  • Complete instruction
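
To make the flow concrete, here is a toy C sketch that pushes instructions through the five stages, one stage per cycle. The Pipeline type and advance function are illustrative, not taken from any real simulator:

#include <stdio.h>

#define STAGES 5

static const char *stage_names[STAGES] = {"IF", "ID", "EX", "MEM", "WB"};

// One slot per stage; 0 means the slot is empty (a bubble).
typedef struct { int slot[STAGES]; } Pipeline;

// Advance one cycle: every instruction moves to the next stage,
// and the next instruction (or a bubble) enters IF.
static void advance(Pipeline *p, int next_instr) {
    for (int s = STAGES - 1; s > 0; s--)
        p->slot[s] = p->slot[s - 1];
    p->slot[0] = next_instr;
}

int main(void) {
    Pipeline p = {{0}};
    int n = 5;  // number of instructions to run
    for (int cycle = 1; cycle <= n + STAGES - 1; cycle++) {
        advance(&p, cycle <= n ? cycle : 0);
        printf("cycle %d:", cycle);
        for (int s = 0; s < STAGES; s++)
            if (p.slot[s]) printf("  %s=I%d", stage_names[s], p.slot[s]);
        printf("\n");
    }
    return 0;
}

It deliberately models the ideal case with no hazards or stalls, which is exactly the baseline analyzed next: 5 instructions finish in 5 + 5 − 1 = 9 cycles.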

Pipeline Performance

Ideal Performance

In an ideal pipeline with no hazards:

  • Throughput: 1 instruction per cycle (after initial fill)
  • Speedup: (n × k) / (n + k − 1)
    • Where n = number of instructions, k = pipeline stages
    • Approaches k for large n
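
For example, n = 1000 instructions on a k = 5-stage pipeline give:

Speedup = (1000 × 5) / (1000 + 5 − 1) = 5000 / 1004 ≈ 4.98

already close to the ideal factor of 5.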

Real-World Performance

Actual performance is reduced by:

  1. Pipeline Hazards: 10-30% performance loss
  2. Branch Mispredictions: 10-20 cycle penalty
  3. Cache Misses: 100+ cycle stalls
  4. Dependencies: Reduced instruction-level parallelism

Pipeline Hazards

1. Structural Hazards

Resource conflicts when multiple instructions need the same hardware:

Cycle:  1   2   3   4   5
I1:     IF  ID  EX  MEM WB
I2:         IF  ID  EX  MEM   <- Conflict if single memory port

Solutions:

  • Duplicate resources (separate I-cache and D-cache)
  • Pipeline functional units
  • Arbitration and stalling

2. Data Hazards

Dependencies between instructions:

ADD R1, R2, R3   # R1 = R2 + R3
SUB R4, R1, R5   # R4 = R1 - R5 (needs R1)

Types:

  • RAW (Read After Write): True dependency
  • WAR (Write After Read): Anti-dependency
  • WAW (Write After Write): Output dependency

Solutions:

  • Forwarding/Bypassing: Route results directly
  • Pipeline Stalls: Insert bubble cycles
  • Out-of-Order Execution: Reorder independent instructions
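
To sketch how forwarding detection works in a classic five-stage design (the latch and field names here are illustrative, following the usual EX/MEM and MEM/WB convention):

// Pick the freshest value for source register rs: prefer the
// EX/MEM latch (1 instruction ahead), then MEM/WB (2 ahead),
// then fall back to the register file.
typedef struct { int writes_reg; int rd; int value; } LatchOut;

int read_operand(int rs, const int regfile[],
                 LatchOut ex_mem, LatchOut mem_wb) {
    if (ex_mem.writes_reg && ex_mem.rd == rs)
        return ex_mem.value;   // forward from EX/MEM
    if (mem_wb.writes_reg && mem_wb.rd == rs)
        return mem_wb.value;   // forward from MEM/WB
    return regfile[rs];        // no hazard: normal register read
}

One case forwarding cannot cover: a load followed immediately by a use of its result still costs one stall cycle, because the loaded value only exists at the end of MEM.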

3. Control Hazards

Branch instructions disrupt sequential flow:

if (condition) {
    // Branch taken path
} else {
    // Branch not taken path
}

Without knowing the branch outcome, the pipeline doesn't know which instructions to fetch next.

Branch Prediction

Why Branch Prediction Matters

  • Modern pipelines are 15-20+ stages deep
  • Misprediction penalty = pipeline depth
  • Programs have branches every 5-7 instructions
  • 95%+ accuracy needed for good performance
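
Plugging in typical numbers shows how steep the requirement is: with a branch every 6 instructions, 95% prediction accuracy, and a 15-cycle flush penalty,

extra cycles per instruction ≈ (1/6) × 0.05 × 15 ≈ 0.125

so even at 95% accuracy, mispredictions alone add about 0.125 cycles to every instruction; on a 4-wide core (ideal 0.25 cycles per instruction), that is a drop from 4.0 to roughly 2.7 IPC.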

Types of Branch Predictors

1. Static Prediction

Simple rules without runtime information:

  • Always Not Taken: Continue sequential execution
  • Backward Taken, Forward Not Taken: Loops likely taken
  • Compiler Hints: Use profile-guided optimization

2. Dynamic Prediction

One-Bit Predictor

State: Taken / Not Taken

if (branch_taken != prediction) {
    flip_state();
}

Problem: An alternating pattern (T, N, T, N, ...) mispredicts every branch, because each outcome flips the state just in time to be wrong

Two-Bit Saturating Counter

States:
  11 - Strongly Taken
  10 - Weakly Taken
  01 - Weakly Not Taken
  00 - Strongly Not Taken

More resilient to occasional mispredictions
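
A minimal sketch in C, following the state encoding above:

// 2-bit saturating counter: 0 = strongly not taken ... 3 = strongly taken
typedef struct { unsigned state; } TwoBit;   // state in [0, 3]

int predict(const TwoBit *c) {
    return c->state >= 2;                    // taken if weakly/strongly taken
}

void update(TwoBit *c, int taken) {
    if (taken  && c->state < 3) c->state++;  // saturate at strongly taken
    if (!taken && c->state > 0) c->state--;  // saturate at strongly not taken
}

The extra state is what helps loops: the single not-taken outcome at loop exit only drops the counter from strongly to weakly taken, so the first branch of the next loop run is still predicted correctly.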

Local History (Pattern-Based)

  • Track last N outcomes per branch
  • 2^N entry pattern history table
  • Captures patterns like TTNTTNTTN

Global History (Correlating)

  • Use global branch history register
  • Correlate with other recent branches
  • Captures inter-branch dependencies
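
A well-known concrete design in this family is McFarling's gshare, which XORs low bits of the branch address with the global history register to index a table of two-bit counters. A minimal sketch, reusing the TwoBit counter from above:

#define GSHARE_BITS 12
#define GSHARE_SIZE (1u << GSHARE_BITS)

static TwoBit table[GSHARE_SIZE];   // pattern history table of 2-bit counters
static unsigned ghr;                // global history register, 1 bit per branch

int gshare_predict(unsigned pc) {
    unsigned idx = (pc ^ ghr) & (GSHARE_SIZE - 1);
    return predict(&table[idx]);
}

void gshare_update(unsigned pc, int taken) {
    unsigned idx = (pc ^ ghr) & (GSHARE_SIZE - 1);
    update(&table[idx], taken);
    ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
}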

Tournament Predictor

  • Multiple predictors compete
  • Meta-predictor chooses best
  • Adapts to different code patterns

3. Modern Predictors

TAGE (Tagged Geometric)

  • Multiple tables with different history lengths
  • Geometric series: 1, 2, 4, 8, 16...
  • Tagged entries prevent aliasing

Neural Branch Predictors

  • Perceptron-based learning
  • Weighted sum of history bits
  • Online training during execution
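
A hedged sketch of the perceptron scheme described by Jiménez and Lin: predict from the sign of a weighted sum of history bits, and train only on a misprediction or when the sum's magnitude falls below a threshold. A real predictor keeps a table of perceptrons indexed by branch address; this sketch uses a single one:

#define HIST  16
#define THETA ((int)(1.93 * HIST + 14))   // training threshold from the paper

static int w[HIST + 1];   // w[0] is the bias weight
static int h[HIST];       // outcome history: +1 = taken, -1 = not taken

int perceptron_predict(int *y_out) {
    int y = w[0];
    for (int i = 0; i < HIST; i++)
        y += w[i + 1] * h[i];      // weighted vote of past outcomes
    *y_out = y;
    return y >= 0;                 // predict taken on non-negative sum
}

void perceptron_train(int y, int taken) {   // taken is 0 or 1
    int t = taken ? 1 : -1;
    int mispredicted = (y >= 0) != taken;
    if (mispredicted || (y < 0 ? -y : y) <= THETA) {
        w[0] += t;                 // nudge weights toward the outcome
        for (int i = 0; i < HIST; i++)
            w[i + 1] += t * h[i];
    }
    for (int i = HIST - 1; i > 0; i--)   // shift in the newest outcome
        h[i] = h[i - 1];
    h[0] = t;
}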

Speculative Execution

How It Works

  1. Predict branch direction
  2. Fetch and execute predicted path
  3. Verify prediction when branch resolves
  4. Commit if correct, Flush if wrong

Cycle:  1   2   3   4   5   6   7
BEQ:    IF  ID  EX  MEM WB
I1:         IF  ID  EX  MEM WB       <- Speculative
I2:             IF  ID  EX  MEM WB   <- Speculative
                ^ Branch resolves

Managing Speculation

Reorder Buffer (ROB)

  • Instructions complete out-of-order
  • Commit in program order
  • Maintains precise exceptions

Register Renaming

  • Eliminate false dependencies
  • Map architectural to physical registers
  • Enable more parallelism
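
A schematic before/after in the same register notation as the earlier examples (the physical register names P1, P3, ... are illustrative):

Before renaming, I2's write to R1 must wait for I1's read (a WAR hazard):

ADD R2, R1, R3    # I1 reads R1
SUB R1, R4, R5    # I2 writes R1

After renaming, I2's write gets a fresh physical register and the two can execute in any order:

ADD P6, P1, P3    # I1 reads P1 (the current mapping of R1)
SUB P7, P4, P5    # I2 writes P7; R1 now maps to P7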

Advanced Pipeline Techniques

1. Superscalar Execution

Multiple instructions per cycle:

2-way Superscalar:

Cycle:  1   2   3   4   5
I1:     IF  ID  EX  MEM WB
I2:     IF  ID  EX  MEM WB
I3:         IF  ID  EX  MEM
I4:         IF  ID  EX  MEM

Challenges:

  • Dependency checking complexity
  • Resource conflicts multiply
  • Power consumption increases

2. Out-of-Order Execution

Original: I1 -> I2 (depends on I1) -> I3 (independent)
Executed: I1 -> I3 -> I2

Key Components:

  • Issue Queue: Hold waiting instructions
  • Reservation Stations: Buffer operands
  • Load/Store Queue: Memory disambiguation

3. Very Long Instruction Word (VLIW)

Compiler schedules parallel operations:

VLIW Instruction: [ALU1 | ALU2 | Load | Store | Branch]

Trade-offs:

  • Simpler hardware
  • Complex compiler
  • Code size increase

Performance Optimization Tips

1. Branch-Friendly Code

Predictable Patterns

// Good: Predictable loop
for (int i = 0; i < 1000; i++) {
    array[i] = compute(i);
}

// Bad: Random branches
for (int i = 0; i < n; i++) {
    if (random() & 1) {   // 50% misprediction
        process_a();
    } else {
        process_b();
    }
}

Branch Elimination

// Branchy version
int max(int a, int b) {
    if (a > b) return a;
    else return b;
}

// Branchless version
int max(int a, int b) {
    int diff = a - b;
    int dsgn = diff >> 31;      // arithmetic shift: 0 if a >= b, -1 if a < b
    return a - (diff & dsgn);   // a >= b: a - 0 = a;  a < b: a - (a-b) = b
}

2. Loop Optimizations

Loop Unrolling

// Original
for (int i = 0; i < n; i++) {
    sum += array[i];
}

// Unrolled by 4
for (int i = 0; i < n - 3; i += 4) {
    sum += array[i] + array[i+1] + array[i+2] + array[i+3];
}
// Handle remainder...

Software Pipelining

// Overlap iterations
load(array[0]);
for (int i = 0; i < n - 1; i++) {
    load(array[i + 1]);    // Next iteration
    compute(array[i]);     // Current iteration
    store(result[i]);      // Previous iteration
}

3. Data Dependencies

Break Dependency Chains

// Long dependency chain
sum = a + b + c + d + e + f + g + h;

// Parallel reduction
sum1 = a + b;
sum2 = c + d;
sum3 = e + f;
sum4 = g + h;
sum = (sum1 + sum2) + (sum3 + sum4);

Measuring Pipeline Performance

Hardware Performance Counters

# Linux perf
perf stat -e cycles,instructions,branches,branch-misses ./program

# Output:
#   1,234,567,890  cycles
#   2,345,678,901  instructions     # 1.9 IPC
#     345,678,901  branches
#      12,345,678  branch-misses    # 3.57% of all branches

Key Metrics

  1. Instructions Per Cycle (IPC)

    • Ideal: Pipeline width (e.g., 4 for 4-wide)
    • Typical: 1.0-2.5 for general code
  2. Branch Misprediction Rate

    • Excellent: < 2%
    • Good: 2-5%
    • Poor: > 10%
  3. Pipeline Stall Cycles

    • Data hazards
    • Structural hazards
    • Cache misses

Modern CPU Examples

Intel Golden Cove (12th Gen)

  • 6-wide decode
  • 12 execution ports
  • 512-entry reorder buffer
  • Advanced TAGE predictor

AMD Zen 4

  • 6-wide dispatch
  • 10 execution pipes
  • 320-entry reorder buffer
  • Perceptron branch predictor

Apple M1 Firestorm

  • 8-wide decode
  • 630-entry reorder buffer
  • Massive 192KB L1 I-cache
  • Neural branch predictor

Common Pitfalls

1. Unpredictable Branches

// Sorting improves branch prediction
if (data[i] > threshold) {
    // Random data:   ~50% misprediction
    // After sorting:  ~0% misprediction
}

2. Function Pointers

// Indirect branches are hard to predict
for (int i = 0; i < n; i++) {
    result += handlers[type[i]](data[i]);
}
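
One common mitigation, when items can be grouped by type, is to batch the work so each indirect call site sees a single target (a sketch; handler_fn and NUM_TYPES are assumed names, not from the original code):

// Process items one handler type at a time so the indirect
// call target is stable and easy to predict.
for (int t = 0; t < NUM_TYPES; t++) {
    handler_fn fn = handlers[t];
    for (int i = 0; i < n; i++) {
        if (type[i] == t)
            result += fn(data[i]);
    }
}

This trades extra passes over the data for predictability; sorting the items by type achieves the same effect in a single pass.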

3. Short Loops

// Too short for prediction to adapt
for (int i = 0; i < 3; i++) {
    // Branch predictor hasn't learned the pattern
}

Future Directions

1. Machine Learning Predictors

  • Deep learning for branch prediction
  • Adaptive to workload characteristics
  • Better long-history correlation

2. Security-Hardened Speculation

  • Speculation attacks (Spectre/Meltdown)
  • Secure speculation
  • Performance vs. security trade-offs

3. Heterogeneous Architectures

  • Big.LITTLE designs
  • Different pipeline depths
  • Workload-specific optimization

Conclusion

CPU pipelines are the foundation of modern processor performance, turning sequential instruction execution into a parallel assembly line. Branch prediction enables deep pipelines by speculating on program flow, achieving >95% accuracy through sophisticated algorithms. Understanding these concepts helps write code that works with, not against, the CPU architecture for maximum performance.

If you found this explanation helpful, consider sharing it with others.
