Hazard Detection: Pipeline Dependencies and Solutions

Understanding Pipeline Hazards

Pipeline hazards are situations that prevent the next instruction from executing during its designated clock cycle. They're the primary obstacles to achieving ideal pipeline performance and require sophisticated hardware mechanisms to detect and resolve.

Modern processors dedicate significant silicon area to hazard detection and resolution, making it one of the most critical aspects of CPU design.

RAW (Read After Write) hazards occur when an instruction needs a value that a previous instruction will write. These are true dependencies that cannot be eliminated, only mitigated through forwarding or stalling.

Types of Pipeline Hazards

1. Structural Hazards

Resource conflicts when hardware cannot support all possible combinations of instructions.

Cycle:  1   2   3   4   5   6   7   8
Load:   IF  ID  EX  MEM WB
Add:        IF  ID  EX  MEM WB
Store:          IF  ID  EX  MEM WB
Next:               IF  ID  EX  MEM WB
                    ^^^
Cycle 4: Load (MEM) and Next (IF) both need the single memory port

Common Structural Hazards:

Single memory port (instruction + data access)
Single ALU (computation + address calculation)
Single register file write port
Limited functional units

Solutions:

Duplicate resources (separate I-cache/D-cache)
Pipeline functional units
Multiple register file ports
Resource arbitration
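
To make the single-memory-port case concrete, the following Python sketch lists the cycles in which two instructions would need the same port. It assumes a classic five-stage pipeline with no stalls, where instruction i enters IF in cycle i + 1; the function name `memory_port_conflicts` and the opcode strings are illustrative, not part of any real tool.

```python
def memory_port_conflicts(ops, stages=("IF", "ID", "EX", "MEM", "WB")):
    """Return {cycle: users} for cycles where the single memory port is oversubscribed.

    Assumption: instruction i (0-based) enters IF in cycle i + 1 and never stalls.
    Every instruction uses the port in IF; loads and stores use it again in MEM.
    """
    users = {}  # cycle -> list of (instruction index, stage)
    for i, op in enumerate(ops):
        users.setdefault(i + 1 + stages.index("IF"), []).append((i, "IF"))
        if op in ("LOAD", "STORE"):
            users.setdefault(i + 1 + stages.index("MEM"), []).append((i, "MEM"))
    return {cycle: u for cycle, u in sorted(users.items()) if len(u) > 1}


print(memory_port_conflicts(["LOAD", "ADD", "STORE", "SUB"]))
# {4: [(0, 'MEM'), (3, 'IF')]}
```

The load's MEM stage collides with the fetch of the instruction three slots behind it, which is why separate instruction and data caches are the usual remedy.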

2. Data Hazards

Dependencies between instructions involving registers or memory.

RAW (Read After Write) - True Dependency

ADD R1, R2, R3    # R1 = R2 + R3
SUB R4, R1, R5    # R4 = R1 - R5 (needs R1)

The most common and problematic type - cannot be eliminated, only mitigated.

WAR (Write After Read) - Anti-dependency

ADD R1, R2, R3    # R1 = R2 + R3 (reads R2)
SUB R2, R4, R5    # R2 = R4 - R5 (writes R2)

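Harmless in a simple in-order pipeline, where the earlier instruction reads its operands before the later one writes. It becomes a hazard only when instructions can execute out of order; register renaming eliminates it.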

WAW (Write After Write) - Output Dependency

ADD R1, R2, R3    # R1 = R2 + R3
SUB R1, R4, R5    # R1 = R4 - R5 (overwrites R1)

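Becomes a hazard only when writes can complete out of program order, as in out-of-order execution or in-order pipelines with variable-latency functional units. Register renaming eliminates it.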
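
The three dependency types can be told apart purely from the register names involved. The sketch below is a minimal Python classifier; the `Instr` tuple and `classify_hazards` are hypothetical helpers introduced for illustration, and the inputs are the three instruction pairs shown above.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    op: str
    dest: str
    srcs: tuple


def classify_hazards(first: Instr, second: Instr) -> set:
    """Return the dependency types that `second` has on the earlier `first`."""
    hazards = set()
    if first.dest in second.srcs:
        hazards.add("RAW")  # second reads what first writes
    if second.dest in first.srcs:
        hazards.add("WAR")  # second writes what first reads
    if first.dest == second.dest:
        hazards.add("WAW")  # both write the same register
    return hazards


add = Instr("ADD", "R1", ("R2", "R3"))
print(classify_hazards(add, Instr("SUB", "R4", ("R1", "R5"))))  # {'RAW'}
print(classify_hazards(add, Instr("SUB", "R2", ("R4", "R5"))))  # {'WAR'}
print(classify_hazards(add, Instr("SUB", "R1", ("R4", "R5"))))  # {'WAW'}
```

Whether a detected dependency actually becomes a hazard depends on the pipeline: the classifier only reports that an ordering constraint exists.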

3. Control Hazards

Disruptions in instruction flow due to branches and jumps.

BEQ R1, R2, label  # Branch if equal
ADD R3, R4, R5     # Fetched speculatively
SUB R6, R7, R8     # May need to be flushed

Impact:

15-20% of instructions are branches
Deep pipelines = high misprediction penalty
Critical for performance

Hazard Detection Mechanisms

1. Combinational Logic Detection

Simple pipelines use combinational logic to detect hazards:

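// RAW hazard detection for a pipeline without forwarding.
// id_rs1/id_rs2: source registers of the instruction in ID.
// ex_rd/mem_rd: destination registers of the instructions in EX and MEM
// (assumed to be driven to 0 for instructions that do not write a register).
wire raw_hazard_ex = (id_rs1 == ex_rd && ex_rd != 0) ||
                     (id_rs2 == ex_rd && ex_rd != 0);

wire raw_hazard_mem = (id_rs1 == mem_rd && mem_rd != 0) ||
                      (id_rs2 == mem_rd && mem_rd != 0);

// Hold the instruction in ID while its producer is still in EX or MEM
wire stall = raw_hazard_ex || raw_hazard_mem;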

2. Scoreboarding

Track instruction status and dependencies dynamically:

Scoreboard Table:
┌─────────┬────────┬────────┬────────┬─────────┐
│ FU      │ Busy   │ Op     │ Dest   │ Sources │
├─────────┼────────┼────────┼────────┼─────────┤
│ ALU1    │ Yes    │ ADD    │ R1     │ R2, R3  │
│ ALU2    │ No     │ -      │ -      │ -       │
│ Load    │ Yes    │ LOAD   │ R4     │ R5+100  │
│ Store   │ No     │ -      │ -      │ -       │
└─────────┴────────┴────────┴────────┴─────────┘

Register Status:
R1: ALU1 (writing)
R4: Load (writing)

Scoreboard Algorithm:

Issue: Check for structural and WAW hazards
Read Operands: Wait for RAW hazards to clear
Execute: Perform operation when operands ready
Write Result: Check for WAR hazards
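
The first two scoreboard stages can be sketched in a few lines of Python. This is a simplified model, not a full scoreboard: `can_issue` and `operands_ready` are hypothetical names, and the state dictionaries simply mirror the table and register status shown above.

```python
def can_issue(unit, dest, fu_busy, reg_writer):
    """Issue stage: stall on a structural hazard (unit busy) or a WAW hazard
    (an in-flight instruction is already going to write `dest`)."""
    if fu_busy[unit]:
        return False, "structural"
    if dest in reg_writer:
        return False, "WAW"
    return True, None


def operands_ready(srcs, reg_writer):
    """Read Operands stage: wait while any source has a pending writer (RAW)."""
    return all(src not in reg_writer for src in srcs)


# State mirroring the scoreboard table and register status above
fu_busy = {"ALU1": True, "ALU2": False, "Load": True, "Store": False}
reg_writer = {"R1": "ALU1", "R4": "Load"}

print(can_issue("ALU2", "R6", fu_busy, reg_writer))  # (True, None)
print(can_issue("ALU2", "R1", fu_busy, reg_writer))  # (False, 'WAW')
print(can_issue("ALU1", "R6", fu_busy, reg_writer))  # (False, 'structural')
print(operands_ready(["R1", "R5"], reg_writer))      # False
```

The WAR check at Write Result is omitted here; it would additionally need to track which issued instructions have not yet read their operands.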

3. Tomasulo's Algorithm

More sophisticated out-of-order execution with register renaming:

Reservation Stations:
┌──────┬──────┬────┬────┬──────┬──────┬──────┐
│ Name │ Busy │ Op │ Vj │ Vk   │ Qj   │ Qk   │
├──────┼──────┼────┼────┼──────┼──────┼──────┤
│ RS1  │ Yes  │ADD │ 10 │ 20   │ -    │ -    │
│ RS2  │ Yes  │SUB │ -  │ 15   │ RS1  │ -    │
│ RS3  │ No   │ -  │ -  │ -    │ -    │ -    │
└──────┴──────┴────┴────┴──────┴──────┴──────┘

Common Data Bus (CDB): Broadcasts results

Key Features:

Distributed hazard detection
Dynamic scheduling
Register renaming via reservation stations
Eliminates WAR and WAW hazards
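
As a rough illustration of renaming through reservation stations, the Python sketch below issues two instructions and then broadcasts a result on the common data bus. The helper names `issue` and `broadcast`, and the initial register values, are assumptions chosen to reproduce the RS1 and RS2 rows of the table above; functional units and station release are left out.

```python
def issue(name, op, dest, src1, src2, regs, reg_status, stations):
    """Fill reservation station `name`, capturing either a value (V) or the
    tag of the station that will produce it (Q) for each source."""
    entry = {"op": op, "Vj": None, "Vk": None, "Qj": None, "Qk": None}
    for src, v_field, q_field in ((src1, "Vj", "Qj"), (src2, "Vk", "Qk")):
        if src in reg_status:
            entry[q_field] = reg_status[src]  # value still being computed
        else:
            entry[v_field] = regs[src]        # value available now
    stations[name] = entry
    reg_status[dest] = name  # renaming: dest is now identified by this tag


def broadcast(tag, value, regs, reg_status, stations):
    """CDB broadcast: waiting stations capture the value; the register file
    is updated only where `tag` is still the latest writer."""
    for entry in stations.values():
        if entry["Qj"] == tag:
            entry["Vj"], entry["Qj"] = value, None
        if entry["Qk"] == tag:
            entry["Vk"], entry["Qk"] = value, None
    for reg, producer in list(reg_status.items()):
        if producer == tag:
            regs[reg] = value
            del reg_status[reg]


regs = {"R2": 10, "R3": 20, "R5": 15}  # assumed initial values
reg_status, stations = {}, {}
issue("RS1", "ADD", "R1", "R2", "R3", regs, reg_status, stations)
issue("RS2", "SUB", "R4", "R1", "R5", regs, reg_status, stations)
print(stations["RS2"]["Qj"], stations["RS2"]["Vk"])  # RS1 15
broadcast("RS1", 30, regs, reg_status, stations)
print(stations["RS2"]["Vj"], regs["R1"])             # 30 30
```

Because a later writer of the same register overwrites the entry in `reg_status`, an older result never clobbers a newer one, and operands are captured at issue time, so WAW and WAR orderings stop mattering.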

Data Forwarding (Bypassing)

Forward results directly from pipeline stages without waiting for writeback:

Forwarding Paths

Pipeline Stages:
┌────┐ ┌────┐ ┌────┐ ┌─────┐ ┌────┐
│ IF │→│ ID │→│ EX │→│ MEM │→│ WB │
└────┘ └────┘ └────┘ └─────┘ └────┘
                ↑   │       │
                └───┴───────┘
Forwarding paths: EX/MEM and MEM/WB results back to the EX inputs

Forwarding Logic

ForwardA = 0; // Default: operand comes from the register file

// EX/MEM to EX forwarding
if (EX_MEM.RegWrite && 
    EX_MEM.RegisterRd != 0 &&
    EX_MEM.RegisterRd == ID_EX.RegisterRs1) {
    ForwardA = 2; // Forward from EX/MEM
}

// MEM/WB to EX forwarding (only if EX/MEM does not hold a newer value)
if (MEM_WB.RegWrite && 
    MEM_WB.RegisterRd != 0 &&
    MEM_WB.RegisterRd == ID_EX.RegisterRs1 &&
    !(EX_MEM.RegWrite && EX_MEM.RegisterRd == ID_EX.RegisterRs1)) {
    ForwardA = 1; // Forward from MEM/WB
}

// ForwardB is derived the same way, comparing against ID_EX.RegisterRs2

Forwarding Priority

When multiple stages can forward:

Most recent value takes priority
EX/MEM over MEM/WB
Check for register 0 (hardwired to zero)
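
The priority rules can be captured in one small function. The Python sketch below is a behavioral model under the same encoding as the forwarding logic above (2 = EX/MEM, 1 = MEM/WB, 0 = register file); `forward_select` and the dictionary layout of the pipeline registers are illustrative assumptions.

```python
def forward_select(rs, ex_mem, mem_wb):
    """Choose the source for operand register `rs` in the EX stage.

    ex_mem and mem_wb are dicts with keys "reg_write" (bool) and "rd" (int).
    Returns 2 for EX/MEM, 1 for MEM/WB, 0 for the register file.
    """
    # The most recent producer (EX/MEM) is checked first, so it wins ties.
    if ex_mem["reg_write"] and ex_mem["rd"] != 0 and ex_mem["rd"] == rs:
        return 2
    if mem_wb["reg_write"] and mem_wb["rd"] != 0 and mem_wb["rd"] == rs:
        return 1
    return 0


newer = {"reg_write": True, "rd": 1}
older = {"reg_write": True, "rd": 1}
print(forward_select(1, newer, older))  # 2: EX/MEM holds the newest R1
print(forward_select(1, {"reg_write": False, "rd": 1}, older))  # 1
print(forward_select(0, {"reg_write": True, "rd": 0}, older))   # 0
```

Calling the function once per source operand yields both ForwardA and ForwardB.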

Advanced Hazard Detection

1. Load-Use Hazards

A special case that requires a one-cycle stall even with forwarding, because the loaded value is not available until the end of the MEM stage:

LOAD R1, 0(R2)    # Load into R1
ADD  R3, R1, R4   # Uses R1 immediately

Detection:

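// ID_EX holds the load; IF_ID holds the instruction that follows it
if (ID_EX.MemRead &&
    ID_EX.RegisterRd != 0 &&
    ((ID_EX.RegisterRd == IF_ID.RegisterRs1) ||
     (ID_EX.RegisterRd == IF_ID.RegisterRs2))) {
    stall_pipeline = true;  // Insert one bubble; forwarding covers the rest
}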

2. Memory Hazards

Store-Load dependencies through memory:

STORE R1, 0(R2)   # Store to address
LOAD  R3, 0(R2)   # Load from same address

Solutions:

Store-Load forwarding
Memory disambiguation
Load speculation with verification
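
Store-to-load forwarding amounts to searching the pending stores, youngest first, before going to memory. The Python sketch below assumes all store addresses are already known (so no disambiguation or speculation is needed) and that accesses are the same size and aligned; `load_with_forwarding` and the list-based store buffer are hypothetical simplifications.

```python
def load_with_forwarding(addr, store_buffer, memory):
    """Return (value, source) for a load from `addr`.

    store_buffer: list of (address, value) for stores not yet written to
    memory, ordered oldest to youngest.
    """
    # Search youngest-first so the most recent store to the address wins.
    for st_addr, st_value in reversed(store_buffer):
        if st_addr == addr:
            return st_value, "store buffer"
    return memory.get(addr, 0), "memory"


memory = {0x100: 1}
store_buffer = [(0x100, 7), (0x200, 9), (0x100, 8)]
print(load_with_forwarding(0x100, store_buffer, memory))  # (8, 'store buffer')
print(load_with_forwarding(0x300, store_buffer, memory))  # (0, 'memory')
```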

3. Cross-Iteration Dependencies

Loop-carried dependencies:

for (i = 1; i < n; i++) {
    a[i] = a[i-1] + b[i];  // RAW dependency on the previous iteration
}

Techniques:

Software pipelining
Loop unrolling
Modulo scheduling

Hardware Implementation

Dependency Check Matrix

For N-way superscalar, check all instruction pairs:

     I0  I1  I2  I3
I0   -   ✓   ✓   ✓
I1   x   -   ✓   ✓
I2   x   x   -   ✓
I3   x   x   x   -

✓ = Check needed
x = Already checked

Complexity: O(N²) comparisons per cycle
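
The quadratic growth is easy to see by enumerating the checks. This Python sketch compares each younger instruction's sources against every older instruction's destination within one issue group; `cross_checks` is an illustrative name, and two source registers per instruction is an assumption.

```python
from itertools import combinations


def cross_checks(group):
    """group: list of (dest, srcs) for instructions issued in the same cycle.

    Returns (deps, comparisons), where deps lists (older, younger, register)
    RAW dependencies and comparisons counts register comparisons performed.
    """
    deps = []
    comparisons = 0
    for (i, older), (j, younger) in combinations(enumerate(group), 2):
        for src in younger[1]:
            comparisons += 1
            if src == older[0]:
                deps.append((i, j, src))
    return deps, comparisons


group = [
    ("R1", ("R2", "R3")),  # ADD R1, R2, R3
    ("R4", ("R1", "R5")),  # SUB R4, R1, R5
    ("R6", ("R4", "R7")),  # AND R6, R4, R7
    ("R8", ("R6", "R9")),  # OR  R8, R6, R9
]
deps, comparisons = cross_checks(group)
print(deps)         # [(0, 1, 'R1'), (1, 2, 'R4'), (2, 3, 'R6')]
print(comparisons)  # 12
```

For N instructions with two sources each, the count is N(N-1)/2 pairs times 2 comparisons, so a 4-wide group needs 12 comparators and an 8-wide group needs 56.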

CAM-Based Detection

Content-Addressable Memory for fast lookups:

Register Tag CAM:
┌─────┬──────────┐
│ Tag │ Producer │
├─────┼──────────┤
│ R1  │ ROB #5   │
│ R2  │ ROB #3   │
│ R3  │ Ready    │
└─────┴──────────┘

Parallel search all entries

Performance Impact

Hazard Frequency

Typical program characteristics (rough figures that vary with workload and microarchitecture):

RAW hazards: 20-25% of instructions
Control hazards: 15-20% (branches)
Structural hazards: < 5% (with good design)
WAR/WAW: < 5% (in-order) or eliminated (OoO)

CPI Impact

CPI_actual = CPI_ideal + Stalls_structural + Stalls_data + Stalls_control

Where:

CPI_ideal = 1.0 for scalar pipeline
Stalls_data ≈ 0.1-0.3 with forwarding
Stalls_control ≈ 0.1-0.2 with good prediction
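
A short worked calculation shows how the terms combine. In the Python sketch below, the control term is modeled as branch fraction × misprediction rate × penalty; that decomposition and the specific inputs (20% branches, 5% mispredicted, 15-cycle penalty, 0.2 data stall cycles per instruction) are assumptions for illustration, not measurements.

```python
def control_stalls(branch_fraction, mispredict_rate, penalty_cycles):
    """Average stall cycles per instruction caused by mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles


def actual_cpi(cpi_ideal=1.0, stalls_structural=0.0,
               stalls_data=0.0, stalls_control=0.0):
    """CPI_actual = CPI_ideal + per-instruction stall cycles of each kind."""
    return cpi_ideal + stalls_structural + stalls_data + stalls_control


# Assumed inputs: 20% branches, 5% mispredicted, 15-cycle flush penalty
ctrl = control_stalls(0.20, 0.05, 15)
cpi = actual_cpi(stalls_data=0.2, stalls_control=ctrl)
print(round(ctrl, 2), round(cpi, 2))  # 0.15 1.35
```

Under these assumptions the pipeline runs about 35% slower than the ideal of one instruction per cycle.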

Compiler Techniques

1. Instruction Scheduling

Reorder to minimize hazards:

# Original (2 stalls)
LOAD R1, 0(R2)
ADD  R3, R1, R4   # Stall
LOAD R5, 4(R2)
ADD  R6, R5, R7   # Stall

# Scheduled (0 stalls)
LOAD R1, 0(R2)
LOAD R5, 4(R2)
ADD  R3, R1, R4   # No stall
ADD  R6, R5, R7   # No stall
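
The stall counts in the two listings can be checked mechanically. The Python sketch below assumes a five-stage pipeline with full forwarding, where the only remaining data stall is one cycle when an instruction reads the destination of the load immediately before it; `count_load_use_stalls` and the tuple encoding are illustrative.

```python
def count_load_use_stalls(program):
    """program: list of (op, dest, srcs) in program order.

    Counts adjacent pairs where a LOAD's destination is read by the very
    next instruction (one stall cycle each, assuming full forwarding).
    """
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        if prev[0] == "LOAD" and prev[1] in cur[2]:
            stalls += 1
    return stalls


original = [
    ("LOAD", "R1", ("R2",)),
    ("ADD", "R3", ("R1", "R4")),
    ("LOAD", "R5", ("R2",)),
    ("ADD", "R6", ("R5", "R7")),
]
scheduled = [original[0], original[2], original[1], original[3]]

print(count_load_use_stalls(original))   # 2
print(count_load_use_stalls(scheduled))  # 0
```

A compiler's list scheduler performs essentially this check while choosing which independent instruction to place after each load.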

2. Software Pipelining

Overlap loop iterations:

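// Original loop
for (i = 0; i < n; i++) {
    load(a[i]);
    compute(i);
    store(b[i]);
}

// Software pipelined (assumes n >= 1)
load(a[0]);                // Prologue: first load
for (i = 1; i < n; i++) {
    compute(i-1);          // Work on the previously loaded element
    load(a[i]);            // Overlaps with compute(i-1)
    store(b[i-1]);
}
compute(n-1);              // Epilogue: finish the last iteration
store(b[n-1]);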
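
The transformation only reorders work, so both versions must produce the same output. The Python sketch below checks that for a stand-in computation; the `x * 2 + 1` body is a placeholder, and in Python the reordering brings no speedup, it only demonstrates the prologue, steady-state, and epilogue structure.

```python
def original(a):
    b = [0] * len(a)
    for i in range(len(a)):
        x = a[i]         # load
        y = x * 2 + 1    # compute (placeholder operation)
        b[i] = y         # store
    return b


def pipelined(a):
    n = len(a)
    b = [0] * n
    if n == 0:
        return b
    x = a[0]                 # prologue: load for iteration 0
    for i in range(1, n):
        y = x * 2 + 1        # compute for iteration i - 1
        x = a[i]             # load for iteration i
        b[i - 1] = y         # store for iteration i - 1
    y = x * 2 + 1            # epilogue: compute for iteration n - 1
    b[n - 1] = y             # epilogue: store for iteration n - 1
    return b


data = [3, 1, 4, 1, 5]
print(original(data) == pipelined(data))  # True
```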

3. Predication

Convert control dependencies to data dependencies:

# Branching version
CMP  R1, R2
BNE  skip
ADD  R3, R4, R5
skip:

# Predicated version
CMP    R1, R2
ADDEQ  R3, R4, R5  # Execute if equal

Modern Hazard Detection Examples

Intel Skylake

224-entry reorder buffer
97-entry scheduler
8 execution ports (ports 0-7)
Zero-cycle register move
Memory disambiguation predictor

AMD Zen 3

256-entry reorder buffer
Improved branch predictor
Op cache for decoded instructions
Enhanced load/store unit

ARM Cortex-A78

Out-of-order execution
Macro-op fusion
Complex branch predictor
Load/store clustering

Best Practices

1. Algorithm Level

Minimize dependencies in inner loops
Use cache-friendly access patterns
Reduce unpredictable branches

2. Code Level

// Avoid tight dependencies
sum = a + b + c + d;  // Evaluated as ((a + b) + c) + d: three dependent adds

// Better: tree reduction
t1 = a + b;
t2 = c + d;
sum = t1 + t2;  // Parallel execution
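
The benefit grows with the number of operands. This Python sketch compares the length of the dependency chain (the number of additions that must happen one after another) for a left-to-right sum and a balanced tree; the function names are illustrative, and it assumes enough adders to run independent additions in parallel.

```python
def chain_depth(n):
    """Dependent additions on the critical path of ((x0 + x1) + x2) + ..."""
    return max(n - 1, 0)


def tree_depth(n):
    """Levels in a balanced pairwise reduction, i.e. ceil(log2(n))."""
    return (n - 1).bit_length() if n > 0 else 0


for n in (4, 8, 16):
    print(n, chain_depth(n), tree_depth(n))
# 4 3 2
# 8 7 3
# 16 15 4
```

For floating-point data the two orders can round differently, which is why compilers only reassociate such sums when explicitly allowed to.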

3. Compiler Flags

# GCC/Clang
-O3                    # Aggressive optimization
-march=native          # Use all features of the build machine's CPU
-ffast-math            # Allow FP reassociation (can change numerical results)
-funroll-loops         # Fewer loop branches per unit of work

# Profile-guided optimization
gcc -fprofile-generate prog.c
./a.out  # Run with typical data
gcc -fprofile-use prog.c

Debugging Hazards

Performance Counters

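# Event names are specific to certain Intel microarchitectures;
# run `perf list` to see which events the current CPU supports.
perf stat \
  -e resource_stalls.any \
  -e resource_stalls.sb \
  -e resource_stalls.rs \
  -e int_misc.rat_stall_cycles \
  -e cycle_activity.stalls_total \
  ./program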

Intel VTune Metrics

Pipeline slots analysis
Dependency chains
Port utilization
Stall reasons

Future Directions

1. Machine Learning

Neural hazard predictors
Dynamic scheduling optimization
Workload-specific adaptation

2. Quantum Computing

Superposition of states
No classical hazards
New paradigm needed

3. Neuromorphic Computing

Event-driven execution
Asynchronous operation
Different hazard model

Understanding hazard detection connects to:

CPU Pipelines: Basic pipeline operation
Branch Prediction: Control hazard mitigation
Memory Access Patterns: Memory hazards
Thread Safety: Multi-threaded hazards
Out-of-Order Execution: Advanced scheduling

Conclusion

Hazard detection is the unsung hero of modern CPU performance. While pipelines promise parallel execution, hazards threaten to serialize it. Through sophisticated detection mechanisms, forwarding paths, and dynamic scheduling, modern processors achieve remarkable performance despite frequent dependencies. Understanding these mechanisms is crucial for both hardware designers creating efficient CPUs and software developers writing high-performance code.
