CPU Performance & Optimization
Understanding CPU cycles, memory hierarchy, cache optimization, and performance analysis techniques
- Cache-friendly access patterns can improve performance 10-100x
- Modern CPUs can execute 4+ instructions per cycle through pipelining and superscalar execution
- Memory latency hasn't improved as fast as CPU speed (the "memory wall")
- Profile before optimizing: premature optimization is evil
- Consider algorithmic improvements before micro-optimizations
Understanding Performance
Performance optimization is both an art and a science. Modern CPUs are incredibly complex, with multiple levels of caches, out-of-order execution, branch prediction, and SIMD instructions.
CPU Pipeline
Modern CPUs use pipelining to execute multiple instructions simultaneously:
- Fetch: Get instruction from memory
- Decode: Interpret the instruction
- Execute: Perform the operation
- Memory: Access memory if needed
- Write Back: Store results
Pipeline Hazards
- Data Hazards: When an instruction depends on the result of a previous instruction (see the sketch after this list)
- Control Hazards: Branch instructions that change program flow
- Structural Hazards: Hardware resource conflicts
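Data hazards are the easiest to see in ordinary code. The sketch below (illustrative, not from any particular codebase) sums an array two ways: a single accumulator forms one long dependency chain, while two independent accumulators give the out-of-order core separate chains it can overlap.

```cpp
#include <cstddef>

// One accumulator: every add waits on the previous add (a chain of data hazards)
double sum_chained(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];              // each iteration depends on the last
    return sum;
}

// Two accumulators: independent dependency chains the pipeline can overlap
double sum_two_chains(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n) s0 += a[i];        // handle an odd element count
    return s0 + s1;
}
```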
Memory Hierarchy
The memory hierarchy is crucial for performance:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| Registers | ~1KB | 0-1 cycles | Highest |
| L1 Cache | 32-64KB | 1-3 cycles | Very High |
| L2 Cache | 256KB-1MB | 10-20 cycles | High |
| L3 Cache | 8-32MB | 40-75 cycles | Medium |
| RAM | GBs | 200+ cycles | Low |
| Disk | TBs | Millions of cycles | Very Low |
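As a rough, self-contained sketch of how these latencies show up in practice, the classic pointer-chasing benchmark below (sizes and names are illustrative) walks a random cycle through a buffer far larger than the caches, so each load depends on the previous one and the average time per step approaches main-memory latency.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    // 16M 8-byte slots = 128 MB, far larger than any L3 in the table above
    const std::size_t n = std::size_t{1} << 24;
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's algorithm: build one random cycle over all slots, so the chase
    // below visits every element and cannot settle into a small cached loop
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Each load depends on the previous one, so the CPU cannot overlap misses
    std::size_t idx = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t step = 0; step < n; ++step)
        idx = next[idx];
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.1f ns per dependent load (end index %zu)\n", ns / n, idx);
    return 0;
}
```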
Cache Lines
- Typical cache line size: 64 bytes
- Data is transferred in cache line units
- Spatial Locality: Access nearby data (see the layout sketch after this list)
- Temporal Locality: Reuse recently accessed data
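A common way to exploit spatial locality is to choose a data layout that keeps the fields you actually touch contiguous. The hypothetical sketch below contrasts an array-of-structs with a struct-of-arrays when only one field is summed.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: summing x drags the unused y/z/w fields into cache
struct ParticleAoS { float x, y, z, w; };

float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i)
        sum += p[i].x;            // 4 useful bytes out of every 16 bytes loaded
    return sum;
}

// Struct-of-arrays: the x values are packed, so every cached byte is useful
struct ParticlesSoA { std::vector<float> x, y, z, w; };

float sum_x_soa(const ParticlesSoA& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i)
        sum += p.x[i];
    return sum;
}
```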
Optimization Techniques
1. Loop Unrolling
Reduce loop overhead by processing multiple iterations per loop:
```cpp
// Before
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// After (unrolled by 4; assumes n is a multiple of 4, otherwise
// handle the leftover elements in a scalar tail loop)
for (int i = 0; i < n; i += 4) {
    sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3];
}
```
Benefits: Reduces branch overhead, enables better instruction-level parallelism
Drawbacks: Increased code size and complexity
2. Cache Blocking (Tiling)
Process data in cache-sized blocks:
```cpp
#include <algorithm>  // std::min

// Matrix multiplication with blocking: one BLOCK_SIZE x BLOCK_SIZE tile of
// A, B, and C is reused while it is still resident in cache
const int BLOCK_SIZE = 64;  // tune so a tile's working set fits in L1/L2

for (int ii = 0; ii < n; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
        for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
            // Process one block
            for (int i = ii; i < std::min(ii + BLOCK_SIZE, n); i++) {
                for (int j = jj; j < std::min(jj + BLOCK_SIZE, n); j++) {
                    for (int k = kk; k < std::min(kk + BLOCK_SIZE, n); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
```
3. SIMD Vectorization
Process multiple data elements with single instruction:
```cpp
// Using intrinsics: process 8 floats at once with AVX
#include <immintrin.h>

// Assumes a, b, result are 32-byte aligned; use _mm256_loadu_ps /
// _mm256_storeu_ps for unaligned data, and a scalar loop for the tail
for (int i = 0; i + 8 <= n; i += 8) {
    __m256 vec_a = _mm256_load_ps(&a[i]);
    __m256 vec_b = _mm256_load_ps(&b[i]);
    __m256 vec_result = _mm256_add_ps(vec_a, vec_b);
    _mm256_store_ps(&result[i], vec_result);
}
```
4. Prefetching
Bring data into cache before it's needed:
```cpp
// Manual prefetching with the GCC/Clang built-in
for (int i = 0; i < n; i++) {
    // Prefetch 64 elements ahead (0 = read, 1 = low temporal locality);
    // prefetching past the end of the array does not fault
    __builtin_prefetch(&arr[i + 64], 0, 1);
    process(arr[i]);
}
```
Access Patterns Matter
Sequential vs Random Access
```cpp
// Good: Sequential access (cache-friendly)
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// Bad: Random access (cache-unfriendly)
for (int i = 0; i < n; i++) {
    sum += arr[random_indices[i]];
}
```
Row-Major vs Column-Major
```cpp
// C++ stores 2D arrays in row-major order: matrix[i][j] and matrix[i][j+1] are adjacent

// Good: Access along rows (unit stride)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        process(matrix[i][j]);
    }
}

// Bad: Access along columns (stride = row size)
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        process(matrix[i][j]);
    }
}
```
Profiling & Measurement
Tools
- perf (Linux): Hardware performance counters
- VTune (Intel): Detailed CPU analysis
- gprof: Function-level profiling
- cachegrind: Cache simulation
Key Metrics
- IPC (Instructions Per Cycle): Efficiency measure
- Cache Miss Rate: L1/L2/L3 miss percentages
- Branch Misprediction Rate: Pipeline stall frequency
- Memory Bandwidth Utilization: GB/s used vs available
Profiling Commands
```bash
# Basic perf stat
perf stat ./program

# Detailed cache analysis
perf stat -e cache-references,cache-misses ./program

# Record and analyze hot spots
perf record ./program
perf report
```
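The key metrics above map directly onto hardware events. Exact event names vary by kernel and CPU, so treat these invocations as illustrative:

```bash
# IPC and branch misprediction rate (perf prints the derived ratios)
perf stat -e cycles,instructions,branches,branch-misses ./program

# L1 data cache and last-level cache miss rates
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
```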
Common Pitfalls
1. False Sharing
Multiple threads modifying different data in same cache line:
```cpp
// Bad: False sharing
struct Counters {
    int thread1_count;  // same cache line
    int thread2_count;  // a write by either thread invalidates the other's copy
};

// Good: Padding so each counter gets its own cache line
// (C++17's std::hardware_destructive_interference_size in <new> gives a portable value)
struct PaddedCounters {
    alignas(64) int thread1_count;  // own cache line
    alignas(64) int thread2_count;  // own cache line
};
```
2. Aliasing
Compiler can't optimize when pointers might alias:
```cpp
// Use the restrict keyword (C99) or the __restrict extension (GCC/Clang/MSVC C++)
void add(const float* __restrict a, const float* __restrict b,
         float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // compiler knows the arrays don't overlap
    }
}
```
3. Premature Optimization
"Premature optimization is the root of all evil" - Donald Knuth
Best Practices:
- Profile first, optimize second
- Focus on algorithmic improvements
- Optimize hot paths (80/20 rule)
- Maintain readable code
Architecture-Specific Optimizations
x86-64 Specifics
- 16 general-purpose registers
- SIMD: SSE (128-bit), AVX (256-bit), AVX-512 (512-bit)
- Out-of-order execution window: hundreds of in-flight instructions (reorder buffers of roughly 200-600 entries on recent cores)
- Branch prediction: typically better than 95% accuracy on ordinary code
ARM Specifics
- 31 general-purpose registers (AArch64)
- NEON SIMD instructions
- Energy-efficient design
- Weaker memory ordering
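When the same kernel has to run on both architectures, a common pattern is to guard the intrinsics with predefined macros and keep a scalar fallback. This is only a sketch: __AVX2__ is defined by GCC/Clang when compiling with -mavx2 (or -march=native on AVX2 hardware), and __ARM_NEON is defined on AArch64 targets.

```cpp
#include <cstddef>

#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {                       // 8 floats per AVX register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {                       // 4 floats per NEON register
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; ++i)                                 // scalar tail / fallback
        out[i] = a[i] + b[i];
}
```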
Benchmark Results
Representative speedups from these optimizations (highly workload- and hardware-dependent):
| Technique | Typical Speedup | Best Case |
|---|---|---|
| Loop Unrolling | 1.5-3x | 5x |
| Cache Blocking | 2-5x | 10x |
| SIMD | 2-8x | 16x |
| Prefetching | 1.2-2x | 3x |
| All Combined | 5-20x | 100x+ |
Conclusion
Performance optimization requires understanding:
- Hardware architecture
- Memory hierarchy
- Compiler capabilities
- Profiling tools
- Trade-offs between optimizations
Remember: Measure, don't guess!
