CPU Performance & Optimization


Understanding CPU cycles, memory hierarchy, cache optimization, and performance analysis techniques



Performance Tips:
  • Cache-friendly access patterns can improve performance 10-100x
  • Modern CPUs execute 4+ instructions per cycle via superscalar, pipelined execution
  • Memory latency hasn't improved as fast as CPU speed (the "memory wall")
  • Profile before optimizing - premature optimization is evil
  • Consider algorithmic improvements before micro-optimizations

Understanding Performance

Performance optimization is both an art and a science. Modern CPUs are incredibly complex, with multiple levels of caches, out-of-order execution, branch prediction, and SIMD instructions.

CPU Pipeline

Modern CPUs use pipelining to overlap the execution of multiple instructions. A classic five-stage pipeline:

  1. Fetch: Get instruction from memory
  2. Decode: Interpret the instruction
  3. Execute: Perform the operation
  4. Memory: Access memory if needed
  5. Write Back: Store results

Pipeline Hazards

  • Data Hazards: When an instruction depends on the result of a previous instruction
  • Control Hazards: Branch instructions that change program flow (see the sketch after this list)
  • Structural Hazards: Hardware resource conflicts
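
Control hazards are the easiest to observe from plain C++. The classic sorted-vs-unsorted experiment below is a minimal sketch (the 128 threshold and array contents are illustrative): the same loop often runs several times faster on sorted input, because the branch becomes predictable and the pipeline stops flushing.

```cpp
#include <algorithm>
#include <cstdlib>
#include <vector>

// The branch below depends on data. With random values the predictor is
// wrong about half the time, flushing the pipeline on every miss; after
// std::sort the outcomes form two long runs and prediction is near-perfect.
long count_big(const std::vector<int>& v) {
    long sum = 0;
    for (int x : v) {
        if (x >= 128) sum += x;   // hard to predict on random data
    }
    return sum;
}

int main() {
    std::vector<int> v(1 << 24);
    for (int& x : v) x = std::rand() % 256;
    // Time count_big(v) here, then sort and time it again:
    std::sort(v.begin(), v.end());
    return count_big(v) > 0 ? 0 : 1;
}
```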

Memory Hierarchy

The memory hierarchy is crucial for performance:

Level     | Size        | Latency            | Bandwidth
Registers | < 1 KB      | 0 cycles           | Unlimited
L1 Cache  | 32-64 KB    | 1-3 cycles         | Very High
L2 Cache  | 256 KB-1 MB | 10-20 cycles       | High
L3 Cache  | 8-32 MB     | 40-75 cycles       | Medium
RAM       | GBs         | 200+ cycles        | Low
Disk      | TBs         | Millions of cycles | Very Low
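
One way to make these latencies visible is a dependent pointer chase, where each load must finish before the next address is known. A minimal sketch (array sizes, hop count, and the RNG seed are illustrative):

```cpp
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Dependent pointer chase: each load supplies the next index, so loads
// cannot overlap and the time per hop approximates raw memory latency.
// Grow `elems` past each cache size and watch ns/hop jump.
double ns_per_hop(size_t elems, size_t hops) {
    std::vector<size_t> order(elems), next(elems);
    std::iota(order.begin(), order.end(), 0);
    std::shuffle(order.begin(), order.end(), std::mt19937_64{42});
    for (size_t i = 0; i < elems; i++)            // link into one big cycle
        next[order[i]] = order[(i + 1) % elems];
    size_t idx = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t h = 0; h < hops; h++) idx = next[idx];
    auto t1 = std::chrono::steady_clock::now();
    volatile size_t sink = idx; (void)sink;       // defeat dead-code elimination
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / hops;
}

int main() {
    for (size_t kb : {16, 256, 4096, 131072})     // spans L1 -> L2 -> L3 -> RAM
        std::printf("%8zu KB: %6.1f ns/hop\n", kb,
                    ns_per_hop(kb * 1024 / sizeof(size_t), 10'000'000));
}
```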

Cache Lines

  • Typical cache line size: 64 bytes (see the stride sketch after this list)
  • Data is transferred in cache line units
  • Spatial Locality: Access nearby data
  • Temporal Locality: Reuse recently accessed data
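
The 64-byte line is easy to observe: sum one element per stride over a fixed array and vary the stride. A minimal sketch (array size and stride values are illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Sum one int per `stride` over a fixed 256 MB array. Strides 1..16 touch
// the same set of 64-byte cache lines, so they take roughly the same time
// even though stride 16 does 16x fewer additions; past 16 ints (one line),
// each doubling of the stride halves the lines fetched and time drops.
int main() {
    std::vector<int> v(64 * 1024 * 1024, 1);   // 256 MB, far larger than L3
    for (int stride : {1, 2, 4, 8, 16, 32, 64}) {
        long sum = 0;
        auto t0 = std::chrono::steady_clock::now();
        for (size_t i = 0; i < v.size(); i += stride) sum += v[i];
        auto t1 = std::chrono::steady_clock::now();
        std::printf("stride %2d: %7.1f ms (sum=%ld)\n", stride,
                    std::chrono::duration<double, std::milli>(t1 - t0).count(),
                    sum);
    }
}
```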

Optimization Techniques

1. Loop Unrolling

Reduce loop overhead by doing the work of several iterations in each pass:

```cpp
// Before
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// After (unrolled by 4, with a remainder loop for n not divisible by 4)
int i = 0;
for (; i + 3 < n; i += 4) {
    sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3];
}
for (; i < n; i++) {
    sum += arr[i];
}
```

Benefits: Reduces branch overhead, enables better instruction-level parallelism.
Drawbacks: Increased code size and complexity.
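
Note that the unrolled loop above still funnels every add into one `sum`, forming a serial dependency chain (a data hazard). A variant worth knowing, sketched here with illustrative names, uses independent accumulators so more additions can be in flight per cycle:

```cpp
// Unroll by 4 with four independent accumulators: the adds no longer form
// one serial dependency chain, so the pipeline can overlap them.
// (Reordering float additions can change the rounded result slightly.)
float sum4(const float* arr, int n) {
    float s0 = 0.f, s1 = 0.f, s2 = 0.f, s3 = 0.f;
    int i = 0;
    for (; i + 3 < n; i += 4) {
        s0 += arr[i];
        s1 += arr[i + 1];
        s2 += arr[i + 2];
        s3 += arr[i + 3];
    }
    float sum = s0 + s1 + s2 + s3;
    for (; i < n; i++) sum += arr[i];  // remainder
    return sum;
}
```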

2. Cache Blocking (Tiling)

Process data in cache-sized blocks:

```cpp
// Matrix multiplication with blocking: the three active tiles of A, B, and C
// stay resident in cache while the inner loops reuse them.
for (int ii = 0; ii < n; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
        for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
            // Process one block
            for (int i = ii; i < std::min(ii + BLOCK_SIZE, n); i++) {
                for (int j = jj; j < std::min(jj + BLOCK_SIZE, n); j++) {
                    for (int k = kk; k < std::min(kk + BLOCK_SIZE, n); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
```
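
A note on the design choice: BLOCK_SIZE is usually sized so the three active tiles fit together in one cache level. With 8-byte doubles, 32x32 tiles take 3 x 8 KB = 24 KB, which fits a 32 KB L1; the best value is machine- and workload-specific, so treat any constant as a starting point and tune empirically.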

3. SIMD Vectorization

Process multiple data elements with a single instruction:

```cpp
#include <immintrin.h>

// Add two float arrays 8 elements at a time with AVX intrinsics.
// _mm256_loadu_ps tolerates unaligned pointers; _mm256_load_ps requires
// 32-byte alignment and faults otherwise.
void add_avx(const float* a, const float* b, float* result, int n) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 vec_a = _mm256_loadu_ps(&a[i]);
        __m256 vec_b = _mm256_loadu_ps(&b[i]);
        __m256 vec_result = _mm256_add_ps(vec_a, vec_b);
        _mm256_storeu_ps(&result[i], vec_result);
    }
    for (; i < n; i++) result[i] = a[i] + b[i];  // scalar remainder
}
```
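
Before reaching for intrinsics, check what the compiler already does: GCC and Clang auto-vectorize simple loops at higher optimization levels (reports via -fopt-info-vec on GCC or -Rpass=loop-vectorize on Clang), and the code above must be compiled with AVX enabled, e.g. -mavx or -march=native.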

4. Prefetching

Bring data into cache before it's needed:

```cpp
// Manual prefetching: request data ~64 elements ahead while working on
// the current element. Arguments: address, rw (0 = read), locality (0-3).
for (int i = 0; i < n; i++) {
    __builtin_prefetch(&arr[i + 64], 0, 1);
    process(arr[i]);
}
```
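
The prefetch distance (64 elements here) needs tuning: too short and the data still hasn't arrived when it is used, too long and the line may be evicted again before use. Hardware prefetchers already handle plain sequential streams well, so manual prefetching pays off mainly for irregular but predictable access patterns. (__builtin_prefetch is a GCC/Clang builtin and never faults, even past the end of the array.)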

Access Patterns Matter

Sequential vs Random Access

```cpp
// Good: Sequential access (cache-friendly)
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// Bad: Random access (cache-unfriendly)
for (int i = 0; i < n; i++) {
    sum += arr[random_indices[i]];
}
```

Row-Major vs Column-Major

```cpp
// C++ stores 2D arrays in row-major order

// Good: Traverse along rows (consecutive addresses)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        process(matrix[i][j]);
    }
}

// Bad: Traverse along columns (stride = one full row per step)
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        process(matrix[i][j]);
    }
}
```

Profiling & Measurement

Tools

  1. perf (Linux): Hardware performance counters
  2. VTune (Intel): Detailed CPU analysis
  3. gprof: Function-level profiling
  4. Cachegrind (Valgrind): Cache simulation

Key Metrics

  • IPC (Instructions Per Cycle): Efficiency measure
  • Cache Miss Rate: L1/L2/L3 miss percentages
  • Branch Misprediction Rate: Pipeline stall frequency
  • Memory Bandwidth Utilization: GB/s used vs available (estimated by hand in the sketch after this list)
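
The last metric is easy to estimate without a profiler: time a streaming pass over a large array and divide bytes moved by seconds. A minimal sketch (the 1 GB array size is illustrative):

```cpp
#include <chrono>
#include <cstdio>
#include <numeric>
#include <vector>

// Stream ~1 GB of doubles through a reduction and report achieved GB/s.
// Compare the result against your platform's theoretical DRAM bandwidth.
int main() {
    std::vector<double> v(128 * 1024 * 1024, 1.0);   // ~1 GB
    auto t0 = std::chrono::steady_clock::now();
    double sum = std::accumulate(v.begin(), v.end(), 0.0);
    auto t1 = std::chrono::steady_clock::now();
    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gb = v.size() * sizeof(double) / 1e9;
    std::printf("%.2f GB/s (sum=%.0f)\n", gb / secs, sum);
}
```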

Profiling Commands

```bash
# Basic counters
perf stat ./program

# Detailed cache analysis
perf stat -e cache-references,cache-misses ./program

# Record and analyze a profile
perf record ./program
perf report
```

Common Pitfalls

1. False Sharing

Multiple threads modifying different data in the same cache line:

```cpp
// Bad: False sharing. Both counters live in the same 64-byte cache line,
// so each thread's writes invalidate the line in the other thread's cache.
struct Counter {
    int thread1_count;
    int thread2_count;
};

// Good: Pad each counter onto its own cache line.
struct PaddedCounter {
    alignas(64) int thread1_count;
    alignas(64) int thread2_count;
};
```
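
Instead of hard-coding 64, C++17 provides a portable constant for this padding. A minimal sketch, assuming your standard library ships it (support landed late in some toolchains):

```cpp
#include <new>  // std::hardware_destructive_interference_size (C++17)

struct PaddedCounter {
    alignas(std::hardware_destructive_interference_size) int thread1_count;
    alignas(std::hardware_destructive_interference_size) int thread2_count;
};
```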

2. Aliasing

The compiler can't optimize aggressively when pointers might alias:

```cpp
// restrict (C99) and __restrict (a common C/C++ extension) promise the
// compiler that these pointers never alias each other.
void add(float* __restrict a, float* __restrict b,
         float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // safe to vectorize: no overlap possible
    }
}
```

3. Premature Optimization

"Premature optimization is the root of all evil" - Donald Knuth

Best Practices:

  1. Profile first, optimize second
  2. Focus on algorithmic improvements
  3. Optimize hot paths (80/20 rule)
  4. Maintain readable code

Architecture-Specific Optimizations

x86-64 Specifics

  • 16 general-purpose registers
  • SIMD: SSE (128-bit), AVX (256-bit), AVX-512 (512-bit)
  • Out-of-order execution window: roughly 200-500 instructions on recent cores (reorder buffer)
  • Branch prediction: typically above 95% accuracy on real workloads

ARM Specifics

  • 31 general-purpose registers (AArch64: x0-x30)
  • NEON SIMD instructions
  • Energy-efficient design
  • Weaker memory ordering than x86 (explicit barriers are needed more often)

Benchmark Results

Ballpark speedups from these techniques (actual gains are highly workload-dependent):

Technique      | Typical Speedup | Best Case
Loop Unrolling | 1.5-3x          | 5x
Cache Blocking | 2-5x            | 10x
SIMD           | 2-8x            | 16x
Prefetching    | 1.2-2x          | 3x
All Combined   | 5-20x           | 100x+

Conclusion

Performance optimization requires understanding:

  • Hardware architecture
  • Memory hierarchy
  • Compiler capabilities
  • Profiling tools
  • Trade-offs between optimizations

Remember: Measure, don't guess!
