CPU Performance & Optimization
Understanding CPU cycles, memory hierarchy, cache optimization, and performance analysis techniques
- Cache-friendly access patterns can improve performance 10-100x
- Modern CPUs can execute 4+ instructions per cycle through pipelining and superscalar execution
- Memory latency hasn't improved as fast as CPU speed (the "memory wall")
- Profile before optimizing: premature optimization is evil
- Consider algorithmic improvements before micro-optimizations
Understanding Performance
Performance optimization is both an art and a science. Modern CPUs are incredibly complex, with multiple levels of caches, out-of-order execution, branch prediction, and SIMD instructions.
CPU Pipeline
Modern CPUs use pipelining to execute multiple instructions simultaneously:
- Fetch: Get instruction from memory
- Decode: Interpret the instruction
- Execute: Perform the operation
- Memory: Access memory if needed
- Write Back: Store results
Pipeline Hazards
- Data Hazards: When an instruction depends on the result of a previous instruction (see the sketch after this list)
- Control Hazards: Branch instructions that change program flow
- Structural Hazards: Hardware resource conflicts
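Data hazards are the easiest to see in ordinary code. The sketch below (illustrative, not from any particular codebase) sums an array two ways: a single accumulator forms one long dependency chain, while two independent accumulators give the out-of-order core separate chains it can overlap.

```cpp
#include <cstddef>

// One accumulator: every add waits on the previous add (a chain of data hazards)
double sum_chained(const double* a, std::size_t n) {
    double sum = 0.0;
    for (std::size_t i = 0; i < n; ++i)
        sum += a[i];              // each iteration depends on the last
    return sum;
}

// Two accumulators: independent dependency chains the pipeline can overlap
double sum_two_chains(const double* a, std::size_t n) {
    double s0 = 0.0, s1 = 0.0;
    std::size_t i = 0;
    for (; i + 1 < n; i += 2) {
        s0 += a[i];
        s1 += a[i + 1];
    }
    if (i < n) s0 += a[i];        // handle an odd element count
    return s0 + s1;
}
```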
Memory Hierarchy
The memory hierarchy is crucial for performance:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| Registers | ~1KB | 0-1 cycles | Highest |
| L1 Cache | 32-64KB | 1-3 cycles | Very High |
| L2 Cache | 256KB-1MB | 10-20 cycles | High |
| L3 Cache | 8-32MB | 40-75 cycles | Medium |
| RAM | GBs | 200+ cycles | Low |
| Disk | TBs | Millions of cycles | Very Low |
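As a rough, self-contained sketch of how these latencies show up in practice, the classic pointer-chasing benchmark below (sizes and names are illustrative) walks a random cycle through a buffer far larger than the caches, so each load depends on the previous one and the average time per step approaches main-memory latency.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <numeric>
#include <random>
#include <utility>
#include <vector>

int main() {
    // 16M 8-byte slots = 128 MB, far larger than any L3 in the table above
    const std::size_t n = std::size_t{1} << 24;
    std::vector<std::size_t> next(n);
    std::iota(next.begin(), next.end(), std::size_t{0});

    // Sattolo's algorithm: build one random cycle over all slots, so the chase
    // below visits every element and cannot settle into a small cached loop
    std::mt19937_64 rng{42};
    for (std::size_t i = n - 1; i > 0; --i) {
        std::uniform_int_distribution<std::size_t> pick(0, i - 1);
        std::swap(next[i], next[pick(rng)]);
    }

    // Each load depends on the previous one, so the CPU cannot overlap misses
    std::size_t idx = 0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t step = 0; step < n; ++step)
        idx = next[idx];
    auto stop = std::chrono::steady_clock::now();

    double ns = std::chrono::duration<double, std::nano>(stop - start).count();
    std::printf("%.1f ns per dependent load (end index %zu)\n", ns / n, idx);
    return 0;
}
```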
Cache Lines
- Typical cache line size: 64 bytes
- Data is transferred in cache line units
- Spatial Locality: Access nearby data (see the layout sketch after this list)
- Temporal Locality: Reuse recently accessed data
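A common way to exploit spatial locality is to choose a data layout that keeps the fields you actually touch contiguous. The hypothetical sketch below contrasts an array-of-structs with a struct-of-arrays when only one field is summed.

```cpp
#include <cstddef>
#include <vector>

// Array-of-structs: summing x drags the unused y/z/w fields into cache
struct ParticleAoS { float x, y, z, w; };

float sum_x_aos(const std::vector<ParticleAoS>& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.size(); ++i)
        sum += p[i].x;            // 4 useful bytes out of every 16 bytes loaded
    return sum;
}

// Struct-of-arrays: the x values are packed, so every cached byte is useful
struct ParticlesSoA { std::vector<float> x, y, z, w; };

float sum_x_soa(const ParticlesSoA& p) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < p.x.size(); ++i)
        sum += p.x[i];
    return sum;
}
```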
Optimization Techniques
1. Loop Unrolling
Reduce loop overhead by processing multiple iterations per loop:
```cpp
// Before
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// After (unrolled by 4; assumes n is a multiple of 4, otherwise
// handle the leftover elements in a scalar tail loop)
for (int i = 0; i < n; i += 4) {
    sum += arr[i] + arr[i+1] + arr[i+2] + arr[i+3];
}
```
Benefits: Reduces branch overhead, enables better instruction-level parallelism
Drawbacks: Increased code size and complexity
2. Cache Blocking (Tiling)
Process data in cache-sized blocks:
```cpp
#include <algorithm>  // std::min

// Matrix multiplication with blocking: one BLOCK_SIZE x BLOCK_SIZE tile of
// A, B, and C is reused while it is still resident in cache
const int BLOCK_SIZE = 64;  // tune so a tile's working set fits in L1/L2

for (int ii = 0; ii < n; ii += BLOCK_SIZE) {
    for (int jj = 0; jj < n; jj += BLOCK_SIZE) {
        for (int kk = 0; kk < n; kk += BLOCK_SIZE) {
            // Process one block
            for (int i = ii; i < std::min(ii + BLOCK_SIZE, n); i++) {
                for (int j = jj; j < std::min(jj + BLOCK_SIZE, n); j++) {
                    for (int k = kk; k < std::min(kk + BLOCK_SIZE, n); k++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
```
3. SIMD Vectorization
Process multiple data elements with single instruction:
```cpp
// Using intrinsics: process 8 floats at once with AVX
#include <immintrin.h>

// Assumes a, b, result are 32-byte aligned; use _mm256_loadu_ps /
// _mm256_storeu_ps for unaligned data, and a scalar loop for the tail
for (int i = 0; i + 8 <= n; i += 8) {
    __m256 vec_a = _mm256_load_ps(&a[i]);
    __m256 vec_b = _mm256_load_ps(&b[i]);
    __m256 vec_result = _mm256_add_ps(vec_a, vec_b);
    _mm256_store_ps(&result[i], vec_result);
}
```
4. Prefetching
Bring data into cache before it's needed:
```cpp
// Manual prefetching with the GCC/Clang built-in
for (int i = 0; i < n; i++) {
    // Prefetch 64 elements ahead (0 = read, 1 = low temporal locality);
    // prefetching past the end of the array does not fault
    __builtin_prefetch(&arr[i + 64], 0, 1);
    process(arr[i]);
}
```
Access Patterns Matter
Sequential vs Random Access
```cpp
// Good: Sequential access (cache-friendly)
for (int i = 0; i < n; i++) {
    sum += arr[i];
}

// Bad: Random access (cache-unfriendly)
for (int i = 0; i < n; i++) {
    sum += arr[random_indices[i]];
}
```
Row-Major vs Column-Major
```cpp
// C++ stores 2D arrays in row-major order: matrix[i][j] and matrix[i][j+1] are adjacent

// Good: Access along rows (unit stride)
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        process(matrix[i][j]);
    }
}

// Bad: Access along columns (stride = row size)
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        process(matrix[i][j]);
    }
}
```
Profiling & Measurement
Tools
- perf (Linux): Hardware performance counters
- VTune (Intel): Detailed CPU analysis
- gprof: Function-level profiling
- cachegrind: Cache simulation
Key Metrics
- IPC (Instructions Per Cycle): Efficiency measure
- Cache Miss Rate: L1/L2/L3 miss percentages
- Branch Misprediction Rate: Pipeline stall frequency
- Memory Bandwidth Utilization: GB/s used vs available
Profiling Commands
```bash
# Basic perf stat
perf stat ./program

# Detailed cache analysis
perf stat -e cache-references,cache-misses ./program

# Record and analyze hot spots
perf record ./program
perf report
```
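The key metrics above map directly onto hardware events. Exact event names vary by kernel and CPU, so treat these invocations as illustrative:

```bash
# IPC and branch misprediction rate (perf prints the derived ratios)
perf stat -e cycles,instructions,branches,branch-misses ./program

# L1 data cache and last-level cache miss rates
perf stat -e L1-dcache-loads,L1-dcache-load-misses,LLC-loads,LLC-load-misses ./program
```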
Common Pitfalls
1. False Sharing
Multiple threads modifying different data in same cache line:
```cpp
// Bad: False sharing
struct Counters {
    int thread1_count;  // same cache line
    int thread2_count;  // a write by either thread invalidates the other's copy
};

// Good: Padding so each counter gets its own cache line
// (C++17's std::hardware_destructive_interference_size in <new> gives a portable value)
struct PaddedCounters {
    alignas(64) int thread1_count;  // own cache line
    alignas(64) int thread2_count;  // own cache line
};
```
2. Aliasing
Compiler can't optimize when pointers might alias:
```cpp
// Use the restrict keyword (C99) or the __restrict extension (GCC/Clang/MSVC C++)
void add(const float* __restrict a, const float* __restrict b,
         float* __restrict c, int n) {
    for (int i = 0; i < n; i++) {
        c[i] = a[i] + b[i];  // compiler knows the arrays don't overlap
    }
}
```
3. Premature Optimization
"Premature optimization is the root of all evil" - Donald Knuth
Best Practices:
- Profile first, optimize second
- Focus on algorithmic improvements
- Optimize hot paths (80/20 rule)
- Maintain readable code
Architecture-Specific Optimizations
x86-64 Specifics
- 16 general-purpose registers
- SIMD: SSE (128-bit), AVX (256-bit), AVX-512 (512-bit)
- Out-of-order execution window: hundreds of in-flight instructions (reorder buffers of roughly 200-600 entries on recent cores)
- Branch prediction: typically better than 95% accuracy on ordinary code
ARM Specifics
- 31 general-purpose registers (AArch64)
- NEON SIMD instructions
- Energy-efficient design
- Weaker memory ordering
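When the same kernel has to run on both architectures, a common pattern is to guard the intrinsics with predefined macros and keep a scalar fallback. This is only a sketch: __AVX2__ is defined by GCC/Clang when compiling with -mavx2 (or -march=native on AVX2 hardware), and __ARM_NEON is defined on AArch64 targets.

```cpp
#include <cstddef>

#if defined(__AVX2__)
  #include <immintrin.h>
#elif defined(__ARM_NEON)
  #include <arm_neon.h>
#endif

void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX2__)
    for (; i + 8 <= n; i += 8) {                       // 8 floats per AVX register
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#elif defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {                       // 4 floats per NEON register
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        vst1q_f32(out + i, vaddq_f32(va, vb));
    }
#endif
    for (; i < n; ++i)                                 // scalar tail / fallback
        out[i] = a[i] + b[i];
}
```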
Benchmark Results
Representative speedups from these optimizations (highly workload- and hardware-dependent):
| Technique | Typical Speedup | Best Case |
|---|---|---|
| Loop Unrolling | 1.5-3x | 5x |
| Cache Blocking | 2-5x | 10x |
| SIMD | 2-8x | 16x |
| Prefetching | 1.2-2x | 3x |
| All Combined | 5-20x | 100x+ |
Conclusion
Performance optimization requires understanding:
- Hardware architecture
- Memory hierarchy
- Compiler capabilities
- Profiling tools
- Trade-offs between optimizations
Remember: Measure, don't guess!
