Memory Access Patterns: Sequential vs Strided
Understand how different memory access patterns impact cache performance, prefetcher efficiency, and overall application speed through interactive visualizations.
Understanding Memory Access Patterns
Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help accelerate your program.
Modern processors are designed to exploit predictable memory access patterns; as a result, the gap between an optimal and a suboptimal pattern is often 10x or more in real-world performance.
Interactive Memory Access Pattern Demo
Experience the dramatic performance difference between sequential and strided memory access patterns:
Sequential Access Pattern

```c
for (int i = 0; i < N; i++) {
    sum += array[i];
}
```

✓ Optimal cache utilization
✓ Prefetcher friendly
✓ Maximum memory bandwidth

Strided Access Pattern

```c
for (int i = 0; i < N; i += stride) {
    sum += array[i];
}
```

✗ Poor cache utilization
✗ Confuses prefetcher
✗ Wastes memory bandwidth
[Interactive visualization: each pattern's access sequence animated through the L1 cache (64 KB, 64-byte lines), the hardware prefetcher (sequential: "Pattern detected!"; strided: "Pattern unclear..."), and DDR4 main memory, with live readouts of sequential vs. strided hit rate and bandwidth.]
Key Insights:
- Sequential access maximizes spatial locality - when you access element N, you'll likely need N+1 soon
- Cache lines are 64 bytes, holding 8 doubles or 16 integers
- Strided access wastes cache space by loading full lines but using only one element
- Hardware prefetchers detect sequential patterns and load data before it's needed
- Large strides can cause cache thrashing where useful data is evicted before use
Why Access Patterns Matter
The Memory Hierarchy
Modern computers have a multi-level memory hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | 1-4 cycles | 3+ TB/s |
| L2 Cache | 256-512 KB | 10-20 cycles | 1+ TB/s |
| L3 Cache | 8-32 MB | 30-70 cycles | 500+ GB/s |
| Main Memory | 8-64 GB | 100-300 cycles | 50-100 GB/s |
The key insight: accessing data from cache is 100x faster than accessing main memory!
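A standard back-of-the-envelope model ties this table to the hit rates discussed below: average memory access time (AMAT) = hit time + miss rate × miss penalty. With an illustrative 4-cycle L1 hit and 200-cycle miss penalty, an 87.5% hit rate averages 4 + 0.125 × 200 = 29 cycles per access, while a 12.5% hit rate averages 4 + 0.875 × 200 = 179 cycles - a 6x latency gap before bandwidth waste and prefetching effects widen it further.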
Cache Lines: The Unit of Transfer
- Memory is transferred in cache lines (typically 64 bytes)
- Loading one byte loads the entire 64-byte line
- This is why spatial locality matters so much
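The arithmetic is easy to check in code. A minimal sketch, assuming a typical 64-byte line (line size varies by CPU):

```c
#include <stdio.h>

int main(void) {
    // Assumption: 64-byte cache lines, 8-byte doubles.
    int line_bytes = 64;
    int elem_bytes = (int)sizeof(double);     // 8 on most platforms
    int per_line   = line_bytes / elem_bytes; // 8 doubles per line

    // A sequential scan misses once per line, then hits on the rest.
    printf("doubles per line:  %d\n", per_line);
    printf("expected hit rate: %.1f%%\n",
           100.0 * (per_line - 1) / per_line); // 87.5%
    return 0;
}
```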
Sequential Access Pattern
```c
// Optimal pattern - accesses memory consecutively
for (int i = 0; i < N; i++) {
    sum += array[i];
}
```
Why Sequential Access is Fast
- Spatial Locality: After accessing `array[i]`, you'll likely need `array[i+1]` next
- Cache Line Utilization: Uses all 64 bytes loaded into cache
- Prefetcher Friendly: Hardware detects the pattern and loads ahead
- Minimal Cache Pollution: Doesn't evict useful data unnecessarily
Performance Characteristics
- Cache Hit Rate: 87.5% (7 hits per 8 accesses)
- Memory Bandwidth: Fully utilized
- Prefetcher: Highly effective
Strided Access Pattern
```c
// Suboptimal pattern - jumps through memory
for (int i = 0; i < N; i += stride) {
    sum += array[i];
}
```
Why Strided Access is Slow
- Poor Spatial Locality: Loads 64 bytes but uses only 4-8 bytes
- Cache Thrashing: Limited cache filled with mostly unused data
- Prefetcher Confusion: Unpredictable pattern defeats prefetching
- Bandwidth Waste: Transfers much more data than needed
Performance Characteristics
- Cache Hit Rate: ~12.5% or worse
- Memory Bandwidth: Poorly utilized
- Prefetcher: Ineffective or counterproductive
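You can reproduce the gap with a minimal benchmark sketch. N and STRIDE are illustrative assumptions (a stride of 16 ints is exactly one 64-byte line per access), and it uses POSIX clock_gettime; compile with optimizations (e.g. -O2) and note that results vary with cache sizes and whether the compiler vectorizes the sequential loop:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (64 * 1024 * 1024)  // 64M ints (~256 MB), far larger than L3
#define STRIDE 16                  // 16 ints = 64 bytes = one line per access

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;  // touch every page up front

    long long sum = 0;

    double t0 = now_sec();
    for (long i = 0; i < N; i++)            // sequential: ~1 miss per 16 accesses
        sum += a[i];
    double t1 = now_sec();

    for (long i = 0; i < N; i += STRIDE)    // strided: ~1 miss per access
        sum += a[i];
    double t2 = now_sec();

    // The strided loop does 1/16 of the additions yet often takes
    // comparable wall time: both loops transfer the same cache lines,
    // but the strided one pays a miss on nearly every access.
    printf("sequential: %.3f s   strided: %.3f s   (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}
```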
Real-World Examples
Matrix Operations
Row-Major Access (Good)
```c
// Accessing matrix row by row
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j];  // Sequential in memory
    }
}
```
Column-Major Access (Poor)
```c
// Accessing matrix column by column
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j];  // Strided by row size
    }
}
```
Struct of Arrays vs Array of Structs
Array of Structs (AoS) - Poor for single field
```c
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int type;
};
Particle particles[N];

// Accessing only x coordinates - stride of 32 bytes
for (int i = 0; i < N; i++) {
    avgX += particles[i].x;
}
```
Struct of Arrays (SoA) - Good for single field
```c
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    int type[N];
};
Particles particles;

// Accessing x coordinates - sequential
for (int i = 0; i < N; i++) {
    avgX += particles.x[i];
}
```
Hardware Prefetching
Modern CPUs include sophisticated prefetchers that:
- Detect Patterns: Sequential, stride, stream patterns
- Load Ahead: Bring data into cache before it's needed
- Multiple Prefetchers: separate prefetch units at multiple cache levels (typically L1 and L2)
- Adaptive: Learn and adjust to access patterns
Prefetcher-Friendly Patterns
- Sequential access
- Fixed stride (if not too large)
- Stream processing
- Linear array traversal
Prefetcher-Unfriendly Patterns
- Random access
- Large irregular strides
- Pointer chasing (see the sketch after this list)
- Hash table lookups
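Pointer chasing is the worst case because the next address is itself data: the CPU cannot compute where node i+1 lives until node i has arrived from memory. A minimal sketch contrasting the two traversals (sum_array and sum_list are illustrative names, not library functions):

```c
#include <stddef.h>

struct Node {
    struct Node *next;
    long value;
};

// Array traversal: addresses are computable in advance, so the
// prefetcher can stream lines in ahead of the loop.
long sum_array(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// List traversal: each load depends on the previous one (a dependent
// load chain); if nodes are scattered across the heap, every step can
// cost a full round trip to memory that no prefetcher can hide.
long sum_list(const struct Node *head) {
    long sum = 0;
    for (const struct Node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```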
Optimization Techniques
1. Data Structure Layout
- Use contiguous arrays when possible
- Consider SoA for partial field access
- Align data to cache line boundaries
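For the alignment point, a minimal C11 sketch; the 64-byte CACHE_LINE constant is an assumption (check your CPU), and aligned_alloc requires the allocation size to be a multiple of the alignment:

```c
#include <stdlib.h>

#define CACHE_LINE 64  // assumed line size; common on x86, not universal

// Giving a hot field its own line also prevents false sharing when
// adjacent counters are updated by different threads.
struct Counter {
    _Alignas(CACHE_LINE) long count;
};

int main(void) {
    // C11 aligned_alloc: the size must be a multiple of the alignment.
    double *buf = aligned_alloc(CACHE_LINE, 1024 * sizeof(double));
    if (!buf) return 1;
    // buf now starts exactly on a cache-line boundary.
    free(buf);
    return 0;
}
```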
2. Algorithm Design
- Process data in cache-friendly order
- Block/tile algorithms for matrices (see the sketch after this list)
- Minimize working set size
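As a concrete example of blocking, here is a minimal sketch of a tiled matrix transpose; N and TILE are illustrative choices (a 32 × 32 tile of doubles is 8 KB, small enough to sit comfortably in a typical 32 KB L1):

```c
#define N    4096
#define TILE 32  // tune for your cache sizes

// Naive transpose: B is written sequentially, but A is read with a
// stride of N doubles, so nearly every read can miss.
void transpose_naive(double B[N][N], const double A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = A[j][i];
}

// Tiled transpose: work on TILE x TILE blocks so the strided reads
// stay inside a block that fits in cache and lines are fully reused.
void transpose_tiled(double B[N][N], const double A[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    B[i][j] = A[j][i];
}
```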
3. Loop Transformations
```c
// Loop interchange

// Poor
for (j = 0; j < M; j++)
    for (i = 0; i < N; i++)
        A[i][j] = B[i][j];

// Better
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        A[i][j] = B[i][j];
```
4. Prefetch Instructions
```c
// Manual prefetching for irregular (indexed) access patterns
for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // guard: don't read index[] past the end
        __builtin_prefetch(&data[index[i + 8]], 0, 1);
    process(data[index[i]]);
}
```
Measuring Access Patterns
Performance Counters
```bash
# Linux perf
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Intel VTune
vtune -collect memory-access ./program
```
Key Metrics
- Cache Hit Rate: (Hits / Total Accesses) × 100 (see the worked example after this list)
- Memory Bandwidth: Bytes transferred per second
- Cache Line Utilization: Useful bytes / 64 bytes
- Prefetch Accuracy: Useful prefetches / Total prefetches
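For example, if perf reports 1,000,000 L1-dcache-loads and 125,000 L1-dcache-load-misses (illustrative numbers), the hit rate is (1,000,000 - 125,000) / 1,000,000 × 100 = 87.5%, matching the sequential-scan figure above.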
Best Practices
- Favor Sequential Access: Design data structures for linear traversal
- Minimize Stride: Keep related data close together
- Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
- Profile Real Workloads: Memory patterns vary by input
- Consider NUMA Effects: Memory access patterns affect NUMA systems differently
Related Concepts
Understanding memory access patterns connects to:
- Memory Interleaving: How addresses map to banks
- NUMA Architecture: Access patterns across NUMA nodes
- Cache Coherency: How shared data affects access patterns
- Memory Bandwidth: Theoretical vs achieved bandwidth
- Vectorization: SIMD instructions and memory patterns
Conclusion
Memory access patterns can make or break application performance. Sequential access leverages spatial locality, cache line transfers, and hardware prefetching to achieve maximum performance. Strided access wastes bandwidth, thrashes caches, and defeats optimization. By understanding and optimizing access patterns, you can often achieve 10x or greater performance improvements without changing your algorithm's complexity.