Memory Access Patterns: Sequential vs Strided
Understand how different memory access patterns impact cache performance, prefetcher efficiency, and overall application speed through interactive visualizations.
Memory access patterns are among the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help.
Key Insight: The difference between optimal and suboptimal patterns can be 10x or more in performance!
Interactive Visualization
Experience the dramatic performance difference between sequential and strided memory access patterns:
[Interactive visualization: a CPU read request ("CPU requests address 0") flows through the L1 cache, the hardware prefetcher, and main memory (DRAM). For sequential access the demo reports excellent performance: ~87.5% cache hit rate (7 hits per 8 accesses), ~100% memory bandwidth utilization (all loaded bytes used), and a highly effective prefetcher that loads ahead predictively.]
Why Access Patterns Matter
The Memory Hierarchy Gap
Modern computers have a multi-level memory hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | 1-4 cycles | 3+ TB/s |
| L2 Cache | 256-512 KB | 10-20 cycles | 1+ TB/s |
| L3 Cache | 8-32 MB | 30-70 cycles | 500+ GB/s |
| Main Memory | 8-64 GB | 100-300 cycles | 50-100 GB/s |
The Gap: A hit in L1 cache is roughly 100x faster than a trip to main memory!
Cache Lines: The Unit of Transfer
- Memory is transferred between levels in 64-byte cache lines (the common size on modern x86 and ARM CPUs)
- Loading a single byte pulls in the entire 64-byte line
- Spatial locality determines whether those 64 bytes are actually useful, as the sketch below illustrates
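To make the arithmetic concrete, here is a minimal C sketch of how a byte address maps to a cache line (64 bytes is an assumption here; the real size varies by CPU):

```c
#include <stdio.h>

#define LINE_BYTES 64  /* assumed line size; varies by CPU */

int main(void) {
    /* addr / 64 gives the cache line index: bytes 0..63 share line 0,
       byte 64 starts line 1, and touching any byte loads all 64. */
    unsigned addrs[] = {0, 1, 63, 64, 65, 128};
    for (unsigned i = 0; i < sizeof addrs / sizeof addrs[0]; i++)
        printf("address %3u -> cache line %u\n",
               addrs[i], addrs[i] / LINE_BYTES);
    return 0;
}
```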
Sequential vs Strided Access
Sequential Access (Optimal)
Pattern: Access consecutive memory locations
Why It's Fast:
- Spatial Locality: Uses all 64 bytes loaded per cache line
- Cache Hit Rate: ~87.5% (7 hits per 8 accesses)
- Prefetcher Friendly: Hardware predicts and loads ahead
- Bandwidth Efficient: Every byte transferred is used (see the sketch below)
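A minimal sketch of a sequential traversal in C (the function and array are illustrative, assuming 8-byte doubles and 64-byte lines):

```c
#include <stddef.h>

/* Visits a[0], a[1], a[2], ... With 8-byte doubles and 64-byte lines,
   one miss loads 8 elements, so the next 7 accesses hit in L1:
   7 hits per 8 accesses = ~87.5% hit rate. */
double sum_sequential(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}
```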
Strided Access (Suboptimal)
Pattern: Jump through memory with a fixed stride
Why It's Slow:
- Poor Locality: Loads 64 bytes, uses only a few
- Cache Thrashing: Fills cache with mostly unused data
- Prefetcher Confusion: Pattern defeats prefetching
- Bandwidth Waste: Can waste 87.5% of transferred data at stride-8 (see the sketch below)
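The strided counterpart (same assumptions; a stride of 8 doubles equals one 64-byte line, so every access misses):

```c
#include <stddef.h>

/* Visits a[0], a[stride], a[2*stride], ... At stride 8 every access
   lands in a new cache line: 64 bytes loaded, 8 used. Even if the
   prefetcher tracks the stride, 87.5% of each line is wasted. */
double sum_strided(const double *a, size_t n, size_t stride) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i += stride)
        sum += a[i];
    return sum;
}
```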
Common Patterns
Matrix Operations
- Row-Major (Good): sequential in memory
- Column-Major (Poor): strided by the row width (both traversals are sketched below)
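In C, a 2D array is stored row-major, so loop order decides the access pattern. A sketch (N is an illustrative dimension):

```c
#define N 1024  /* illustrative dimension */

/* Good: the inner loop walks along a row -> stride-1, sequential. */
double sum_rows_first(const double m[N][N]) {
    double sum = 0.0;
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            sum += m[i][j];
    return sum;
}

/* Poor: the inner loop walks down a column -> stride of N * 8 bytes,
   touching a new cache line on every single access. */
double sum_cols_first(const double m[N][N]) {
    double sum = 0.0;
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            sum += m[i][j];
    return sum;
}
```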
Data Structure Layout
- Array of Structs (AoS): strided when accessing single fields
- Struct of Arrays (SoA): sequential for single-field access (both layouts are sketched below)
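A sketch of the two layouts (the particle fields are illustrative):

```c
#include <stddef.h>

/* AoS: one particle's fields are adjacent. Scanning only x strides
   over y, z, and mass: 8 useful bytes per 32-byte struct, so only
   25% of each loaded cache line is used. */
struct ParticleAoS { double x, y, z, mass; };

double sum_x_aos(const struct ParticleAoS *p, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += p[i].x;              /* 32-byte stride */
    return sum;
}

/* SoA: each field is its own contiguous array, so scanning x alone
   is purely sequential and uses every byte it loads. */
struct ParticlesSoA { double *x, *y, *z, *mass; size_t n; };

double sum_x_soa(const struct ParticlesSoA *p) {
    double sum = 0.0;
    for (size_t i = 0; i < p->n; i++)
        sum += p->x[i];             /* 8-byte stride */
    return sum;
}
```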
Hardware Prefetching
Modern CPUs include sophisticated prefetchers:
What They Do:
- Detect access patterns (sequential, stride, stream)
- Load data into cache before it's needed
- Multiple prefetch units (L1, L2, L3)
- Adaptive learning of patterns
Prefetcher-Friendly:
- Sequential access
- Fixed stride (if not too large)
- Stream processing
- Linear traversal
Prefetcher-Unfriendly:
- Random access
- Large irregular strides
- Pointer chasing (sketched below)
- Hash table lookups
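Pointer chasing is the worst case because the next address is unknown until the current load completes, leaving the prefetcher nothing to predict. A minimal sketch (the node layout is illustrative):

```c
#include <stddef.h>

struct Node {
    long         value;
    struct Node *next;  /* address unknown until this node is loaded */
};

/* Each load depends on the previous one, so misses cannot overlap and
   every one can pay full main-memory latency. */
long sum_list(const struct Node *head) {
    long sum = 0;
    for (const struct Node *n = head; n != NULL; n = n->next)
        sum += n->value;
    return sum;
}
```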
Optimization Strategies
Data Structure Design
- Use contiguous arrays when possible
- Consider SoA for partial field access
- Align critical data to cache line boundaries (see the sketch below)
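One way to pin data to a line boundary is C11 `alignas` (a minimal sketch; the struct is illustrative):

```c
#include <stdalign.h>  /* alignas (C11) */

/* A 64-byte-aligned first member forces 64-byte alignment and size for
   the whole struct, so each counter owns a full cache line. This also
   prevents false sharing when different threads update adjacent slots. */
struct PerThreadCounter {
    alignas(64) long value;
};
```

Heap allocations can get the same guarantee with C11's `aligned_alloc(64, size)` (the size must be a multiple of the alignment).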
Algorithm Design
- Process data in cache-friendly order
- Block/tile algorithms for matrices
- Minimize working set size
Loop Optimization
- Interchange loops for sequential access
- Tile/block for cache locality (sketched below)
- Consider manual prefetch for irregular patterns
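A sketch of tiling applied to a matrix transpose, where a naive loop is inherently strided on one side (N and TILE are illustrative and should be tuned to cache size):

```c
#define N    1024
#define TILE 32  /* illustrative; pick so a TILE x TILE block fits in cache */

/* A naive transpose reads src sequentially but writes dst with an
   N*8-byte stride (or vice versa). Tiling keeps a small block of both
   matrices resident in cache, so the strided side still hits. */
void transpose_tiled(double dst[N][N], const double src[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}
```

For irregular patterns where the next address is computable a few iterations ahead, GCC and Clang provide `__builtin_prefetch(addr)` to issue a software prefetch hint.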
Measuring Performance
Key Metrics
- Cache Hit Rate: Hits / Total Accesses × 100
- Memory Bandwidth: Bytes transferred per second
- Cache Line Utilization: Useful bytes / 64 bytes
- Prefetch Accuracy: Useful prefetches / Total prefetches
Tools
- Linux perf: `perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program`
- Intel VTune: `vtune -collect memory-access ./program`
Best Practices
- Design for Sequential Access: Arrange data structures for linear traversal
- Minimize Stride: Keep related data close together
- Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
- Profile Real Workloads: Memory patterns vary by input
- Consider NUMA Effects: On NUMA systems, the same pattern can be fast on local memory and slow on a remote node
Related Concepts
- CPU Cache Lines: How cache lines work
- Memory Interleaving: Address mapping to banks
- NUMA Architecture: Access patterns across nodes
- Virtual Memory: Page-level access patterns
Conclusion
Memory access patterns can make or break performance. Sequential access leverages spatial locality, cache line transfers, and hardware prefetching for maximum performance. Strided access wastes bandwidth, thrashes caches, and defeats optimization. Understanding these patterns through visual exploration enables 10x+ performance improvements without algorithmic changes.