Memory Access Patterns: Sequential vs Strided
Understand how different memory access patterns impact cache performance, prefetcher efficiency, and overall application speed through interactive visualizations.
Understanding Memory Access Patterns
Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help accelerate your program.
Modern processors are designed to exploit predictable memory access patterns; as a result, the gap between an optimal and a suboptimal pattern is often 10x or more in real-world performance.
Interactive Memory Access Pattern Demo
Experience the dramatic performance difference between sequential and strided memory access patterns:
Sequential Access Pattern

```c
for (int i = 0; i < N; i++) {
    sum += array[i];
}
```

✓ Optimal cache utilization
✓ Prefetcher friendly
✓ Maximum memory bandwidth

Strided Access Pattern

```c
for (int i = 0; i < N; i += stride) {
    sum += array[i];
}
```

✗ Poor cache utilization
✗ Confuses prefetcher
✗ Wastes memory bandwidth
[Interactive visualization: each pattern's access sequence animated through the L1 cache (64 KB, 64-byte lines), the hardware prefetcher (sequential: "Pattern detected!"; strided: "Pattern unclear..."), and DDR4 main memory, with live readouts of sequential vs. strided hit rate and bandwidth.]
Key Insights:
- Sequential access maximizes spatial locality - when you access element N, you'll likely need N+1 soon
- Cache lines are 64 bytes, holding 8 doubles or 16 integers
- Strided access wastes cache space by loading full lines but using only one element
- Hardware prefetchers detect sequential patterns and load data before it's needed
- Large strides can cause cache thrashing where useful data is evicted before use
Why Access Patterns Matter
The Memory Hierarchy
Modern computers have a multi-level memory hierarchy:
| Level | Size | Latency | Bandwidth |
|---|---|---|---|
| L1 Cache | 32-64 KB | 1-4 cycles | 3+ TB/s |
| L2 Cache | 256-512 KB | 10-20 cycles | 1+ TB/s |
| L3 Cache | 8-32 MB | 30-70 cycles | 500+ GB/s |
| Main Memory | 8-64 GB | 100-300 cycles | 50-100 GB/s |
The key insight: accessing data from cache is 100x faster than accessing main memory!
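A standard back-of-the-envelope model ties this table to the hit rates discussed below: average memory access time (AMAT) = hit time + miss rate × miss penalty. With an illustrative 4-cycle L1 hit and 200-cycle miss penalty, an 87.5% hit rate averages 4 + 0.125 × 200 = 29 cycles per access, while a 12.5% hit rate averages 4 + 0.875 × 200 = 179 cycles - a 6x latency gap before bandwidth waste and prefetching effects widen it further.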
Cache Lines: The Unit of Transfer
- Memory is transferred in cache lines (typically 64 bytes)
- Loading one byte loads the entire 64-byte line
- This is why spatial locality matters so much
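The arithmetic is easy to check in code. A minimal sketch, assuming a typical 64-byte line (line size varies by CPU):

```c
#include <stdio.h>

int main(void) {
    // Assumption: 64-byte cache lines, 8-byte doubles.
    int line_bytes = 64;
    int elem_bytes = (int)sizeof(double);     // 8 on most platforms
    int per_line   = line_bytes / elem_bytes; // 8 doubles per line

    // A sequential scan misses once per line, then hits on the rest.
    printf("doubles per line:  %d\n", per_line);
    printf("expected hit rate: %.1f%%\n",
           100.0 * (per_line - 1) / per_line); // 87.5%
    return 0;
}
```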
Sequential Access Pattern
```c
// Optimal pattern - accesses memory consecutively
for (int i = 0; i < N; i++) {
    sum += array[i];
}
```
Why Sequential Access is Fast
- Spatial Locality: After accessing `array[i]`, you'll likely need `array[i+1]` next
- Cache Line Utilization: Uses all 64 bytes loaded into cache
- Prefetcher Friendly: Hardware detects the pattern and loads ahead
- Minimal Cache Pollution: Doesn't evict useful data unnecessarily
Performance Characteristics
- Cache Hit Rate: 87.5% (7 hits per 8 accesses)
- Memory Bandwidth: Fully utilized
- Prefetcher: Highly effective
Strided Access Pattern
```c
// Suboptimal pattern - jumps through memory
for (int i = 0; i < N; i += stride) {
    sum += array[i];
}
```
Why Strided Access is Slow
- Poor Spatial Locality: Loads 64 bytes but uses only 4-8 bytes
- Cache Thrashing: Limited cache filled with mostly unused data
- Prefetcher Confusion: Unpredictable pattern defeats prefetching
- Bandwidth Waste: Transfers much more data than needed
Performance Characteristics
- Cache Hit Rate: ~12.5% or worse
- Memory Bandwidth: Poorly utilized
- Prefetcher: Ineffective or counterproductive
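You can reproduce the gap with a minimal benchmark sketch. N and STRIDE are illustrative assumptions (a stride of 16 ints is exactly one 64-byte line per access), and it uses POSIX clock_gettime; compile with optimizations (e.g. -O2) and note that results vary with cache sizes and whether the compiler vectorizes the sequential loop:

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N      (64 * 1024 * 1024)  // 64M ints (~256 MB), far larger than L3
#define STRIDE 16                  // 16 ints = 64 bytes = one line per access

static double now_sec(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    if (!a) return 1;
    for (long i = 0; i < N; i++) a[i] = 1;  // touch every page up front

    long long sum = 0;

    double t0 = now_sec();
    for (long i = 0; i < N; i++)            // sequential: ~1 miss per 16 accesses
        sum += a[i];
    double t1 = now_sec();

    for (long i = 0; i < N; i += STRIDE)    // strided: ~1 miss per access
        sum += a[i];
    double t2 = now_sec();

    // The strided loop does 1/16 of the additions yet often takes
    // comparable wall time: both loops transfer the same cache lines,
    // but the strided one pays a miss on nearly every access.
    printf("sequential: %.3f s   strided: %.3f s   (sum=%lld)\n",
           t1 - t0, t2 - t1, sum);
    free(a);
    return 0;
}
```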
Real-World Examples
Matrix Operations
Row-Major Access (Good)
```c
// Accessing matrix row by row
for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
        sum += matrix[i][j];  // Sequential in memory
    }
}
```
Column-Major Access (Poor)
```c
// Accessing matrix column by column
for (int j = 0; j < cols; j++) {
    for (int i = 0; i < rows; i++) {
        sum += matrix[i][j];  // Strided by row size
    }
}
```
Struct of Arrays vs Array of Structs
Array of Structs (AoS) - Poor for single field
```c
struct Particle {
    float x, y, z;
    float vx, vy, vz;
    float mass;
    int type;
};
Particle particles[N];

// Accessing only x coordinates - stride of 32 bytes
for (int i = 0; i < N; i++) {
    avgX += particles[i].x;
}
```
Struct of Arrays (SoA) - Good for single field
```c
struct Particles {
    float x[N], y[N], z[N];
    float vx[N], vy[N], vz[N];
    float mass[N];
    int type[N];
};
Particles particles;

// Accessing x coordinates - sequential
for (int i = 0; i < N; i++) {
    avgX += particles.x[i];
}
```
Hardware Prefetching
Modern CPUs include sophisticated prefetchers that:
- Detect Patterns: Sequential, stride, stream patterns
- Load Ahead: Bring data into cache before it's needed
- Multiple Prefetchers: separate prefetch units at multiple cache levels (typically L1 and L2)
- Adaptive: Learn and adjust to access patterns
Prefetcher-Friendly Patterns
- Sequential access
- Fixed stride (if not too large)
- Stream processing
- Linear array traversal
Prefetcher-Unfriendly Patterns
- Random access
- Large irregular strides
- Pointer chasing (see the sketch after this list)
- Hash table lookups
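Pointer chasing is the worst case because the next address is itself data: the CPU cannot compute where node i+1 lives until node i has arrived from memory. A minimal sketch contrasting the two traversals (sum_array and sum_list are illustrative names, not library functions):

```c
#include <stddef.h>

struct Node {
    struct Node *next;
    long value;
};

// Array traversal: addresses are computable in advance, so the
// prefetcher can stream lines in ahead of the loop.
long sum_array(const long *a, size_t n) {
    long sum = 0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

// List traversal: each load depends on the previous one (a dependent
// load chain); if nodes are scattered across the heap, every step can
// cost a full round trip to memory that no prefetcher can hide.
long sum_list(const struct Node *head) {
    long sum = 0;
    for (const struct Node *p = head; p != NULL; p = p->next)
        sum += p->value;
    return sum;
}
```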
Optimization Techniques
1. Data Structure Layout
- Use contiguous arrays when possible
- Consider SoA for partial field access
- Align data to cache line boundaries
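For the alignment point, a minimal C11 sketch; the 64-byte CACHE_LINE constant is an assumption (check your CPU), and aligned_alloc requires the allocation size to be a multiple of the alignment:

```c
#include <stdlib.h>

#define CACHE_LINE 64  // assumed line size; common on x86, not universal

// Giving a hot field its own line also prevents false sharing when
// adjacent counters are updated by different threads.
struct Counter {
    _Alignas(CACHE_LINE) long count;
};

int main(void) {
    // C11 aligned_alloc: the size must be a multiple of the alignment.
    double *buf = aligned_alloc(CACHE_LINE, 1024 * sizeof(double));
    if (!buf) return 1;
    // buf now starts exactly on a cache-line boundary.
    free(buf);
    return 0;
}
```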
2. Algorithm Design
- Process data in cache-friendly order
- Block/tile algorithms for matrices (see the sketch after this list)
- Minimize working set size
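As a concrete example of blocking, here is a minimal sketch of a tiled matrix transpose; N and TILE are illustrative choices (a 32 × 32 tile of doubles is 8 KB, small enough to sit comfortably in a typical 32 KB L1):

```c
#define N    4096
#define TILE 32  // tune for your cache sizes

// Naive transpose: B is written sequentially, but A is read with a
// stride of N doubles, so nearly every read can miss.
void transpose_naive(double B[N][N], const double A[N][N]) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            B[i][j] = A[j][i];
}

// Tiled transpose: work on TILE x TILE blocks so the strided reads
// stay inside a block that fits in cache and lines are fully reused.
void transpose_tiled(double B[N][N], const double A[N][N]) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    B[i][j] = A[j][i];
}
```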
3. Loop Transformations
```c
// Loop interchange

// Poor
for (j = 0; j < M; j++)
    for (i = 0; i < N; i++)
        A[i][j] = B[i][j];

// Better
for (i = 0; i < N; i++)
    for (j = 0; j < M; j++)
        A[i][j] = B[i][j];
```
4. Prefetch Instructions
```c
// Manual prefetching for irregular (indexed) access patterns
for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // guard: don't read index[] past the end
        __builtin_prefetch(&data[index[i + 8]], 0, 1);
    process(data[index[i]]);
}
```
Measuring Access Patterns
Performance Counters
```bash
# Linux perf
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Intel VTune
vtune -collect memory-access ./program
```
Key Metrics
- Cache Hit Rate: (Hits / Total Accesses) × 100 (see the worked example after this list)
- Memory Bandwidth: Bytes transferred per second
- Cache Line Utilization: Useful bytes / 64 bytes
- Prefetch Accuracy: Useful prefetches / Total prefetches
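For example, if perf reports 1,000,000 L1-dcache-loads and 125,000 L1-dcache-load-misses (illustrative numbers), the hit rate is (1,000,000 - 125,000) / 1,000,000 × 100 = 87.5%, matching the sequential-scan figure above.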
Best Practices
- Favor Sequential Access: Design data structures for linear traversal
- Minimize Stride: Keep related data close together
- Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
- Profile Real Workloads: Memory patterns vary by input
- Consider NUMA Effects: Memory access patterns affect NUMA systems differently
Related Concepts
Understanding memory access patterns connects to:
- Memory Interleaving: How addresses map to banks
- NUMA Architecture: Access patterns across NUMA nodes
- Cache Coherency: How shared data affects access patterns
- Memory Bandwidth: Theoretical vs achieved bandwidth
- Vectorization: SIMD instructions and memory patterns
Conclusion
Memory access patterns can make or break application performance. Sequential access leverages spatial locality, cache line transfers, and hardware prefetching to achieve maximum performance. Strided access wastes bandwidth, thrashes caches, and defeats optimization. By understanding and optimizing access patterns, you can often achieve 10x or greater performance improvements without changing your algorithm's complexity.