Memory Access Patterns: Sequential vs Strided

Understand how different memory access patterns impact cache performance, prefetcher efficiency, and overall application speed through interactive visualizations.

Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help.

Key Insight: The performance difference between optimal and suboptimal access patterns can be 10x or more!

Interactive Visualization

Experience the dramatic performance difference between sequential and strided memory access patterns:

[Interactive visualization: a nine-step animation of a CPU (~3 GHz, 1 cycle ≈ 0.3 ns) requesting addresses through an L1 cache (32-64 KB, ~4 cycles latency), with a hardware prefetcher monitoring the access pattern and main memory (DRAM, ~100 cycles latency) supplying 64-byte cache lines. Memory is drawn as eight cache lines holding elements 0-63, each element marked "Accessed & Used" or "Loaded but Wasted". For the sequential pattern the summary reads: excellent performance, with a ~87.5% cache hit rate (7 hits per 8 accesses), ~100% memory bandwidth utilization (every loaded byte used), and a highly effective prefetcher loading ahead predictively.]

Why Access Patterns Matter

The Memory Hierarchy Gap

Modern computers have a multi-level memory hierarchy:

| Level       | Size       | Latency        | Bandwidth   |
|-------------|------------|----------------|-------------|
| L1 Cache    | 32-64 KB   | 1-4 cycles     | 3+ TB/s     |
| L2 Cache    | 256-512 KB | 10-20 cycles   | 1+ TB/s     |
| L3 Cache    | 8-32 MB    | 30-70 cycles   | 500+ GB/s   |
| Main Memory | 8-64 GB    | 100-300 cycles | 50-100 GB/s |

The Gap: An L1 cache hit can be ~100x faster than a trip to main memory!
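
A rough way to quantify the gap: average memory access time (AMAT) = hit time + miss rate × miss penalty. Assuming a 4-cycle L1 hit and a 100-cycle miss penalty, an ~87.5% hit rate costs 4 + 0.125 × 100 ≈ 16.5 cycles per access, while a pattern that misses every time pays 4 + 1.0 × 100 = 104 cycles, roughly 6x worse even on this simple two-level model.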

Cache Lines: The Unit of Transfer

  • Memory transfers in 64-byte cache lines
  • Loading one byte loads the entire 64-byte line
  • Spatial locality determines whether those 64 bytes are useful (the sketch below shows how addresses map to lines)
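
To make the mapping concrete, here is a minimal C sketch (assuming 64-byte lines, which is typical on x86-64 and most ARM cores) that rounds addresses down to the base of the cache line containing them:

```c
#include <stdio.h>
#include <stdint.h>

#define CACHE_LINE 64  /* bytes; typical on x86-64 and most ARM cores */

int main(void) {
    char buf[256];
    for (int i = 0; i < 256; i += 60) {
        uintptr_t addr = (uintptr_t)&buf[i];
        /* Round down to the start of the 64-byte line holding this byte. */
        uintptr_t line_base = addr & ~(uintptr_t)(CACHE_LINE - 1);
        printf("buf[%3d] -> line base %p (offset %2lu in line)\n",
               i, (void *)line_base, (unsigned long)(addr - line_base));
    }
    return 0;
}
```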

Sequential vs Strided Access

Sequential Access (Optimal)

Pattern: Access consecutive memory locations

Why It's Fast:

  • Spatial Locality: Uses all 64 bytes loaded per cache line
  • Cache Hit Rate: ~87.5% (7 hits per 8 accesses)
  • Prefetcher Friendly: Hardware predicts and loads ahead
  • Bandwidth Efficient: Every byte transferred is used

Strided Access (Suboptimal)

Pattern: Jump through memory with fixed stride

Why It's Slow:

  • Poor Locality: Loads 64 bytes, uses only a few
  • Cache Thrashing: Fills cache with mostly unused data
  • Prefetcher Confusion: Pattern defeats prefetching
  • Bandwidth Waste: A stride-8 pattern can waste 87.5% of the transferred data (contrast the two loops sketched below)
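
Here is a minimal C sketch contrasting the two patterns over the same array; with a stride of 8 doubles (64 bytes), every access in the strided version lands on a fresh cache line:

```c
#include <stddef.h>

/* Sequential: touches every element of one cache line before moving on. */
double sum_sequential(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];
    return s;
}

/* Strided: with stride 8 (8 doubles = 64 bytes), every access pulls in a
 * new cache line and uses only 8 of its 64 bytes. */
double sum_strided(const double *a, size_t n, size_t stride) {
    double s = 0.0;
    for (size_t start = 0; start < stride; start++)
        for (size_t i = start; i < n; i += stride)
            s += a[i];
    return s;
}
```

Both functions visit every element exactly once and return the same sum; only the visit order differs, which is what isolates the cost of the access pattern when the two loops are timed against each other.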

Common Patterns

Matrix Operations

  • Row-Major (Good): Sequential in memory
  • Column-Major (Poor): Strided by the row width
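
A minimal sketch of the two traversals over a row-major C array (N is an arbitrary illustrative size):

```c
#include <stddef.h>

enum { N = 1024 };

/* Row-major traversal: the inner loop walks consecutive addresses,
 * so each 64-byte cache line is fully consumed before the next load. */
double sum_row_major(const double m[N][N]) {
    double s = 0.0;
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            s += m[i][j];
    return s;
}

/* Column-major traversal of the same row-major array: consecutive inner
 * iterations are N*8 bytes apart, so every access misses to a new line. */
double sum_col_major(const double m[N][N]) {
    double s = 0.0;
    for (size_t j = 0; j < N; j++)
        for (size_t i = 0; i < N; i++)
            s += m[i][j];
    return s;
}
```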

Data Structure Layout

  • Array of Structs (AoS): Strided when accessing single fields
  • Struct of Arrays (SoA): Sequential for single field access
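
A sketch using a hypothetical particle type: reading only the x field strides by sizeof(struct) in the AoS layout, while the SoA layout turns the same loop into a sequential scan:

```c
#include <stddef.h>

/* Array of Structs: reading only .x skips 24 of every 32 bytes loaded. */
struct ParticleAoS { double x, y, z, mass; };

double sum_x_aos(const struct ParticleAoS *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p[i].x;  /* stride of sizeof(struct ParticleAoS) = 32 bytes */
    return s;
}

/* Struct of Arrays: all x values are contiguous, so the same loop
 * reads sequentially and uses every byte of every cache line. */
struct ParticlesSoA { double *x, *y, *z, *mass; };

double sum_x_soa(const struct ParticlesSoA *p, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += p->x[i];
    return s;
}
```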

Hardware Prefetching

Modern CPUs include sophisticated prefetchers:

What They Do:

  1. Detect access patterns (sequential, stride, stream)
  2. Load data into cache before it's needed
  3. Multiple prefetch units (L1, L2, L3)
  4. Adaptive learning of patterns

Prefetcher-Friendly:

  • Sequential access
  • Fixed stride (if not too large)
  • Stream processing
  • Linear traversal

Prefetcher-Unfriendly:

  • Random access
  • Large irregular strides
  • Pointer chasing
  • Hash table lookups

Optimization Strategies

Data Structure Design

  • Use contiguous arrays when possible
  • Consider SoA for partial field access
  • Align critical data to cache line boundaries (see the alignment sketch below)
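
A minimal C11 sketch of both techniques; PaddedCounter is a hypothetical example of giving hot data its own line (e.g. to avoid false sharing between threads):

```c
#include <stdlib.h>

#define CACHE_LINE 64

/* _Alignas pads and aligns the struct so each counter owns a full
 * 64-byte line, keeping other threads' data off that line. */
typedef struct {
    _Alignas(CACHE_LINE) long counter;
} PaddedCounter;

int main(void) {
    /* C11 aligned_alloc: the size must be a multiple of the alignment. */
    double *buf = aligned_alloc(CACHE_LINE, 1024 * sizeof(double));
    if (!buf) return 1;
    /* buf[0] now starts exactly on a cache line boundary. */
    free(buf);
    return 0;
}
```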

Algorithm Design

  • Process data in cache-friendly order
  • Block/tile algorithms for matrices (a tiled transpose is sketched after this list)
  • Minimize working set size
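
As an illustration, a cache-blocked transpose: the naive version is sequential on one side and strided by the row width on the other, while tiling keeps both sides cache-resident. B = 32 makes each tile of doubles 8 KB, so a pair of tiles fits comfortably in a 32 KB L1:

```c
#include <stddef.h>

enum { N = 1024, B = 32 };  /* 32 x 32 doubles = 8 KB per tile */

/* Tiled transpose: each B x B block of src and dst stays cache-resident
 * while it is worked on, instead of streaming a full N-element column. */
void transpose_tiled(double dst[N][N], const double src[N][N]) {
    for (size_t ii = 0; ii < N; ii += B)
        for (size_t jj = 0; jj < N; jj += B)
            for (size_t i = ii; i < ii + B; i++)
                for (size_t j = jj; j < jj + B; j++)
                    dst[j][i] = src[i][j];
}
```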

Loop Optimization

  • Interchange loops for sequential access
  • Tile/block for cache locality
  • Consider manual prefetch for irregular patterns (sketched below)
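
For the irregular case, GCC and Clang expose the __builtin_prefetch intrinsic. A sketch for linked-list traversal, where expensive_work is a hypothetical stand-in for whatever real per-node computation overlaps the prefetch:

```c
#include <stddef.h>

struct Node {
    struct Node *next;
    double value;
};

/* Stand-in for real per-node work long enough to overlap a memory miss. */
static double expensive_work(double v) { return v * v + 1.0; }

/* Pointer chasing defeats the hardware prefetcher because the next
 * address is unknown until the current node arrives. Issuing the load
 * for n->next before working on n overlaps the miss with useful work. */
double sum_list(const struct Node *n) {
    double s = 0.0;
    while (n != NULL) {
        /* Args: address, 0 = read access, 1 = low temporal locality. */
        __builtin_prefetch(n->next, 0, 1);
        s += expensive_work(n->value);
        n = n->next;
    }
    return s;
}
```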

Measuring Performance

Key Metrics

  • Cache Hit Rate: (Hits / Total Accesses) × 100
  • Memory Bandwidth: Bytes transferred per second
  • Cache Line Utilization: Useful bytes / 64 bytes
  • Prefetch Accuracy: Useful prefetches / Total prefetches
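
For example, a stride-8 scan over 8-byte doubles uses 8 of every 64 bytes transferred: a cache line utilization of 12.5%, the flip side of the 87.5% bandwidth waste noted for stride-8 above.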

Tools

  • Linux perf: perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program
  • Intel VTune: vtune -collect memory-access ./program

Best Practices

  1. Design for Sequential Access: Arrange data structures for linear traversal
  2. Minimize Stride: Keep related data close together
  3. Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
  4. Profile Real Workloads: Memory patterns vary by input
  5. Consider NUMA Effects: On multi-socket systems, accesses to memory on a remote node add latency, so access patterns interact with thread and data placement

Conclusion

Memory access patterns can make or break performance. Sequential access leverages spatial locality, full cache line transfers, and hardware prefetching for maximum performance. Strided access wastes bandwidth, thrashes caches, and defeats hardware prefetching. Understanding these patterns through visual exploration enables 10x+ performance improvements without algorithmic changes.

If you found this explanation helpful, consider sharing it with others.
