Memory Access Patterns: Sequential vs Strided

Understand how different memory access patterns impact cache performance, prefetcher efficiency, and overall application speed through interactive visualizations.

Understanding Memory Access Patterns

Memory access patterns are one of the most critical factors affecting application performance. The way your code accesses memory determines cache efficiency, memory bandwidth utilization, and whether hardware optimizations like prefetching can help accelerate your program.

Modern processors are designed to exploit predictable memory access patterns; the gap between an optimal and a suboptimal pattern is often 10x or more in real-world performance.
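To see the gap on your own machine, here is a minimal C benchmark sketch (the array size, timing approach, and all names are illustrative choices, not taken from the demo below). Both loops pull the same number of 64-byte cache lines, so on many machines they take similar wall time even though the strided loop does one eighth of the useful work.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (1L << 24)  /* 16M doubles = 128 MB, well beyond any cache */

static double seconds(struct timespec a, struct timespec b) {
  return (double)(b.tv_sec - a.tv_sec) + (double)(b.tv_nsec - a.tv_nsec) / 1e9;
}

int main(void) {
  double *array = malloc(N * sizeof *array);
  if (!array) return 1;
  for (long i = 0; i < N; i++) array[i] = 1.0;

  struct timespec t0, t1;
  double sum = 0.0;

  /* Sequential: N additions, all 8 doubles of each 64-byte line are used */
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < N; i++) sum += array[i];
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("sequential: %.3f s\n", seconds(t0, t1));

  /* Strided: N/8 additions, but the same number of lines is loaded -
     only one double per 64-byte line is ever used */
  clock_gettime(CLOCK_MONOTONIC, &t0);
  for (long i = 0; i < N; i += 8) sum += array[i];
  clock_gettime(CLOCK_MONOTONIC, &t1);
  printf("strided:    %.3f s\n", seconds(t0, t1));

  printf("sum = %f\n", sum);  /* keeps the loops from being optimized away */
  free(array);
  return 0;
}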

Interactive Memory Access Pattern Demo

Experience the dramatic performance difference between sequential and strided memory access patterns:

Sequential Access Pattern

for (i = 0; i < N; i++) {
  sum += array[i];
}

✓ Optimal cache utilization
✓ Prefetcher friendly
✓ Maximum memory bandwidth

Strided Access Pattern

for (i = 0; i < N; i += 8) {
  sum += array[i];
}

✗ Poor cache utilization
✗ Confuses prefetcher
✗ Wastes memory bandwidth

Access Sequence Visualization

Sequential: [0] [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] [11] [12] [13] [14] [15]
Strided:    [0] [8] [16] [24] [32] [40] [48] [56]

[Animated visualization: for each pattern, data flows from Main Memory (DDR4) through the Hardware Prefetcher into the L1 Cache (64 KB, 64-byte lines), with live hit-rate and bandwidth counters. Sequential access shows "Pattern detected!"; strided access shows "Pattern unclear...". Legend: Cache Hit, Cache Miss, Prefetched, Currently Accessing.]

Key Insights:

  • Sequential access maximizes spatial locality - when you access element N, you'll likely need N+1 soon
  • Cache lines are 64 bytes, holding 8 doubles or 16 integers
  • Strided access wastes cache space by loading full lines but using only one element
  • Hardware prefetchers detect sequential patterns and load data before it's needed
  • Large strides can cause cache thrashing, where useful data is evicted before use

Why Access Patterns Matter

The Memory Hierarchy

Modern computers have a multi-level memory hierarchy:

Level         Size         Latency          Bandwidth
L1 Cache      32-64 KB     1-4 cycles       3+ TB/s
L2 Cache      256-512 KB   10-20 cycles     1+ TB/s
L3 Cache      8-32 MB      30-70 cycles     500+ GB/s
Main Memory   8-64 GB      100-300 cycles   50-100 GB/s

The key insight: a load that hits in L1 cache can be roughly 100x faster than one that goes all the way to main memory!

Cache Lines: The Unit of Transfer

  • Memory is transferred in cache lines (typically 64 bytes)
  • Loading one byte loads the entire 64-byte line
  • This is why spatial locality matters so much
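The sketch below (illustrative, not from the article's demo) makes that granularity visible: it prints which 64-byte line each array element falls in, and elements that share a line index are fetched together by a single transfer.

#include <stdint.h>
#include <stdio.h>

#define CACHE_LINE 64  /* bytes; typical for current x86-64 and ARM cores */

int main(void) {
  double array[16];
  for (int i = 0; i < 16; i++) {
    uintptr_t addr = (uintptr_t)&array[i];
    /* Consecutive elements map to the same line until a 64-byte
       boundary is crossed */
    printf("array[%2d] -> line %lu\n", i, (unsigned long)(addr / CACHE_LINE));
  }
  printf("%zu doubles or %zu ints fit in one line\n",
         CACHE_LINE / sizeof(double), CACHE_LINE / sizeof(int));
  return 0;
}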

Sequential Access Pattern

// Optimal pattern - accesses memory consecutively
for (int i = 0; i < N; i++) {
  sum += array[i];
}

Why Sequential Access is Fast

  1. Spatial Locality: After accessing array[i], you'll likely need array[i+1] next
  2. Cache Line Utilization: Uses all 64 bytes loaded into cache
  3. Prefetcher Friendly: Hardware detects pattern and loads ahead
  4. Minimal Cache Pollution: Doesn't evict useful data unnecessarily

Performance Characteristics

  • Cache Hit Rate: 87.5% for doubles (the first access to each 64-byte line misses, the next 7 hit)
  • Memory Bandwidth: Fully utilized
  • Prefetcher: Highly effective

Strided Access Pattern

// Suboptimal pattern - jumps through memory
for (int i = 0; i < N; i += stride) {
  sum += array[i];
}

Why Strided Access is Slow

  1. Poor Spatial Locality: Loads 64 bytes but uses only 4-8 bytes
  2. Cache Thrashing: Limited cache filled with mostly unused data
  3. Prefetcher Confusion: Unpredictable pattern defeats prefetching
  4. Bandwidth Waste: Transfers much more data than needed

Performance Characteristics

  • Cache Hit Rate: ~12.5% or worse for large strides, reaching 0% once the stride spans a full cache line (every access then loads a new line)
  • Memory Bandwidth: Poorly utilized
  • Prefetcher: Ineffective or counterproductive

Real-World Examples

Matrix Operations

Row-Major Access (Good)

// Accessing matrix row by row
for (int i = 0; i < rows; i++) {
  for (int j = 0; j < cols; j++) {
    sum += matrix[i][j];  // Sequential in memory
  }
}

Column-Major Access (Poor)

// Accessing matrix column by column
for (int j = 0; j < cols; j++) {
  for (int i = 0; i < rows; i++) {
    sum += matrix[i][j];  // Strided by row size
  }
}

Struct of Arrays vs Array of Structs

Array of Structs (AoS) - Poor for single field

struct Particle {
  float x, y, z;
  float vx, vy, vz;
  float mass;
  int type;
};
Particle particles[N];

// Accessing only x coordinates - stride of 32 bytes
for (int i = 0; i < N; i++) {
  avgX += particles[i].x;
}

Struct of Arrays (SoA) - Good for single field

struct Particles {
  float x[N], y[N], z[N];
  float vx[N], vy[N], vz[N];
  float mass[N];
  int type[N];
};
Particles particles;

// Accessing x coordinates - sequential
for (int i = 0; i < N; i++) {
  avgX += particles.x[i];
}

Hardware Prefetching

Modern CPUs include sophisticated prefetchers that:

  1. Detect Patterns: Sequential, stride, stream patterns
  2. Load Ahead: Bring data into cache before it's needed
  3. Multiple Prefetchers: L1, L2, and L3 each have prefetch units
  4. Adaptive: Learn and adjust to access patterns
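A STREAM-triad-style kernel is a good example of the stream case (a sketch; the function name and signature are illustrative). It presents three independent sequential streams, and per-stream prefetch units can run ahead on each of them simultaneously:

/* Three concurrent sequential streams: prefetchers track a, b,
   and c independently and stay ahead of the loop. */
void triad(int n, double *a, const double *b, const double *c, double scale) {
  for (int i = 0; i < n; i++)
    a[i] = b[i] + scale * c[i];
}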

Prefetcher-Friendly Patterns

  • Sequential access
  • Fixed stride (if not too large)
  • Stream processing
  • Linear array traversal

Prefetcher-Unfriendly Patterns

  • Random access
  • Large irregular strides
  • Pointer chasing
  • Hash table lookups
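Pointer chasing is the hardest of these for the hardware: the address of each load is known only after the previous load completes, so there is nothing for the prefetcher to work with. A minimal sketch (node layout illustrative):

#include <stddef.h>

struct Node {
  int value;
  struct Node *next;  /* the successor can live anywhere in memory */
};

/* Every iteration is a serialized, data-dependent load: the CPU
   cannot even compute the next address until n->next arrives. */
long sum_list(const struct Node *head) {
  long sum = 0;
  for (const struct Node *n = head; n != NULL; n = n->next)
    sum += n->value;
  return sum;
}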

Optimization Techniques

1. Data Structure Layout

  • Use contiguous arrays when possible
  • Consider SoA for partial field access
  • Align data to cache line boundaries
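For the alignment point above, C11 provides aligned_alloc and _Alignas; here is a sketch assuming the common 64-byte line size:

#include <stdlib.h>

#define CACHE_LINE 64

/* The aligned member forces the whole struct to 64-byte alignment
   (and a 64-byte size), so adjacent Counters never share a line -
   which also helps against false sharing between threads. */
struct Counter {
  _Alignas(CACHE_LINE) long value;
};

int main(void) {
  /* aligned_alloc (C11) requires size to be a multiple of the alignment */
  double *buf = aligned_alloc(CACHE_LINE, 1024 * sizeof(double));
  if (!buf) return 1;
  buf[0] = 1.0;
  free(buf);
  return 0;
}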

2. Algorithm Design

  • Process data in cache-friendly order
  • Block/tile algorithms for matrices
  • Minimize working set size
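As an example of blocking, here is a cache-blocked matrix transpose sketch (the tile size is a tunable guess, not a universal constant). Within each tile, both the sequential reads and the strided writes stay cache-resident:

#define B 32  /* tile side; two 32x32 double tiles (16 KB) fit in L1 */

void transpose(int n, const double *src, double *dst) {
  for (int ii = 0; ii < n; ii += B)
    for (int jj = 0; jj < n; jj += B)
      /* Only one B x B tile of src and dst is live at a time, so the
         column-wise writes to dst keep hitting cached lines. */
      for (int i = ii; i < ii + B && i < n; i++)
        for (int j = jj; j < jj + B && j < n; j++)
          dst[j * n + i] = src[i * n + j];
}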

3. Loop Transformations

// Loop interchange
// Poor: inner loop strides through rows
for (j = 0; j < M; j++)
  for (i = 0; i < N; i++)
    A[i][j] = B[i][j];

// Better: inner loop walks memory sequentially
for (i = 0; i < N; i++)
  for (j = 0; j < M; j++)
    A[i][j] = B[i][j];

4. Prefetch Instructions

// Manual prefetching for irregular patterns
for (int i = 0; i < N; i++) {
  if (i + 8 < N)  // don't read index[] past the end
    __builtin_prefetch(&data[index[i + 8]], 0, 1);  // read, low temporal locality
  process(data[index[i]]);
}

Measuring Access Patterns

Performance Counters

# Linux perf
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Intel VTune
vtune -collect memory-access ./program

Key Metrics

  • Cache Hit Rate: (Hits / Total Accesses) × 100
  • Memory Bandwidth: Bytes transferred per second
  • Cache Line Utilization: Useful bytes / 64 bytes
  • Prefetch Accuracy: Useful prefetches / Total prefetches
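Plugging hypothetical counter values into these formulas (the numbers below are invented for illustration, not measured):

#include <stdio.h>

int main(void) {
  /* Invented example counts - substitute values from perf stat */
  double loads  = 1e9;      /* L1-dcache-loads */
  double misses = 1.25e8;   /* L1-dcache-load-misses */

  double hit_rate = (loads - misses) / loads * 100.0;  /* = 87.5% */
  /* Each L1 miss pulls one 64-byte line from a lower level */
  double line_traffic = misses * 64.0;

  printf("hit rate: %.1f%%\n", hit_rate);
  printf("traffic below L1: %.1f GB\n", line_traffic / 1e9);
  return 0;
}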

Best Practices

  1. Favor Sequential Access: Design data structures for linear traversal
  2. Minimize Stride: Keep related data close together
  3. Use Cache-Aware Algorithms: Block matrix multiply, tiled convolution
  4. Profile Real Workloads: Memory patterns vary by input
  5. Consider NUMA Effects: on multi-socket systems, the same pattern can be fast against local memory and much slower against a remote node

Understanding memory access patterns connects to:

  • Memory Interleaving: How addresses map to banks
  • NUMA Architecture: Access patterns across NUMA nodes
  • Cache Coherency: How shared data affects access patterns
  • Memory Bandwidth: Theoretical vs achieved bandwidth
  • Vectorization: SIMD instructions and memory patterns

Conclusion

Memory access patterns can make or break application performance. Sequential access leverages spatial locality, cache line transfers, and hardware prefetching to achieve maximum performance. Strided access wastes bandwidth, thrashes caches, and defeats optimization. By understanding and optimizing access patterns, you can often achieve 10x or greater performance improvements without changing your algorithm's complexity.
