CPU Cache Lines: The Unit of Memory Transfer

Explore how CPU cache lines work, understand spatial locality, and see why memory access patterns dramatically impact performance through interactive visualizations.

Best viewed on desktop for optimal interactive experience

What are CPU Cache Lines?

A cache line is the unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn't fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors. This design exploits spatial locality: the principle that if you access a memory location, you're likely to access nearby locations soon.

Understanding cache lines is crucial for writing high-performance code. The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x in performance!
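
Curious what the line size is on your own machine? A minimal sketch, assuming Linux with glibc and a C++17 toolchain (both the sysconf query and the library constant are platform-dependent and may be unavailable elsewhere):

#include <iostream>
#include <new>        // std::hardware_destructive_interference_size (C++17, optional feature)
#include <unistd.h>   // sysconf (POSIX / Linux)

int main() {
    // Cache line size as reported by the OS (glibc; returns 0 or -1 if unknown)
    std::cout << "L1D cache line size (sysconf): "
              << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";

#ifdef __cpp_lib_hardware_interference_size
    // Compile-time hint from the standard library, usually 64 on x86-64
    std::cout << "hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
}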

Cache Line Concepts

Cache Line Size

64 bytes = 8 × 8-byte elements

Modern CPUs typically use 64-byte cache lines

Spatial Locality

Access one element, get 7 neighbors free!

CPU loads entire cache line on miss

Replacement Policy

LRU (Least Recently Used)

Oldest unused line gets evicted

Interactive Cache Line Visualization

Experience how different memory access patterns interact with CPU cache lines:

[Interactive visualization: a sequence of array accesses (elements [0]–[15]) flows through a 4-line L1 data cache (4 lines × 64 bytes = 256 bytes, with a per-line LRU age) backed by a 64-element main memory divided into 8 cache-line-sized blocks. The demo counts cache hits, misses, hit rate, bytes transferred, efficiency, evictions, and bytes wasted for each access pattern. With the sequential pattern (for (i = 0; i < N; i++) array[i]), cache line utilization is excellent: every line load brings in 8 useful elements.]

Key Insights

  • Cache lines are the unit of data transfer between memory and cache (typically 64 bytes)
  • Spatial locality means accessing nearby data is essentially free if it is in the same cache line
  • Sequential access maximizes cache line utilization (8 useful elements per load)
  • Strided access can waste 87.5% of the transferred data with a stride-8 pattern
  • Random access defeats spatial locality, causing frequent cache misses
  • False sharing occurs when multiple threads access different data in the same cache line

Why Cache Lines Matter

The Memory Hierarchy Gap

Modern CPUs can execute instructions in less than a nanosecond, but accessing main memory takes 60-100 nanoseconds. That's a 100x difference! Cache lines help bridge this gap by:

  1. Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than fetching 8 bytes
  2. Exploiting Spatial Locality: Programs often access nearby data
  3. Enabling Prefetching: Hardware can predict and load future cache lines
  4. Maximizing Memory Bandwidth: Efficient use of memory bus width

Cache Line Size

Modern x86-64 processors use 64-byte cache lines, which means:

  • 8 × 64-bit integers or doubles
  • 16 × 32-bit integers or floats
  • 64 × 8-bit characters

This size is carefully chosen to balance:

  • Transfer efficiency: Larger lines amortize memory latency
  • Cache pollution: Smaller lines waste less space on unused data
  • False sharing: Larger lines increase conflict probability
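
If you want the compiler to keep you honest about these counts, a tiny sketch with static_asserts does the trick (the 64-byte figure is an assumption about the target CPU, not something the C++ language guarantees):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed x86-64 cache line size

static_assert(kCacheLine / sizeof(double)       == 8,  "8 doubles per cache line");
static_assert(kCacheLine / sizeof(std::int32_t) == 16, "16 32-bit ints per cache line");
static_assert(kCacheLine / sizeof(char)         == 64, "64 chars per cache line");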

Memory Access Patterns

Sequential Access (Best Case)

// Excellent cache line utilization
for (int i = 0; i < N; i++) {
    sum += array[i];  // Uses all 8 elements per cache line
}

  • Efficiency: Near 100% of transferred bytes are used
  • Prefetcher: Hardware prefetchers love sequential patterns
  • Performance: Maximum memory bandwidth utilization

Strided Access (Poor)

// Wastes 87.5% of each cache line with stride-8
for (int i = 0; i < N; i += 8) {
    sum += array[i];  // Uses only 1 element per cache line
}

  • Efficiency: Only 12.5% of transferred bytes used
  • Prefetcher: May struggle with large strides
  • Performance: 8x more memory traffic than necessary

Random Access (Worst Case)

// Defeats spatial locality entirely
for (int i = 0; i < N; i++) {
    sum += array[random_index()];
}

  • Efficiency: Typically uses 1-2 elements per cache line
  • Prefetcher: Cannot predict random patterns
  • Performance: Maximum cache miss rate
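
To see these three patterns side by side on real hardware, you can time the same reduction with different index orders. A rough micro-benchmark sketch (the array is sized well beyond a typical last-level cache; absolute numbers vary with CPU, compiler flags, and memory, and the strided run performs only N/8 accesses, so compare per-access cost):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = std::size_t{1} << 24;   // 16M int64 elements (~128 MB)
    std::vector<std::int64_t> data(N, 1);

    // Three index orders over the same data: sequential, stride-8, shuffled
    std::vector<std::size_t> seq(N), strided, shuffled(N);
    std::iota(seq.begin(), seq.end(), std::size_t{0});
    for (std::size_t i = 0; i < N; i += 8) strided.push_back(i);   // one element per 64-byte line
    std::iota(shuffled.begin(), shuffled.end(), std::size_t{0});
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

    auto time_sum = [&](const std::vector<std::size_t>& idx, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        std::int64_t sum = 0;
        for (std::size_t i : idx) sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms  (accesses=" << idx.size() << ", sum=" << sum << ")\n";
    };

    time_sum(seq,      "sequential");
    time_sum(strided,  "stride-8  ");
    time_sum(shuffled, "random    ");
}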

Cache Replacement Policies

When the cache is full, the CPU must decide which cache line to evict. Common policies include:

LRU (Least Recently Used)

Most common in modern CPUs:

  • Tracks usage order of cache lines
  • Evicts the oldest unused line
  • Good for most workloads
  • Some overhead for tracking
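
The behaviour is easy to model in software. A toy sketch of a fully associative 4-line cache with true LRU, matching the visualization above (real caches are set associative and track recency far more cheaply):

#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>

// Toy fully associative cache with true LRU replacement; keys are cache-line numbers.
class LruCache {
public:
    explicit LruCache(std::size_t lines) : capacity_(lines) {}

    // Returns true on a hit; on a miss, inserts the line and evicts the LRU line if full.
    bool access(std::uint64_t address) {
        std::uint64_t line = address / 64;                      // 64-byte cache lines
        auto it = map_.find(line);
        if (it != map_.end()) {
            order_.splice(order_.begin(), order_, it->second);  // hit: mark as most recently used
            return true;
        }
        if (map_.size() == capacity_) {                         // full: evict least recently used
            map_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(line);
        map_[line] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<std::uint64_t> order_;   // front = most recent, back = least recent
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> map_;
};

int main() {
    LruCache cache(4);                        // 4 lines, like the visualization above
    int hits = 0, misses = 0;
    for (std::uint64_t i = 0; i < 64; i++) {  // 64 sequential 8-byte accesses
        if (cache.access(i * 8)) hits++; else misses++;
    }
    std::cout << "hits=" << hits << " misses=" << misses << "\n";  // expect 56 hits, 8 misses
}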

Pseudo-LRU

Approximates LRU with less overhead:

  • Uses bits to track approximate age
  • Not perfectly accurate but faster
  • Common in L3 caches
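
One widespread form is tree-PLRU: for a 4-way set, three bits form a tiny binary tree whose nodes always point away from recently used ways. A simplified sketch of the idea (actual implementations vary between vendors and cache levels):

#include <iostream>

// Tree-PLRU for a 4-way set: bits[0] picks between the pairs {0,1} and {2,3},
// bits[1] and bits[2] pick a way within each pair.
struct TreePlru {
    bool bits[3] = {false, false, false};

    // Record that `way` (0..3) was just used: flip nodes on its path to point away from it.
    void touch(int way) {
        bits[0] = (way < 2);               // point toward the pair NOT containing `way`
        if (way < 2) bits[1] = (way == 0); // within the left pair, point away from `way`
        else         bits[2] = (way == 2); // within the right pair, point away from `way`
    }

    // Follow the tree to pick the victim (an approximation of least recently used).
    int victim() const {
        if (bits[0]) return bits[2] ? 3 : 2;   // right pair
        else         return bits[1] ? 1 : 0;   // left pair
    }
};

int main() {
    TreePlru set;
    for (int way : {0, 1, 2, 3, 0}) set.touch(way);
    std::cout << "evict way " << set.victim() << "\n";  // not way 0, which was touched last
}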

Performance Implications

Data Structure Layout

Array of Structures (AoS)

struct Point {
    float x, y, z;
    float padding[5];  // 32 bytes total
};
Point points[1000];

// Poor: each 64-byte cache line holds two Points, but only their x fields are used
for (int i = 0; i < 1000; i++) {
    sum += points[i].x;  // 4 useful bytes out of every 32-byte Point
}

Structure of Arrays (SoA)

struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
};
Points points;

// Good: sequential access, full cache line utilization
for (int i = 0; i < 1000; i++) {
    sum += points.x[i];  // uses all 64 bytes of each loaded cache line
}

False Sharing

False sharing occurs when multiple threads access different data in the same cache line:

// BAD: false sharing — both counters sit in the same cache line
struct Counters {
    int thread1_count;  // bytes 0-3
    int thread2_count;  // bytes 4-7 (same cache line!)
};

// GOOD: alignment padding gives each counter its own cache line
struct PaddedCounters {
    alignas(64) int thread1_count;  // own cache line
    alignas(64) int thread2_count;  // own cache line
};
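
A rough way to observe the effect is to hammer both counters from two threads and compare the padded and unpadded layouts. A sketch using atomics so the compiler cannot collapse the loops (the slowdown from false sharing varies by CPU, but the unpadded version is typically several times slower):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Unpadded { std::atomic<int> a{0}; std::atomic<int> b{0}; };                  // likely share a cache line
struct Padded   { alignas(64) std::atomic<int> a{0}; alignas(64) std::atomic<int> b{0}; };

template <typename Counters>
double run() {
    Counters c;
    auto work = [](std::atomic<int>& counter) {
        for (int i = 0; i < 50'000'000; i++)
            counter.fetch_add(1, std::memory_order_relaxed);  // each thread bumps only its own field
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
    t1.join(); t2.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::cout << "unpadded: " << run<Unpadded>() << " ms\n";
    std::cout << "padded:   " << run<Padded>()   << " ms\n";
}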

Optimization Techniques

1. Loop Tiling/Blocking

Process data in cache-sized chunks:

// Process the matrix multiply in 8×8 tiles of C so each tile stays cache-resident
for (int ii = 0; ii < N; ii += 8) {
    for (int jj = 0; jj < N; jj += 8) {
        for (int i = ii; i < ii + 8; i++) {
            for (int j = jj; j < jj + 8; j++) {
                for (int k = 0; k < N; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
}

2. Prefetching

Manual prefetching for irregular patterns:

for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // prefetch the element needed ~8 iterations ahead (read access, low temporal locality)
        __builtin_prefetch(&data[indices[i + 8]], 0, 1);
    process(data[indices[i]]);
}

3. Data Structure Padding

Align critical data to cache line boundaries:

struct alignas(64) CriticalData {
    int important_value;
    char padding[60];  // fill the rest of the cache line
};

Measuring Cache Performance

Linux perf

# Measure cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Detailed cache analysis
perf stat -d ./program

Key Metrics

  • L1 Cache Hit Rate: Should be >95% for good performance
  • Cache Line Utilization: Bytes used / bytes transferred
  • False Sharing: Watch for high HITM (hit-modified) events

Best Practices

  1. Access Memory Sequentially: Design algorithms for linear access
  2. Keep Working Sets Small: Fit hot data in L1/L2 cache
  3. Align Data Structures: Respect cache line boundaries
  4. Avoid False Sharing: Pad shared data structures
  5. Profile Real Workloads: Cache behavior varies by data

Conclusion

Cache lines are fundamental to CPU performance. By understanding that memory moves in 64-byte chunks, you can write code that works with the hardware rather than against it. The 10-100x performance difference between cache-friendly and cache-unfriendly code makes this one of the most important optimizations in performance-critical applications.

Remember: Think in cache lines, not bytes!

If you found this explanation helpful, consider sharing it with others.
