CPU Cache Lines: The Unit of Memory Transfer
Explore how CPU cache lines work, understand spatial locality, and see why memory access patterns dramatically impact performance through interactive visualizations.
What are CPU Cache Lines?
A cache line is the unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn't fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors. This design exploits spatial locality: the principle that if you access a memory location, you're likely to access nearby locations soon.
Understanding cache lines is crucial for writing high-performance code. The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x in performance!
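A quick way to see this is to compute which 64-byte block each element's address falls into; elements that land in the same block share a cache line and are fetched together. A minimal sketch (the 64-byte constant is an assumption, and the array itself may not start on a line boundary):

#include <cstddef>
#include <cstdint>
#include <cstdio>

int main() {
    constexpr std::size_t kLineSize = 64;   // assumed cache line size
    double a[16];
    for (int i = 0; i < 16; i++) {
        // Elements whose addresses fall in the same 64-byte-aligned block live
        // in the same cache line, so fetching a[0] also brings in its neighbors.
        auto block = reinterpret_cast<std::uintptr_t>(&a[i]) / kLineSize;
        std::printf("&a[%2d] is in 64-byte block %llu\n", i,
                    static_cast<unsigned long long>(block));
    }
    return 0;
}

Running this typically shows runs of 8 consecutive doubles reporting the same block number: touching any one of them pulls in the other 7.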
Cache Line Concepts
Cache Line Size
64 bytes = 8 × 8-byte elements
Modern CPUs typically use 64-byte cache lines
Spatial Locality
Access one element, get 7 neighbors free!
CPU loads entire cache line on miss
Replacement Policy
LRU (Least Recently Used)
Oldest unused line gets evicted
Interactive Cache Line Visualization
Experience how different memory access patterns interact with CPU cache lines:
[Interactive demo: an L1 data cache of 4 lines × 64 bytes (256 bytes total) sits in front of main memory divided into 8 cache-line-sized blocks. As you step through an access pattern, the demo tracks cache hits, cache misses, hit rate, bytes transferred, and efficiency, and an analysis panel explains the current pattern. With the default sequential pattern, for (i = 0; i < N; i++) array[i], each cache line load brings in 8 useful elements, giving excellent cache line utilization.]
Key Insights
• Cache lines are the unit of data transfer between memory and cache (typically 64 bytes)
• Spatial locality means accessing nearby data is essentially free if it's in the same cache line
• Sequential access maximizes cache line utilization (8 useful elements per load)
• Strided access can waste 87.5% of transferred data with stride-8 pattern
• Random access defeats spatial locality, causing frequent cache misses
• False sharing occurs when multiple threads access different data in the same cache line
Why Cache Lines Matter
The Memory Hierarchy Gap
Modern CPUs can execute instructions in less than a nanosecond, but accessing main memory takes 60-100 nanoseconds. That's a 100x difference! Cache lines help bridge this gap by:
- Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than fetching 8 bytes
- Exploiting Spatial Locality: Programs often access nearby data
- Enabling Prefetching: Hardware can predict and load future cache lines
- Maximizing Memory Bandwidth: Efficient use of memory bus width
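As a rough back-of-envelope illustration (the numbers are ballpark assumptions, not measurements): if a memory access costs roughly 80 ns and always delivers a full 64-byte line, code that consumes all 8 of the 8-byte elements in that line pays about 10 ns per element, while code that uses only one of them pays the full 80 ns for the same 8 useful bytes.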
Cache Line Size
Modern x86-64 processors use 64-byte cache lines, which means:
- 8 × 64-bit integers or doubles
- 16 × 32-bit integers or floats
- 64 × 8-bit characters
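A quick compile-time sanity check of those counts (the 64-byte constant is an assumption here, not a query of the actual hardware):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kLineSize = 64;   // assumed; real hardware may differ

static_assert(kLineSize / sizeof(std::int64_t) == 8,  "8 x 8-byte integers per line");
static_assert(kLineSize / sizeof(float)        == 16, "16 x 4-byte floats per line");
static_assert(kLineSize / sizeof(char)         == 64, "64 x 1-byte characters per line");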
This size is carefully chosen to balance:
- Transfer efficiency: Larger lines amortize memory latency
- Cache pollution: Smaller lines waste less space on unused data
- False sharing: Larger lines increase conflict probability
Memory Access Patterns
Sequential Access (Best Case)
// Excellent cache line utilization
for (int i = 0; i < N; i++) {
    sum += array[i];   // Uses all 8 elements per cache line
}
- Efficiency: Near 100% of transferred bytes are used
- Prefetcher: Hardware prefetchers love sequential patterns
- Performance: Maximum memory bandwidth utilization
Strided Access (Poor)
// Wastes 87.5% of each cache line with stride-8
for (int i = 0; i < N; i += 8) {
    sum += array[i];   // Uses only 1 element per cache line
}
- Efficiency: Only 12.5% of transferred bytes used
- Prefetcher: May struggle with large strides
- Performance: 8x more memory traffic than necessary
Random Access (Worst Case)
// Defeats spatial locality entirely
for (int i = 0; i < N; i++) {
    sum += array[random_index()];
}
- Efficiency: Typically uses 1-2 elements per cache line
- Prefetcher: Cannot predict random patterns
- Performance: Maximum cache miss rate
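A minimal micro-benchmark sketch contrasting the three patterns above (the sizes, the index-vector approach, and the timing are simplifications; a careful measurement would pin the thread, warm up, and repeat runs):

#include <algorithm>
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <numeric>
#include <random>
#include <vector>

// Sum array elements in the order given by `idx` and report the elapsed time.
static double time_pass(const std::vector<std::int64_t>& a,
                        const std::vector<std::size_t>& idx) {
    auto start = std::chrono::steady_clock::now();
    std::int64_t sum = 0;
    for (std::size_t i : idx) sum += a[i];
    auto stop = std::chrono::steady_clock::now();
    volatile std::int64_t sink = sum;   // keep the compiler from removing the loop
    (void)sink;
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    const std::size_t N = 1 << 24;            // 16M x 8 bytes = 128 MiB, much larger than cache
    std::vector<std::int64_t> a(N, 1);

    std::vector<std::size_t> seq(N), strided, rnd(N);
    std::iota(seq.begin(), seq.end(), std::size_t{0});           // 0, 1, 2, ...
    for (std::size_t i = 0; i < N; i += 8) strided.push_back(i); // one element per 64-byte line
    std::iota(rnd.begin(), rnd.end(), std::size_t{0});
    std::shuffle(rnd.begin(), rnd.end(), std::mt19937_64{42});

    std::printf("sequential: %8.1f ms\n", time_pass(a, seq));
    std::printf("strided-8 : %8.1f ms (1/8 the additions, yet touches every line)\n",
                time_pass(a, strided));
    std::printf("random    : %8.1f ms\n", time_pass(a, rnd));
}

On most machines the sequential pass is the fastest per element and the random pass by far the slowest, even though every pass is just summing integers.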
Cache Replacement Policies
When the cache is full, the CPU must decide which cache line to evict. Common policies include:
LRU (Least Recently Used)
Most common in modern CPUs:
- Tracks usage order of cache lines
- Evicts the oldest unused line
- Good for most workloads
- Some overhead for tracking
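A toy sketch of exact LRU over a tiny fully associative cache, mirroring the 4-line cache in the demo above (this models only the policy, not real hardware, which tracks LRU state per set):

#include <cstddef>
#include <cstdio>
#include <list>
#include <unordered_map>

// Fully associative cache of `capacity` lines with exact LRU replacement.
struct LruCache {
    std::size_t capacity;
    std::list<long> order;                                      // front = most recently used
    std::unordered_map<long, std::list<long>::iterator> where;  // line tag -> position in `order`

    bool access(long tag) {
        auto it = where.find(tag);
        if (it != where.end()) {                        // hit: move the line to the front
            order.splice(order.begin(), order, it->second);
            return true;
        }
        if (order.size() == capacity) {                 // miss with full cache: evict the LRU line
            where.erase(order.back());
            order.pop_back();
        }
        order.push_front(tag);                          // install the new line as most recent
        where[tag] = order.begin();
        return false;
    }
};

int main() {
    LruCache cache{4};                                  // 4 lines, like the demo above
    int hits = 0, total = 0;
    for (long addr = 0; addr < 512; addr += 8) {        // sequential 8-byte loads
        hits += cache.access(addr / 64);                // 64-byte cache line tag
        ++total;
    }
    std::printf("sequential: %d/%d hits\n", hits, total);  // 56/64: 7 of every 8 accesses hit
}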
Pseudo-LRU
Approximates LRU with less overhead:
- Uses bits to track approximate age
- Not perfectly accurate but faster
- Common in L3 caches
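A toy sketch of tree-based pseudo-LRU for a single 4-way set, using three bits instead of a full ordering (this is one common approximation; real implementations vary):

#include <cstdio>

// Tree-PLRU for one 4-way set: three bits replace a full LRU ordering.
// b0 selects the left/right pair of ways; b1 and b2 select a way inside each pair.
// On an access, flip the bits on the path to point AWAY from the touched way;
// to pick a victim, FOLLOW the bits.
struct TreePlru4 {
    bool b0 = false, b1 = false, b2 = false;

    void touch(int way) {
        if (way < 2) { b0 = true;  b1 = (way == 0); }   // left pair touched
        else         { b0 = false; b2 = (way == 2); }   // right pair touched
    }
    int victim() const {
        return !b0 ? (b1 ? 1 : 0)    // bits point at the left pair
                   : (b2 ? 3 : 2);   // bits point at the right pair
    }
};

int main() {
    TreePlru4 set;
    for (int way : {0, 1, 2, 0}) set.touch(way);
    std::printf("next victim: way %d\n", set.victim());  // way 3: the only way never touched
}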
Performance Implications
Data Structure Layout
Array of Structures (AoS)
struct Point {
    float x, y, z;
    float padding[5];   // 32 bytes total
};
Point points[1000];

// Poor: each 32-byte Point spans half a cache line,
// but only the 4 bytes of x are actually needed
for (int i = 0; i < 1000; i++) {
    sum += points[i].x;   // Wastes 28 of the 32 bytes fetched per point
}
Structure of Arrays (SoA)
struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
};
Points points;

// Good: sequential access, full cache line utilization
for (int i = 0; i < 1000; i++) {
    sum += points.x[i];   // Uses all 64 bytes of each loaded line
}
False Sharing
False sharing occurs when multiple threads access different data in the same cache line:
// BAD: false sharing, both counters share one cache line
struct Counter {
    int thread1_count;   // Bytes 0-3
    int thread2_count;   // Bytes 4-7 (same cache line!)
};

// GOOD: alignment gives each counter its own cache line
struct PaddedCounter {
    alignas(64) int thread1_count;   // Own cache line
    alignas(64) int thread2_count;   // Own cache line
};
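A minimal sketch that makes the effect measurable with two threads (atomic counters are used here so the increments are not optimized away; the iteration count is arbitrary, and the observed gap depends heavily on the machine):

#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct SharedCounters {                  // both counters live in the same cache line
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

struct PaddedCounters {                  // each counter gets its own cache line
    alignas(64) std::atomic<long> a{0};
    alignas(64) std::atomic<long> b{0};
};

// Two threads hammer independent counters; only the layout differs.
template <typename Counters>
static double run(Counters& c) {
    auto start = std::chrono::steady_clock::now();
    std::thread t1([&] { for (int i = 0; i < 50'000'000; i++) c.a.fetch_add(1, std::memory_order_relaxed); });
    std::thread t2([&] { for (int i = 0; i < 50'000'000; i++) c.b.fetch_add(1, std::memory_order_relaxed); });
    t1.join();
    t2.join();
    auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

int main() {
    SharedCounters s;
    PaddedCounters p;
    std::printf("same cache line : %.0f ms\n", run(s));   // line ping-pongs between cores
    std::printf("padded          : %.0f ms\n", run(p));   // typically several times faster
}

The padded version is usually several times faster because each core can keep its counter's cache line in the Modified state without repeatedly invalidating the other core's copy.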
Optimization Techniques
1. Loop Tiling/Blocking
Process data in cache-sized chunks:
// Process the matrices in 8×8 blocks so each block fits in cache
for (int ii = 0; ii < N; ii += 8) {
    for (int jj = 0; jj < N; jj += 8) {
        for (int kk = 0; kk < N; kk += 8) {
            for (int i = ii; i < ii + 8; i++) {
                for (int k = kk; k < kk + 8; k++) {
                    for (int j = jj; j < jj + 8; j++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
2. Prefetching
Manual prefetching for irregular patterns:
// Prefetch the element we will need 8 iterations from now
for (int i = 0; i < N; i++) {
    if (i + 8 < N)
        __builtin_prefetch(&data[indices[i + 8]], 0, 1);
    process(data[indices[i]]);
}
3. Data Structure Padding
Align critical data to cache line boundaries:
struct alignas(64) CriticalData {
    int important_value;
    char padding[60];   // Fill out the rest of the cache line
};
Measuring Cache Performance
Linux perf
# Measure L1 data cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Detailed cache analysis
perf stat -d ./program
Key Metrics
- L1 Cache Hit Rate: Should be >95% for good performance
- Cache Line Utilization: Bytes used / bytes transferred
- False Sharing: Watch for high HITM (hit-modified) events
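For hunting false sharing specifically, perf c2c (cache-to-cache) on Linux reports the cache lines with the most HITM activity; a typical invocation:

# Record cache-line contention, then report the hottest shared lines
perf c2c record ./program
perf c2c report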
Best Practices
- Access Memory Sequentially: Design algorithms for linear access
- Keep Working Sets Small: Fit hot data in L1/L2 cache
- Align Data Structures: Respect cache line boundaries
- Avoid False Sharing: Pad shared data structures
- Profile Real Workloads: Cache behavior varies by data
Related Concepts
Understanding cache lines connects to:
- Memory Access Patterns: How access patterns affect cache efficiency
- NUMA Architecture: Cache coherency across sockets
- Memory Interleaving: How addresses map to memory banks
- Cache Coherency: MESI protocol and multi-core caching
- Prefetching: Hardware and software prefetch strategies
Conclusion
Cache lines are fundamental to CPU performance. By understanding that memory moves in 64-byte chunks, you can write code that works with the hardware rather than against it. The 10-100x performance difference between cache-friendly and cache-unfriendly code makes this one of the most important optimizations in performance-critical applications.
Remember: Think in cache lines, not bytes!