CPU Cache Lines: The Unit of Memory Transfer
Explore how CPU cache lines work, understand spatial locality, and see why memory access patterns dramatically impact performance through interactive visualizations.
A cache line is the fundamental unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn't fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors.
This design exploits spatial locality: if you access a memory location, you're likely to access nearby locations soon.
Performance Impact: The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x!
Cache Line Concepts
Cache Line Size
64 bytes = 8 × 8-byte elements
Modern CPUs typically use 64-byte cache lines
Spatial Locality
Access one element, get 7 neighbors free!
CPU loads entire cache line on miss
Replacement Policy
LRU (Least Recently Used)
The least recently used line is evicted first
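The 64-byte figure is easy to confirm on your own machine. Here is a minimal sketch, assuming Linux with glibc, where sysconf reports the L1 data cache line size (getconf LEVEL1_DCACHE_LINESIZE gives the same answer from the shell):

```c
#include <stdio.h>
#include <unistd.h>

int main(void) {
    /* glibc extension: reports the L1 data cache line size in bytes,
     * or a non-positive value if the information is unavailable. */
    long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    if (line > 0)
        printf("L1 data cache line size: %ld bytes\n", line);
    else
        printf("Not reported; 64 bytes is a safe assumption on x86-64\n");
    return 0;
}
```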
Interactive Cache Line Visualization
See exactly how different memory access patterns interact with 64-byte cache lines:
The walkthrough starts with the CPU requesting 8 bytes (one 64-bit integer) at address 0. That single miss pulls in the entire 64-byte cache line, and as the sequential reads continue, every loaded byte gets used: spatial locality is perfect, memory bandwidth is 100% utilized, and one memory access makes 8 elements available, so each cache line yields 7 additional "free" accesses.
💡 Why 64-byte cache lines?
Memory transfers have high fixed cost (~100 cycles). Transferring 64 bytes barely costs more than 8 bytes!
Sequential access exploits spatial locality: nearby data accessed together.
Result: 1 slow memory access provides 8 fast cache accesses.
Why Cache Lines Exist
The Memory Hierarchy Gap
Modern CPUs can execute several instructions per nanosecond, yet a main-memory access takes 60-100 nanoseconds, roughly a 100x gap.
Cache lines help bridge this gap by:
- Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than 8 bytes
- Exploiting Spatial Locality: Programs often access nearby data
- Enabling Prefetching: Hardware can predict and load future cache lines
- Maximizing Memory Bandwidth: Efficient use of memory bus width
Cache Line Size: Why 64 Bytes?
Modern x86-64 processors use 64-byte cache lines:
- 8 × 64-bit integers or doubles
- 16 × 32-bit integers or floats
- 64 × 8-bit characters
This size balances competing pressures:
- Transfer efficiency: Larger lines amortize the fixed memory latency over more data
- Cache pollution: Smaller lines waste less cache capacity on bytes that never get used
- False sharing: Larger lines make it more likely that unrelated data used by different threads lands in the same line
Access Pattern Impact
Sequential Access (Best Case)
Efficiency: Near 100% of transferred bytes are used
- Uses all 8 elements per cache line
- Hardware prefetchers excel with sequential patterns
- Maximum memory bandwidth utilization
Strided Access (Poor)
Efficiency: Only 12.5% with stride-8
- Uses only 1 element per 64-byte cache line
- Prefetcher struggles with large strides
- 8x more memory traffic than necessary
Random Access (Worst Case)
Efficiency: Typically 1-2 elements per cache line
- Defeats spatial locality entirely
- Prefetcher cannot predict random patterns
- Maximum cache miss rate
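To make these patterns concrete, here is a minimal sketch in C of the three traversals over the same large array; the timing harness and the shuffled index array are assumed to exist elsewhere:

```c
#include <stddef.h>

#define N (1u << 24)   /* 16M 8-byte integers (128 MiB), far larger than any cache */

/* Sequential: every element of each 64-byte line is used. */
long sum_sequential(const long *a) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[i];
    return s;
}

/* Stride-8: one 8-byte element used per 64-byte line (12.5% utilization). */
long sum_strided(const long *a) {
    long s = 0;
    for (size_t i = 0; i < N; i += 8)
        s += a[i];
    return s;
}

/* Random: 'order' is a pre-shuffled permutation of 0..N-1, so almost
 * every access lands on a cold cache line and defeats the prefetcher. */
long sum_random(const long *a, const size_t *order) {
    long s = 0;
    for (size_t i = 0; i < N; i++)
        s += a[order[i]];
    return s;
}
```

Note that the strided version performs only one-eighth of the additions yet touches the same number of cache lines as the sequential version, which is why it often runs no faster on arrays larger than the cache.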
Common Performance Issues
False Sharing
False sharing occurs when multiple threads access different data in the same cache line:
Problem: Each write invalidates the cache line for other cores, causing cache coherency traffic.
Solution: Pad shared data structures to align to cache line boundaries (64-byte alignment).
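A minimal sketch of the padding fix, assuming C11's alignas; the counter names and thread count are illustrative:

```c
#include <stdalign.h>

/* Problematic layout: four per-thread counters packed into one 64-byte
 * line, so every increment by one thread invalidates the line that the
 * other cores are using. */
struct counters_shared {
    long count[4];
};

/* Fixed layout: each counter aligned (and therefore padded) to its own
 * 64-byte cache line, eliminating the false sharing. */
struct counter_padded {
    alignas(64) long count;      /* sizeof(struct counter_padded) == 64 */
};

struct counter_padded per_thread[4];   /* one cache line per thread */
```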
Data Structure Layout
Array of Structs (AoS): Poor when accessing single fields
- Each element may span multiple cache lines
- Wastes bandwidth on unused fields
Struct of Arrays (SoA): Good for single field access
- Sequential access to each field
- Full cache line utilization
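A minimal sketch contrasting the two layouts, assuming a loop that only ever reads the x field; the particle fields are illustrative:

```c
#define N 10000

/* Array of Structs: x, y, z and mass are interleaved, so a loop over x
 * still drags the other fields through the cache. */
struct particle { double x, y, z, mass; };
struct particle aos[N];

/* Struct of Arrays: all x values are contiguous, so a loop over them
 * uses every byte of every cache line it pulls in. */
struct particles {
    double x[N];
    double y[N];
    double z[N];
    double mass[N];
};
struct particles soa;

double sum_x_aos(void) {
    double s = 0;
    for (int i = 0; i < N; i++) s += aos[i].x;   /* 8 of every 32 bytes fetched are used */
    return s;
}

double sum_x_soa(void) {
    double s = 0;
    for (int i = 0; i < N; i++) s += soa.x[i];   /* all 64 bytes of each line are used */
    return s;
}
```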
Optimization Techniques
1. Loop Tiling/Blocking
Process data in cache-sized chunks to maximize reuse
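A minimal sketch of blocking, using a row-major matrix transpose as the example; the matrix and tile sizes are illustrative assumptions:

```c
#define N    4096
#define TILE 32     /* two 32x32 tiles of doubles (~16 KiB) fit in a typical 32 KiB L1 */

/* Naive transpose: dst is written column-by-column, so every store
 * touches a new cache line and lines are evicted before being reused. */
void transpose_naive(double *dst, const double *src) {
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            dst[j * N + i] = src[i * N + j];
}

/* Tiled transpose: each TILE x TILE block is read and written while its
 * cache lines are still resident, so every line is used in full. */
void transpose_tiled(double *dst, const double *src) {
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j * N + i] = src[i * N + j];
}
```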
2. Data Structure Padding
Align critical data to cache line boundaries to prevent false sharing
3. Algorithm Design
Choose algorithms that process data in cache-friendly order
4. Manual Prefetching
Use prefetch instructions for irregular access patterns
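A minimal sketch using the GCC/Clang __builtin_prefetch intrinsic on an index-driven gather, where the hardware prefetcher cannot guess the next address but the code can; the look-ahead distance of 16 is an illustrative guess that would need tuning:

```c
/* Gather through an index array: a[idx[i]] is irregular, so we prefetch
 * the cache line that will be needed a few iterations from now. */
double gather_sum(const double *a, const int *idx, int n) {
    double s = 0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&a[idx[i + 16]], 0 /* read */, 1 /* low temporal locality */);
        s += a[idx[i]];
    }
    return s;
}
```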
Measuring Cache Performance
Tools
- Linux perf: perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program
- Intel VTune: Memory access analysis
Key Metrics
- L1 Cache Hit Rate: Should be >95% for good performance
- Cache Line Utilization: Bytes used / bytes transferred
- False Sharing: Watch for HITM events (a load that hits a line modified in another core's cache)
Best Practices
- Access Memory Sequentially: Design algorithms for linear access patterns
- Keep Working Sets Small: Fit hot data in L1/L2 cache
- Align Data Structures: Respect cache line boundaries
- Avoid False Sharing: Pad shared data structures to 64 bytes
- Profile Real Workloads: Cache behavior varies by data
Related Concepts
- Memory Access Patterns: How patterns affect cache efficiency
- Virtual Memory: Page-level memory management
- Memory Interleaving: Address mapping to banks
- NUMA Architecture: Cache coherency across sockets
Conclusion
Cache lines are fundamental to CPU performance. By understanding that memory moves in 64-byte chunks, you can write code that works with the hardware rather than against it. The 10-100x performance difference between cache-friendly and cache-unfriendly code makes this one of the most important optimizations in performance-critical applications.
Remember: Think in cache lines, not bytes!