CPU Cache Lines: The Unit of Memory Transfer

Explore how CPU cache lines work, understand spatial locality, and see why memory access patterns dramatically impact performance through interactive visualizations.

Best viewed on desktop for optimal interactive experience

What are CPU Cache Lines?

A cache line is the unit of data transfer between main memory and the CPU cache. When your CPU needs data from memory, it doesn't fetch just the bytes you requested—it fetches an entire cache line, typically 64 bytes on modern processors. This design exploits spatial locality: the principle that if you access a memory location, you're likely to access nearby locations soon.

Understanding cache lines is crucial for writing high-performance code. The difference between cache-friendly and cache-unfriendly access patterns can be 10-100x in performance!
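
Curious what the line size is on your own machine? A minimal sketch, assuming Linux with glibc and a C++17 toolchain (both the sysconf query and the library constant are platform-dependent and may be unavailable elsewhere):

#include <iostream>
#include <new>        // std::hardware_destructive_interference_size (C++17, optional feature)
#include <unistd.h>   // sysconf (POSIX / Linux)

int main() {
    // Cache line size as reported by the OS (glibc; returns 0 or -1 if unknown)
    std::cout << "L1D cache line size (sysconf): "
              << sysconf(_SC_LEVEL1_DCACHE_LINESIZE) << " bytes\n";

#ifdef __cpp_lib_hardware_interference_size
    // Compile-time hint from the standard library, usually 64 on x86-64
    std::cout << "hardware_destructive_interference_size: "
              << std::hardware_destructive_interference_size << " bytes\n";
#endif
}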

Cache Line Concepts

Cache Line Size

64 bytes = 8 × 8-byte elements

Modern CPUs typically use 64-byte cache lines

Spatial Locality

Access one element, get 7 neighbors free!

CPU loads entire cache line on miss

Replacement Policy

LRU (Least Recently Used)

Oldest unused line gets evicted

Interactive Cache Line Visualization

Experience how different memory access patterns interact with CPU cache lines:

[Interactive visualization: a sequence of array accesses (elements [0]–[15]) flows through a 4-line L1 data cache (4 lines × 64 bytes = 256 bytes, with a per-line LRU age) backed by a 64-element main memory divided into 8 cache-line-sized blocks. The demo counts cache hits, misses, hit rate, bytes transferred, efficiency, evictions, and bytes wasted for each access pattern. With the sequential pattern (for (i = 0; i < N; i++) array[i]), cache line utilization is excellent: every line load brings in 8 useful elements.]

Key Insights

  • Cache lines are the unit of data transfer between memory and cache (typically 64 bytes)
  • Spatial locality means accessing nearby data is essentially free if it is in the same cache line
  • Sequential access maximizes cache line utilization (8 useful elements per load)
  • Strided access can waste 87.5% of the transferred data with a stride-8 pattern
  • Random access defeats spatial locality, causing frequent cache misses
  • False sharing occurs when multiple threads access different data in the same cache line

Why Cache Lines Matter

The Memory Hierarchy Gap

Modern CPUs can execute instructions in less than a nanosecond, but accessing main memory takes 60-100 nanoseconds. That's a 100x difference! Cache lines help bridge this gap by:

  1. Amortizing Memory Access Cost: Fetching 64 bytes takes barely more time than fetching 8 bytes
  2. Exploiting Spatial Locality: Programs often access nearby data
  3. Enabling Prefetching: Hardware can predict and load future cache lines
  4. Maximizing Memory Bandwidth: Efficient use of memory bus width

Cache Line Size

Modern x86-64 processors use 64-byte cache lines, which means:

  • 8 × 64-bit integers or doubles
  • 16 × 32-bit integers or floats
  • 64 × 8-bit characters

This size is carefully chosen to balance:

  • Transfer efficiency: Larger lines amortize memory latency
  • Cache pollution: Smaller lines waste less space on unused data
  • False sharing: Larger lines increase conflict probability
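
If you want the compiler to keep you honest about these counts, a tiny sketch with static_asserts does the trick (the 64-byte figure is an assumption about the target CPU, not something the C++ language guarantees):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;  // assumed x86-64 cache line size

static_assert(kCacheLine / sizeof(double)       == 8,  "8 doubles per cache line");
static_assert(kCacheLine / sizeof(std::int32_t) == 16, "16 32-bit ints per cache line");
static_assert(kCacheLine / sizeof(char)         == 64, "64 chars per cache line");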

Memory Access Patterns

Sequential Access (Best Case)

// Excellent cache line utilization
for (int i = 0; i < N; i++) {
    sum += array[i];  // Uses all 8 elements per cache line
}

  • Efficiency: Near 100% of transferred bytes are used
  • Prefetcher: Hardware prefetchers love sequential patterns
  • Performance: Maximum memory bandwidth utilization

Strided Access (Poor)

// Wastes 87.5% of each cache line with stride-8
for (int i = 0; i < N; i += 8) {
    sum += array[i];  // Uses only 1 element per cache line
}

  • Efficiency: Only 12.5% of transferred bytes used
  • Prefetcher: May struggle with large strides
  • Performance: 8x more memory traffic than necessary

Random Access (Worst Case)

// Defeats spatial locality entirely
for (int i = 0; i < N; i++) {
    sum += array[random_index()];
}

  • Efficiency: Typically uses 1-2 elements per cache line
  • Prefetcher: Cannot predict random patterns
  • Performance: Maximum cache miss rate
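
To see these three patterns side by side on real hardware, you can time the same reduction with different index orders. A rough micro-benchmark sketch (the array is sized well beyond a typical last-level cache; absolute numbers vary with CPU, compiler flags, and memory, and the strided run performs only N/8 accesses, so compare per-access cost):

#include <algorithm>
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

int main() {
    const std::size_t N = std::size_t{1} << 24;   // 16M int64 elements (~128 MB)
    std::vector<std::int64_t> data(N, 1);

    // Three index orders over the same data: sequential, stride-8, shuffled
    std::vector<std::size_t> seq(N), strided, shuffled(N);
    std::iota(seq.begin(), seq.end(), std::size_t{0});
    for (std::size_t i = 0; i < N; i += 8) strided.push_back(i);   // one element per 64-byte line
    std::iota(shuffled.begin(), shuffled.end(), std::size_t{0});
    std::shuffle(shuffled.begin(), shuffled.end(), std::mt19937{42});

    auto time_sum = [&](const std::vector<std::size_t>& idx, const char* name) {
        auto t0 = std::chrono::steady_clock::now();
        std::int64_t sum = 0;
        for (std::size_t i : idx) sum += data[i];
        auto t1 = std::chrono::steady_clock::now();
        std::cout << name << ": "
                  << std::chrono::duration<double, std::milli>(t1 - t0).count()
                  << " ms  (accesses=" << idx.size() << ", sum=" << sum << ")\n";
    };

    time_sum(seq,      "sequential");
    time_sum(strided,  "stride-8  ");
    time_sum(shuffled, "random    ");
}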

Cache Replacement Policies

When the cache is full, the CPU must decide which cache line to evict. Common policies include:

LRU (Least Recently Used)

Most common in modern CPUs:

  • Tracks usage order of cache lines
  • Evicts the oldest unused line
  • Good for most workloads
  • Some overhead for tracking
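
The behaviour is easy to model in software. A toy sketch of a fully associative 4-line cache with true LRU, matching the visualization above (real caches are set associative and track recency far more cheaply):

#include <cstdint>
#include <iostream>
#include <list>
#include <unordered_map>

// Toy fully associative cache with true LRU replacement; keys are cache-line numbers.
class LruCache {
public:
    explicit LruCache(std::size_t lines) : capacity_(lines) {}

    // Returns true on a hit; on a miss, inserts the line and evicts the LRU line if full.
    bool access(std::uint64_t address) {
        std::uint64_t line = address / 64;                      // 64-byte cache lines
        auto it = map_.find(line);
        if (it != map_.end()) {
            order_.splice(order_.begin(), order_, it->second);  // hit: mark as most recently used
            return true;
        }
        if (map_.size() == capacity_) {                         // full: evict least recently used
            map_.erase(order_.back());
            order_.pop_back();
        }
        order_.push_front(line);
        map_[line] = order_.begin();
        return false;
    }

private:
    std::size_t capacity_;
    std::list<std::uint64_t> order_;   // front = most recent, back = least recent
    std::unordered_map<std::uint64_t, std::list<std::uint64_t>::iterator> map_;
};

int main() {
    LruCache cache(4);                        // 4 lines, like the visualization above
    int hits = 0, misses = 0;
    for (std::uint64_t i = 0; i < 64; i++) {  // 64 sequential 8-byte accesses
        if (cache.access(i * 8)) hits++; else misses++;
    }
    std::cout << "hits=" << hits << " misses=" << misses << "\n";  // expect 56 hits, 8 misses
}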

Pseudo-LRU

Approximates LRU with less overhead:

  • Uses bits to track approximate age
  • Not perfectly accurate but faster
  • Common in L3 caches
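
One widespread form is tree-PLRU: for a 4-way set, three bits form a tiny binary tree whose nodes always point away from recently used ways. A simplified sketch of the idea (actual implementations vary between vendors and cache levels):

#include <iostream>

// Tree-PLRU for a 4-way set: bits[0] picks between the pairs {0,1} and {2,3},
// bits[1] and bits[2] pick a way within each pair.
struct TreePlru {
    bool bits[3] = {false, false, false};

    // Record that `way` (0..3) was just used: flip nodes on its path to point away from it.
    void touch(int way) {
        bits[0] = (way < 2);               // point toward the pair NOT containing `way`
        if (way < 2) bits[1] = (way == 0); // within the left pair, point away from `way`
        else         bits[2] = (way == 2); // within the right pair, point away from `way`
    }

    // Follow the tree to pick the victim (an approximation of least recently used).
    int victim() const {
        if (bits[0]) return bits[2] ? 3 : 2;   // right pair
        else         return bits[1] ? 1 : 0;   // left pair
    }
};

int main() {
    TreePlru set;
    for (int way : {0, 1, 2, 3, 0}) set.touch(way);
    std::cout << "evict way " << set.victim() << "\n";  // not way 0, which was touched last
}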

Performance Implications

Data Structure Layout

Array of Structures (AoS)

struct Point {
    float x, y, z;
    float padding[5];  // 32 bytes total
};
Point points[1000];

// Poor: each 64-byte cache line holds two Points, but only their x fields are used
for (int i = 0; i < 1000; i++) {
    sum += points[i].x;  // 4 useful bytes out of every 32-byte Point
}

Structure of Arrays (SoA)

struct Points {
    float x[1000];
    float y[1000];
    float z[1000];
};
Points points;

// Good: sequential access, full cache line utilization
for (int i = 0; i < 1000; i++) {
    sum += points.x[i];  // uses all 64 bytes of each loaded cache line
}

False Sharing

False sharing occurs when multiple threads access different data in the same cache line:

// BAD: false sharing — both counters sit in the same cache line
struct Counters {
    int thread1_count;  // bytes 0-3
    int thread2_count;  // bytes 4-7 (same cache line!)
};

// GOOD: alignment padding gives each counter its own cache line
struct PaddedCounters {
    alignas(64) int thread1_count;  // own cache line
    alignas(64) int thread2_count;  // own cache line
};
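
A rough way to observe the effect is to hammer both counters from two threads and compare the padded and unpadded layouts. A sketch using atomics so the compiler cannot collapse the loops (the slowdown from false sharing varies by CPU, but the unpadded version is typically several times slower):

#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

struct Unpadded { std::atomic<int> a{0}; std::atomic<int> b{0}; };                  // likely share a cache line
struct Padded   { alignas(64) std::atomic<int> a{0}; alignas(64) std::atomic<int> b{0}; };

template <typename Counters>
double run() {
    Counters c;
    auto work = [](std::atomic<int>& counter) {
        for (int i = 0; i < 50'000'000; i++)
            counter.fetch_add(1, std::memory_order_relaxed);  // each thread bumps only its own field
    };
    auto t0 = std::chrono::steady_clock::now();
    std::thread t1(work, std::ref(c.a)), t2(work, std::ref(c.b));
    t1.join(); t2.join();
    return std::chrono::duration<double, std::milli>(std::chrono::steady_clock::now() - t0).count();
}

int main() {
    std::cout << "unpadded: " << run<Unpadded>() << " ms\n";
    std::cout << "padded:   " << run<Padded>()   << " ms\n";
}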

Optimization Techniques

1. Loop Tiling/Blocking

Process data in cache-sized chunks:

// Process the matrix multiply in 8×8 tiles of C so each tile stays cache-resident
for (int ii = 0; ii < N; ii += 8) {
    for (int jj = 0; jj < N; jj += 8) {
        for (int i = ii; i < ii + 8; i++) {
            for (int j = jj; j < jj + 8; j++) {
                for (int k = 0; k < N; k++) {
                    C[i][j] += A[i][k] * B[k][j];
                }
            }
        }
    }
}

2. Prefetching

Manual prefetching for irregular patterns:

for (int i = 0; i < N; i++) {
    if (i + 8 < N)  // prefetch the element needed ~8 iterations ahead (read access, low temporal locality)
        __builtin_prefetch(&data[indices[i + 8]], 0, 1);
    process(data[indices[i]]);
}

3. Data Structure Padding

Align critical data to cache line boundaries:

struct alignas(64) CriticalData {
    int important_value;
    char padding[60];  // fill the rest of the cache line
};

Measuring Cache Performance

Linux perf

# Measure cache misses
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./program

# Detailed cache analysis
perf stat -d ./program

Key Metrics

  • L1 Cache Hit Rate: Should be >95% for good performance
  • Cache Line Utilization: Bytes used / bytes transferred
  • False Sharing: Watch for high HITM (hit-modified) events

Best Practices

  1. Access Memory Sequentially: Design algorithms for linear access
  2. Keep Working Sets Small: Fit hot data in L1/L2 cache
  3. Align Data Structures: Respect cache line boundaries
  4. Avoid False Sharing: Pad shared data structures
  5. Profile Real Workloads: Cache behavior varies by data

Conclusion

Cache lines are fundamental to CPU performance. By understanding that memory moves in 64-byte chunks, you can write code that works with the hardware rather than against it. The 10-100x performance difference between cache-friendly and cache-unfriendly code makes this one of the most important optimizations in performance-critical applications.

Remember: Think in cache lines, not bytes!

If you found this explanation helpful, consider sharing it with others.
