TLB: How CPUs Translate Virtual to Physical Memory


Deep dive into Translation Lookaside Buffers - the critical cache that makes virtual memory fast. Interactive visualizations of address translation, page walks, and TLB management.


What is a TLB?

The Translation Lookaside Buffer (TLB) is a specialized cache that stores recent translations from virtual memory addresses to physical memory addresses. Without TLBs, every memory access would require multiple additional memory lookups just to find where data actually lives in RAM - making programs 5-10x slower!

Think of the TLB as a GPS cache: instead of calculating the route from your address to a destination every time, it remembers recent routes. When you visit the same places frequently (which programs do with memory), this cache saves enormous amounts of time.
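
To make this concrete, here is a minimal sketch in C of what a TLB conceptually holds: a toy direct-mapped table of recent virtual-page-number (VPN) to physical-page-number (PPN) translations. Real TLBs are set- or fully associative hardware structures; the names and sizes here are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64   // roughly the size of a typical L1 D-TLB

    // One cached translation: virtual page number -> physical page number.
    struct tlb_entry { uint64_t vpn, ppn; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    // Try to translate a virtual address. On a miss, real hardware would
    // perform a page table walk and then refill this entry.
    static bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpn = vaddr >> 12;                    // 4KB pages
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn) {
            *paddr = (e->ppn << 12) | (vaddr & 0xFFF);
            return true;                               // hit: translation was cached
        }
        return false;                                  // miss: page table walk needed
    }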

Why Do We Need Virtual Memory?

Before understanding TLBs, let's understand the problem they solve:

The Virtual Memory Challenge

Modern operating systems use virtual memory to:

  • Isolate processes: Each program thinks it has the entire memory space to itself
  • Security: Processes can't access each other's memory
  • Flexibility: Physical memory can be anywhere, even swapped to disk
  • Convenience: Programs don't need to know about physical addresses

But this creates a problem: every memory access needs translation from virtual to physical addresses. On a 64-bit system, this translation involves:

  1. Splitting the virtual address into multiple parts
  2. Walking through 4-5 levels of page tables
  3. Each level requiring a memory access
  4. Finally getting the physical address

That's 5 memory accesses just to do 1 memory access! This is where TLBs save the day.

TLB Operation Visualization

(Interactive simulator: issues virtual address lookups against a small TLB and displays the cache contents - the VPN, PPN, ASID, valid bit, and use count of each entry - along with running counters for valid entries, hits, and misses.)

How Address Translation Works

Virtual addresses are broken into multiple parts, each used to index into different levels of page tables. Let's see this process in action:

Virtual to Physical Address Translation

(Interactive visualization: splits a sample 48-bit virtual address into its PML4 index (bits 47-39), PDP index (bits 38-30), PD index (bits 29-21), PT index (bits 20-12), and 12-bit page offset, then animates the four table lookups that produce the physical address.)

Note: Each page table lookup requires a memory access. That's why TLB caching is critical - it avoids these 4-5 memory reads!

x86-64 Addressing

  • Only 48 bits used of 64-bit address
  • Bits 63-48 must match bit 47 (sign extension)
  • Supports 256 TB of virtual address space
  • 4KB pages are standard (12-bit offset)

Page Table Levels

  • Each level has 512 entries (9 bits)
  • Each entry is 8 bytes
  • Page table size: 4KB (512 × 8)
  • 5-level paging adds PML5 for 57-bit addressing

Address Components (x86-64)

A 48-bit virtual address is divided into:

  • Bits 47-39: Page Map Level 4 (PML4) index (9 bits = 512 entries)
  • Bits 38-30: Page Directory Pointer (PDP) index
  • Bits 29-21: Page Directory (PD) index
  • Bits 20-12: Page Table (PT) index
  • Bits 11-0: Page offset (4KB pages)

Each level points to the next, creating a tree structure that maps virtual to physical addresses.
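
The field extraction itself is just shifts and masks. Here is a small C program (the address below is an arbitrary example) that pulls out the five components:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007FFFBF8C4A28ULL;      // arbitrary example address

        unsigned pml4   = (va >> 39) & 0x1FF;     // bits 47-39
        unsigned pdp    = (va >> 30) & 0x1FF;     // bits 38-30
        unsigned pd     = (va >> 21) & 0x1FF;     // bits 29-21
        unsigned pt     = (va >> 12) & 0x1FF;     // bits 20-12
        unsigned offset =  va        & 0xFFF;     // bits 11-0

        // Prints: PML4=255 PDP=510 PD=508 PT=196 offset=0xA28
        printf("PML4=%u PDP=%u PD=%u PT=%u offset=0x%X\n",
               pml4, pdp, pd, pt, offset);
        return 0;
    }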

The Page Table Walk

When a virtual address isn't in the TLB (a TLB miss), the CPU must perform a page table walk - a multi-step process to find the physical address:

Page Table Walk Process

(Interactive visualization: steps through a full walk for a sample 48-bit address - reading the PML4 base from the CR3 register, then indexing the PML4, PDP, PD, and PT tables at roughly 15 cycles per memory access, and finally adding the 12-bit offset to the page base - while tracking the cycles elapsed and memory reads performed.)

The Walk Process:

  1. Read CR3 Register: Contains physical address of PML4 table
  2. Index PML4: Use bits 47-39 to find PDP table address
  3. Index PDP: Use bits 38-30 to find PD table address
  4. Index PD: Use bits 29-21 to find PT table address
  5. Index PT: Use bits 20-12 to find physical page address
  6. Add Offset: Combine with bits 11-0 for final address

Steps 2-5 each require a memory access. On modern CPUs, a full walk can take 100-500 cycles!
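
The same indexing logic can be written as a short C sketch. This is only a software model, assuming a toy page-table layout in which each 8-byte entry holds a present bit in bit 0 and the page-aligned address of the next table (or final page frame) in the upper bits - real hardware walkers and PTE formats carry many more flags:

    #include <stdint.h>

    #define PRESENT   0x1ULL
    #define ADDR_MASK (~0xFFFULL)   // entries hold page-aligned addresses plus flags

    typedef uint64_t pte_t;

    // Walk a 4-level page table for 'va'. Each iteration is one memory
    // read - exactly the cost the TLB exists to avoid.
    static uint64_t page_walk(const pte_t *pml4, uint64_t va) {
        unsigned shifts[4] = { 39, 30, 21, 12 };   // PML4, PDP, PD, PT
        const pte_t *table = pml4;                 // base comes from CR3 on real hardware
        uint64_t frame = 0;

        for (int level = 0; level < 4; level++) {
            pte_t entry = table[(va >> shifts[level]) & 0x1FF];
            if (!(entry & PRESENT))
                return UINT64_MAX;                 // not mapped: page fault
            frame = entry & ADDR_MASK;             // next table, or final page frame
            table = (const pte_t *)(uintptr_t)frame;
        }
        return frame | (va & 0xFFF);               // physical page base + offset
    }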

TLB Hit vs Miss Performance

The performance difference between TLB hits and misses is dramatic. Let's visualize the impact:

TLB Hit vs Miss Performance Impact

(Interactive simulation: replays a memory access pattern - sequential access shown as the best case - against a small TLB, tracking hits, misses, hit rate, average latency, and overall speedup. Its latency comparison puts a TLB hit at ~0.5 ns, an L1 TLB miss at ~3 ns, a full page walk on an L2 TLB miss at ~80 ns, and a page fault at ~5 ms - roughly 10,000,000× slower than a hit.)

Performance Impact

Why TLB Matters:

  • TLB hit: 0.5-1 CPU cycles
  • Page walk: 100-500 cycles
  • 99%+ hit rate needed for good performance

Optimization Tips:

  • Use huge pages (2MB/1GB)
  • Improve memory locality
  • Minimize working set size

Performance Numbers:

  Scenario                    Latency      Relative Speed
  TLB Hit                     0.5-1 ns     1x (baseline)
  L1 TLB Miss, L2 TLB Hit     2-5 ns       3-5x slower
  L2 TLB Miss (Page Walk)     50-100 ns    50-100x slower
  Page Fault (Disk)           1-10 ms      1,000,000x slower
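
To see why hit rates must be extremely high, here is a back-of-the-envelope calculation in C. The 1 ns hit and 100 ns walk figures are assumed round numbers consistent with the table above:

    #include <stdio.h>

    int main(void) {
        const double hit_ns = 1.0, walk_ns = 100.0;
        const double rates[] = { 0.90, 0.99, 0.999 };

        for (int i = 0; i < 3; i++) {
            // Weighted average of the fast path (hit) and slow path (walk).
            double avg = rates[i] * hit_ns + (1.0 - rates[i]) * walk_ns;
            printf("hit rate %5.1f%% -> %.2f ns per access on average\n",
                   rates[i] * 100.0, avg);
        }
        return 0;
    }

Even at a 99% hit rate the average access is roughly twice as slow as a pure hit, and at 90% it is an order of magnitude slower - which is why real workloads need hit rates well above 99%.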

TLB Architecture

Modern CPUs have multiple TLB levels, similar to cache hierarchies:

L1 TLB (Fastest, Smallest)

  • I-TLB: 64-128 entries for instruction pages
  • D-TLB: 64-128 entries for data pages
  • Latency: 0.5-1 cycle
  • Fully associative or 4-8 way set-associative

L2 TLB (Unified)

  • Size: 512-2048 entries
  • Latency: 5-7 cycles
  • Shared between instructions and data
  • 8-16 way set-associative

Page Size Support

  • 4KB pages: Standard, most common
  • 2MB pages: Large pages (fewer TLB entries needed)
  • 1GB pages: Huge pages (even fewer entries)

Larger pages mean fewer TLB entries needed to map the same amount of memory!
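
One way to quantify this is "TLB reach": the amount of memory a fully populated TLB can map at once. The 1536-entry capacity below is just an assumed round figure for a combined L1+L2 TLB:

    #include <stdio.h>

    int main(void) {
        const long long entries = 1536;                            // assumed TLB capacity
        const long long sizes[] = { 4LL << 10, 2LL << 20, 1LL << 30 };
        const char *names[]     = { "4KB", "2MB", "1GB" };

        // Reach = number of entries x page size, printed in megabytes.
        for (int i = 0; i < 3; i++)
            printf("%s pages: TLB reach = %lld MB\n",
                   names[i], entries * sizes[i] >> 20);
        return 0;
    }

With 4KB pages that capacity covers only 6 MB; with 2MB pages it covers 3 GB, and with 1GB pages about 1.5 TB.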

TLB Management Strategies

1. TLB Flushing

When virtual-to-physical mappings change, TLB entries become stale and must be invalidated:

    // Full TLB flush (expensive!)
    __asm__ volatile("mov %%cr3, %%rax; mov %%rax, %%cr3" ::: "rax");

    // Single page flush (better)
    __asm__ volatile("invlpg (%0)" :: "r"(virtual_address));

2. ASID/PCID (Address Space IDs)

Modern CPUs tag TLB entries with process IDs, avoiding flushes on context switches:

  • Intel: PCID (Process Context ID)
  • ARM: ASID (Address Space ID)
  • Allows multiple processes' mappings in TLB simultaneously
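
Conceptually, the tag simply becomes part of the match, as in this small C sketch (a software model for illustration, not how the hardware is wired):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        uint64_t vpn, ppn;
        uint16_t asid;       // which address space owns this translation
        bool     valid;
    };

    // A hit now requires the ASID to match too, so entries belonging to
    // other processes can stay resident across context switches.
    static bool tlb_lookup(const struct tlb_entry *tlb, int n,
                           uint16_t current_asid, uint64_t vpn, uint64_t *ppn) {
        for (int i = 0; i < n; i++) {
            if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;
                return true;
            }
        }
        return false;        // miss: walk the current process's page tables
    }

On a context switch the OS just switches the current ASID/PCID rather than flushing, so another process's entries can survive until they are naturally evicted.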

3. TLB Shootdown

In multicore systems, when one core changes page tables, it must invalidate TLB entries on other cores:

  1. Core 0 modifies page table
  2. Core 0 sends Inter-Processor Interrupt (IPI) to other cores
  3. Other cores flush relevant TLB entries
  4. Acknowledge completion
  5. Core 0 continues

This is expensive and can cause system-wide stalls!

Optimizing for TLB Performance

1. Use Huge Pages

    # Linux: Enable transparent huge pages
    echo always > /sys/kernel/mm/transparent_hugepage/enabled

    # Allocate huge pages explicitly
    echo 1024 > /proc/sys/vm/nr_hugepages

Benefits:

  • 2MB page covers 512× more memory than 4KB page
  • Dramatically reduces TLB pressure
  • Essential for large memory applications
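
From application code, one way to request huge pages on Linux is sketched below (minimal error handling; mmap with MAP_HUGETLB requires pages reserved via nr_hugepages, while madvise(MADV_HUGEPAGE) only hints that transparent huge pages should back the range):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;                       // 64 MB

        // Try explicitly reserved huge pages first.
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (buf == MAP_FAILED) {
            // Fall back to normal 4KB pages and hint the kernel to back
            // the range with transparent huge pages instead.
            buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }
            madvise(buf, len, MADV_HUGEPAGE);
        }
        printf("64 MB buffer mapped at %p\n", buf);
        return 0;
    }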

2. Improve Locality

    // Bad: Random access pattern
    for (int i = 0; i < N; i++) {
        sum += data[random_index[i]];   // TLB thrashing
    }

    // Good: Sequential access
    for (int i = 0; i < N; i++) {
        sum += data[i];                 // TLB friendly
    }

3. Minimize Working Set

Keep frequently accessed data together:

    // Bad: Sparse data structure
    struct BadNode {
        int value;
        char padding[4088];   // Forces new page per node
    };

    // Good: Dense data structure
    struct GoodNode {
        int value;            // No padding, multiple nodes per page
    };

Real-World TLB Sizes

  CPU                    L1 I-TLB       L1 D-TLB       L2 TLB          Large Page Support
  Intel Core i9-14900K   128 entries    64 entries     2048 entries    2MB, 1GB
  AMD Ryzen 9 7950X      64 entries     72 entries     3072 entries    2MB, 1GB
  Apple M3               192 entries    128 entries    3072 entries    16KB, 2MB
  ARM Cortex-A78         48 entries     48 entries     1280 entries    4KB-1GB

TLB and Security

TLBs have been involved in several security vulnerabilities:

Meltdown (2018)

  • Exploited speculative execution to read kernel memory
  • TLB plays role in address translation during speculation
  • Mitigation: KPTI (Kernel Page Table Isolation)

TLBleed (2018)

  • Side-channel attack using TLB timing
  • Could leak cryptographic keys
  • Mitigation: Disable hyper-threading, timing obfuscation

RIDL/Fallout (2019)

  • Microarchitectural data sampling
  • TLB-related buffers could leak data
  • Mitigation: Microcode updates, flushing buffers

Common TLB Issues and Solutions

Problem 1: TLB Thrashing

Symptom: High TLB miss rate, poor performance

Solution:

  • Use huge pages
  • Improve memory access patterns
  • Reduce working set size

Problem 2: Context Switch Overhead

Symptom: Performance drops with many processes

Solution:

  • Enable PCID/ASID
  • Use process affinity
  • Reduce context switch frequency

Problem 3: NUMA Effects

Symptom: Inconsistent memory performance

Solution:

  • NUMA-aware memory allocation
  • Process pinning to NUMA nodes
  • Local page allocation policies

Monitoring TLB Performance

Linux Perf Events

    # Monitor TLB misses
    perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./program

    # Detailed TLB statistics
    perf record -e tlb:tlb_flush ./program
    perf report

Key Metrics to Watch:

  • TLB Miss Rate: Should be < 1% for good performance
  • Page Walk Duration: Average cycles per walk
  • TLB Flush Frequency: High frequency indicates problems
  • Huge Page Utilization: More is generally better
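
Counters can also be read programmatically. Below is a sketch using perf_event_open(2) to count dTLB load misses around a region of code (Linux-specific; the cache-event encoding follows the man page):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HW_CACHE;
        attr.size   = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        // Count dTLB load misses for this process on any CPU.
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the workload to be measured here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("dTLB load misses: %lld\n", misses);
        return 0;
    }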

Future of TLBs

  1. Larger TLBs: More entries to handle growing memory
  2. Better Prefetching: Predicting TLB misses
  3. Hardware Page Walk Acceleration: Faster miss handling
  4. More Page Sizes and Levels: Greater flexibility (e.g., Intel 5-level paging for 57-bit addresses)
  5. Persistent Memory Support: New translation mechanisms

Key Takeaways

TLB Essentials

  • Purpose: Cache virtual→physical translations
  • Hit Rate: 99%+ for good performance
  • Miss Penalty: 50-100× slower than hit
  • Huge Pages: Reduce TLB pressure dramatically
  • Levels: L1 (fast/small) and L2 (larger)
  • Management: Hardware with OS assistance
  • Optimization: Locality and huge pages
  • Monitoring: Use perf events

The TLB is a perfect example of how a small cache can have enormous system-wide impact. By storing just a few thousand translations, it makes virtual memory practical, enabling the process isolation and memory protection we rely on every day.

If you found this explanation helpful, consider sharing it with others.
