TLB: How CPUs Translate Virtual to Physical Memory


Deep dive into Translation Lookaside Buffers - the critical cache that makes virtual memory fast. Interactive visualizations of address translation, page walks, and TLB management.


What is a TLB?

The Translation Lookaside Buffer (TLB) is a specialized cache that stores recent translations from virtual memory addresses to physical memory addresses. Without TLBs, every memory access would require multiple additional memory lookups just to find where data actually lives in RAM - making programs 5-10x slower!

Think of the TLB as a GPS cache: instead of calculating the route from your address to a destination every time, it remembers recent routes. When you visit the same places frequently (which programs do with memory), this cache saves enormous amounts of time.
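
To make this concrete, here is a minimal sketch in C of what a TLB conceptually holds: a toy direct-mapped table of recent virtual-page-number (VPN) to physical-page-number (PPN) translations. Real TLBs are set- or fully associative hardware structures; the names and sizes here are illustrative only.

    #include <stdbool.h>
    #include <stdint.h>

    #define TLB_ENTRIES 64   // roughly the size of a typical L1 D-TLB

    // One cached translation: virtual page number -> physical page number.
    struct tlb_entry { uint64_t vpn, ppn; bool valid; };
    static struct tlb_entry tlb[TLB_ENTRIES];

    // Try to translate a virtual address. On a miss, real hardware would
    // perform a page table walk and then refill this entry.
    static bool tlb_translate(uint64_t vaddr, uint64_t *paddr) {
        uint64_t vpn = vaddr >> 12;                    // 4KB pages
        struct tlb_entry *e = &tlb[vpn % TLB_ENTRIES];
        if (e->valid && e->vpn == vpn) {
            *paddr = (e->ppn << 12) | (vaddr & 0xFFF);
            return true;                               // hit: translation was cached
        }
        return false;                                  // miss: page table walk needed
    }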

Why Do We Need Virtual Memory?

Before understanding TLBs, let's understand the problem they solve:

The Virtual Memory Challenge

Modern operating systems use virtual memory to:

  • Isolate processes: Each program thinks it has the entire memory space to itself
  • Security: Processes can't access each other's memory
  • Flexibility: Physical memory can be anywhere, even swapped to disk
  • Convenience: Programs don't need to know about physical addresses

But this creates a problem: every memory access needs translation from virtual to physical addresses. On a 64-bit system, this translation involves:

  1. Splitting the virtual address into multiple parts
  2. Walking through 4-5 levels of page tables
  3. Each level requiring a memory access
  4. Finally getting the physical address

That's 5 memory accesses just to do 1 memory access! This is where TLBs save the day.

TLB Operation Visualization

(Interactive simulator: issues virtual address lookups against a small TLB and displays the cache contents - the VPN, PPN, ASID, valid bit, and use count of each entry - along with running counters for valid entries, hits, and misses.)

How Address Translation Works

Virtual addresses are broken into multiple parts, each used to index into different levels of page tables. Let's see this process in action:

Virtual to Physical Address Translation

(Interactive visualization: splits a sample 48-bit virtual address into its PML4 index (bits 47-39), PDP index (bits 38-30), PD index (bits 29-21), PT index (bits 20-12), and 12-bit page offset, then animates the four table lookups that produce the physical address.)

Note: Each page table lookup requires a memory access. That's why TLB caching is critical - it avoids these 4-5 memory reads!

x86-64 Addressing

  • Only 48 bits used of 64-bit address
  • Bits 63-48 must match bit 47 (sign extension)
  • Supports 256 TB of virtual address space
  • 4KB pages are standard (12-bit offset)

Page Table Levels

  • Each level has 512 entries (9 bits)
  • Each entry is 8 bytes
  • Page table size: 4KB (512 × 8)
  • 5-level paging adds PML5 for 57-bit addressing

Address Components (x86-64)

A 48-bit virtual address is divided into:

  • Bits 47-39: Page Map Level 4 (PML4) index (9 bits = 512 entries)
  • Bits 38-30: Page Directory Pointer (PDP) index
  • Bits 29-21: Page Directory (PD) index
  • Bits 20-12: Page Table (PT) index
  • Bits 11-0: Page offset (4KB pages)

Each level points to the next, creating a tree structure that maps virtual to physical addresses.
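
The field extraction itself is just shifts and masks. Here is a small C program (the address below is an arbitrary example) that pulls out the five components:

    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        uint64_t va = 0x00007FFFBF8C4A28ULL;      // arbitrary example address

        unsigned pml4   = (va >> 39) & 0x1FF;     // bits 47-39
        unsigned pdp    = (va >> 30) & 0x1FF;     // bits 38-30
        unsigned pd     = (va >> 21) & 0x1FF;     // bits 29-21
        unsigned pt     = (va >> 12) & 0x1FF;     // bits 20-12
        unsigned offset =  va        & 0xFFF;     // bits 11-0

        // Prints: PML4=255 PDP=510 PD=508 PT=196 offset=0xA28
        printf("PML4=%u PDP=%u PD=%u PT=%u offset=0x%X\n",
               pml4, pdp, pd, pt, offset);
        return 0;
    }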

The Page Table Walk

When a virtual address isn't in the TLB (a TLB miss), the CPU must perform a page table walk - a multi-step process to find the physical address:

Page Table Walk Process

(Interactive visualization: steps through a full walk for a sample 48-bit address - reading the PML4 base from the CR3 register, then indexing the PML4, PDP, PD, and PT tables at roughly 15 cycles per memory access, and finally adding the 12-bit offset to the page base - while tracking the cycles elapsed and memory reads performed.)

The Walk Process:

  1. Read CR3 Register: Contains physical address of PML4 table
  2. Index PML4: Use bits 47-39 to find PDP table address
  3. Index PDP: Use bits 38-30 to find PD table address
  4. Index PD: Use bits 29-21 to find PT table address
  5. Index PT: Use bits 20-12 to find physical page address
  6. Add Offset: Combine with bits 11-0 for final address

Steps 2-5 each require a memory access. On modern CPUs, a full walk can take 100-500 cycles!
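
The same indexing logic can be written as a short C sketch. This is only a software model, assuming a toy page-table layout in which each 8-byte entry holds a present bit in bit 0 and the page-aligned address of the next table (or final page frame) in the upper bits - real hardware walkers and PTE formats carry many more flags:

    #include <stdint.h>

    #define PRESENT   0x1ULL
    #define ADDR_MASK (~0xFFFULL)   // entries hold page-aligned addresses plus flags

    typedef uint64_t pte_t;

    // Walk a 4-level page table for 'va'. Each iteration is one memory
    // read - exactly the cost the TLB exists to avoid.
    static uint64_t page_walk(const pte_t *pml4, uint64_t va) {
        unsigned shifts[4] = { 39, 30, 21, 12 };   // PML4, PDP, PD, PT
        const pte_t *table = pml4;                 // base comes from CR3 on real hardware
        uint64_t frame = 0;

        for (int level = 0; level < 4; level++) {
            pte_t entry = table[(va >> shifts[level]) & 0x1FF];
            if (!(entry & PRESENT))
                return UINT64_MAX;                 // not mapped: page fault
            frame = entry & ADDR_MASK;             // next table, or final page frame
            table = (const pte_t *)(uintptr_t)frame;
        }
        return frame | (va & 0xFFF);               // physical page base + offset
    }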

TLB Hit vs Miss Performance

The performance difference between TLB hits and misses is dramatic. Let's visualize the impact:

TLB Hit vs Miss Performance Impact

(Interactive simulation: replays a memory access pattern - sequential access shown as the best case - against a small TLB, tracking hits, misses, hit rate, average latency, and overall speedup. Its latency comparison puts a TLB hit at ~0.5 ns, an L1 TLB miss at ~3 ns, a full page walk on an L2 TLB miss at ~80 ns, and a page fault at ~5 ms - roughly 10,000,000× slower than a hit.)

Performance Impact

Why TLB Matters:

  • TLB hit: 0.5-1 CPU cycles
  • Page walk: 100-500 cycles
  • 99%+ hit rate needed for good performance

Optimization Tips:

  • Use huge pages (2MB/1GB)
  • Improve memory locality
  • Minimize working set size

Performance Numbers:

  Scenario                    Latency      Relative Speed
  TLB Hit                     0.5-1 ns     1x (baseline)
  L1 TLB Miss, L2 TLB Hit     2-5 ns       3-5x slower
  L2 TLB Miss (Page Walk)     50-100 ns    50-100x slower
  Page Fault (Disk)           1-10 ms      1,000,000x slower
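
To see why hit rates must be extremely high, here is a back-of-the-envelope calculation in C. The 1 ns hit and 100 ns walk figures are assumed round numbers consistent with the table above:

    #include <stdio.h>

    int main(void) {
        const double hit_ns = 1.0, walk_ns = 100.0;
        const double rates[] = { 0.90, 0.99, 0.999 };

        for (int i = 0; i < 3; i++) {
            // Weighted average of the fast path (hit) and slow path (walk).
            double avg = rates[i] * hit_ns + (1.0 - rates[i]) * walk_ns;
            printf("hit rate %5.1f%% -> %.2f ns per access on average\n",
                   rates[i] * 100.0, avg);
        }
        return 0;
    }

Even at a 99% hit rate the average access is roughly twice as slow as a pure hit, and at 90% it is an order of magnitude slower - which is why real workloads need hit rates well above 99%.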

TLB Architecture

Modern CPUs have multiple TLB levels, similar to cache hierarchies:

L1 TLB (Fastest, Smallest)

  • I-TLB: 64-128 entries for instruction pages
  • D-TLB: 64-128 entries for data pages
  • Latency: 0.5-1 cycle
  • Fully associative or 4-8 way set-associative

L2 TLB (Unified)

  • Size: 512-2048 entries
  • Latency: 5-7 cycles
  • Shared between instructions and data
  • 8-16 way set-associative

Page Size Support

  • 4KB pages: Standard, most common
  • 2MB pages: Large pages (fewer TLB entries needed)
  • 1GB pages: Huge pages (even fewer entries)

Larger pages mean fewer TLB entries needed to map the same amount of memory!
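
One way to quantify this is "TLB reach": the amount of memory a fully populated TLB can map at once. The 1536-entry capacity below is just an assumed round figure for a combined L1+L2 TLB:

    #include <stdio.h>

    int main(void) {
        const long long entries = 1536;                            // assumed TLB capacity
        const long long sizes[] = { 4LL << 10, 2LL << 20, 1LL << 30 };
        const char *names[]     = { "4KB", "2MB", "1GB" };

        // Reach = number of entries x page size, printed in megabytes.
        for (int i = 0; i < 3; i++)
            printf("%s pages: TLB reach = %lld MB\n",
                   names[i], entries * sizes[i] >> 20);
        return 0;
    }

With 4KB pages that capacity covers only 6 MB; with 2MB pages it covers 3 GB, and with 1GB pages about 1.5 TB.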

TLB Management Strategies

1. TLB Flushing

When virtual-to-physical mappings change, TLB entries become stale and must be invalidated:

    // Full TLB flush (expensive!)
    __asm__ volatile("mov %%cr3, %%rax; mov %%rax, %%cr3" ::: "rax");

    // Single page flush (better)
    __asm__ volatile("invlpg (%0)" :: "r"(virtual_address));

2. ASID/PCID (Address Space IDs)

Modern CPUs tag TLB entries with process IDs, avoiding flushes on context switches:

  • Intel: PCID (Process Context ID)
  • ARM: ASID (Address Space ID)
  • Allows multiple processes' mappings in TLB simultaneously
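
Conceptually, the tag simply becomes part of the match, as in this small C sketch (a software model for illustration, not how the hardware is wired):

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        uint64_t vpn, ppn;
        uint16_t asid;       // which address space owns this translation
        bool     valid;
    };

    // A hit now requires the ASID to match too, so entries belonging to
    // other processes can stay resident across context switches.
    static bool tlb_lookup(const struct tlb_entry *tlb, int n,
                           uint16_t current_asid, uint64_t vpn, uint64_t *ppn) {
        for (int i = 0; i < n; i++) {
            if (tlb[i].valid && tlb[i].asid == current_asid && tlb[i].vpn == vpn) {
                *ppn = tlb[i].ppn;
                return true;
            }
        }
        return false;        // miss: walk the current process's page tables
    }

On a context switch the OS just switches the current ASID/PCID rather than flushing, so another process's entries can survive until they are naturally evicted.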

3. TLB Shootdown

In multicore systems, when one core changes page tables, it must invalidate TLB entries on other cores:

  1. Core 0 modifies page table
  2. Core 0 sends Inter-Processor Interrupt (IPI) to other cores
  3. Other cores flush relevant TLB entries
  4. Acknowledge completion
  5. Core 0 continues

This is expensive and can cause system-wide stalls!

Optimizing for TLB Performance

1. Use Huge Pages

    # Linux: Enable transparent huge pages
    echo always > /sys/kernel/mm/transparent_hugepage/enabled

    # Allocate huge pages explicitly
    echo 1024 > /proc/sys/vm/nr_hugepages

Benefits:

  • 2MB page covers 512× more memory than 4KB page
  • Dramatically reduces TLB pressure
  • Essential for large memory applications
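
From application code, one way to request huge pages on Linux is sketched below (minimal error handling; mmap with MAP_HUGETLB requires pages reserved via nr_hugepages, while madvise(MADV_HUGEPAGE) only hints that transparent huge pages should back the range):

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void) {
        size_t len = 64UL << 20;                       // 64 MB

        // Try explicitly reserved huge pages first.
        void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

        if (buf == MAP_FAILED) {
            // Fall back to normal 4KB pages and hint the kernel to back
            // the range with transparent huge pages instead.
            buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) { perror("mmap"); return 1; }
            madvise(buf, len, MADV_HUGEPAGE);
        }
        printf("64 MB buffer mapped at %p\n", buf);
        return 0;
    }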

2. Improve Locality

    // Bad: Random access pattern
    for (int i = 0; i < N; i++) {
        sum += data[random_index[i]];   // TLB thrashing
    }

    // Good: Sequential access
    for (int i = 0; i < N; i++) {
        sum += data[i];                 // TLB friendly
    }

3. Minimize Working Set

Keep frequently accessed data together:

    // Bad: Sparse data structure
    struct BadNode {
        int value;
        char padding[4088];   // Forces new page per node
    };

    // Good: Dense data structure
    struct GoodNode {
        int value;            // No padding, multiple nodes per page
    };

Real-World TLB Sizes

  CPU                    L1 I-TLB       L1 D-TLB       L2 TLB          Large Page Support
  Intel Core i9-14900K   128 entries    64 entries     2048 entries    2MB, 1GB
  AMD Ryzen 9 7950X      64 entries     72 entries     3072 entries    2MB, 1GB
  Apple M3               192 entries    128 entries    3072 entries    16KB, 2MB
  ARM Cortex-A78         48 entries     48 entries     1280 entries    4KB-1GB

TLB and Security

TLBs have been involved in several security vulnerabilities:

Meltdown (2018)

  • Exploited speculative execution to read kernel memory
  • TLB plays role in address translation during speculation
  • Mitigation: KPTI (Kernel Page Table Isolation)

TLBleed (2018)

  • Side-channel attack using TLB timing
  • Could leak cryptographic keys
  • Mitigation: Disable hyper-threading, timing obfuscation

RIDL/Fallout (2019)

  • Microarchitectural data sampling
  • TLB-related buffers could leak data
  • Mitigation: Microcode updates, flushing buffers

Common TLB Issues and Solutions

Problem 1: TLB Thrashing

Symptom: High TLB miss rate, poor performance

Solution:

  • Use huge pages
  • Improve memory access patterns
  • Reduce working set size

Problem 2: Context Switch Overhead

Symptom: Performance drops with many processes

Solution:

  • Enable PCID/ASID
  • Use process affinity
  • Reduce context switch frequency

Problem 3: NUMA Effects

Symptom: Inconsistent memory performance

Solution:

  • NUMA-aware memory allocation
  • Process pinning to NUMA nodes
  • Local page allocation policies

Monitoring TLB Performance

Linux Perf Events

    # Monitor TLB misses
    perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses ./program

    # Detailed TLB statistics
    perf record -e tlb:tlb_flush ./program
    perf report

Key Metrics to Watch:

  • TLB Miss Rate: Should be < 1% for good performance
  • Page Walk Duration: Average cycles per walk
  • TLB Flush Frequency: High frequency indicates problems
  • Huge Page Utilization: More is generally better
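
Counters can also be read programmatically. Below is a sketch using perf_event_open(2) to count dTLB load misses around a region of code (Linux-specific; the cache-event encoding follows the man page):

    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type   = PERF_TYPE_HW_CACHE;
        attr.size   = sizeof(attr);
        attr.config = PERF_COUNT_HW_CACHE_DTLB |
                      (PERF_COUNT_HW_CACHE_OP_READ << 8) |
                      (PERF_COUNT_HW_CACHE_RESULT_MISS << 16);
        attr.disabled       = 1;
        attr.exclude_kernel = 1;

        // Count dTLB load misses for this process on any CPU.
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd < 0) { perror("perf_event_open"); return 1; }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* ... run the workload to be measured here ... */

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long misses = 0;
        read(fd, &misses, sizeof(misses));
        printf("dTLB load misses: %lld\n", misses);
        return 0;
    }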

Future of TLBs

  1. Larger TLBs: More entries to handle growing memory
  2. Better Prefetching: Predicting TLB misses
  3. Hardware Page Walk Acceleration: Faster miss handling
  4. More Page Sizes and Levels: Greater flexibility (e.g., Intel 5-level paging for 57-bit addresses)
  5. Persistent Memory Support: New translation mechanisms

Key Takeaways

TLB Essentials

  • Purpose: Cache virtual→physical translations
  • Hit Rate: 99%+ for good performance
  • Miss Penalty: 50-100× slower than hit
  • Huge Pages: Reduce TLB pressure dramatically
  • Levels: L1 (fast/small) and L2 (larger)
  • Management: Hardware with OS assistance
  • Optimization: Locality and huge pages
  • Monitoring: Use perf events

The TLB is a perfect example of how a small cache can have enormous system-wide impact. By storing just a few thousand translations, it makes virtual memory practical, enabling the process isolation and memory protection we rely on every day.

If you found this explanation helpful, consider sharing it with others.
