Virtual Memory & TLB: Complete Guide to Address Translation

Comprehensive guide to virtual memory and TLB with interactive visualizations. Explore page tables, address translation, TLB mechanics, page faults, and performance optimization.


Virtual memory is one of the most important abstractions in modern operating systems, providing each process with its own private address space while efficiently sharing physical memory. The Translation Lookaside Buffer (TLB) makes this abstraction practical by caching address translations, turning what would be 5+ memory accesses into just 1.

Together, they enable memory protection, efficient sharing, and the ability to run programs larger than physical memory - but only because the TLB makes translation fast enough to be practical.

Why Virtual Memory?

Modern operating systems use virtual memory to:

  • Process isolation: Each program sees the entire address space as its own
  • Security: Processes can't read or write each other's memory
  • Flexibility: Physical pages can live anywhere in RAM, or even be swapped to disk
  • Convenience: Programs never need to know about physical addresses

The Problem: Every memory access needs translation from virtual to physical addresses. On x86-64, this involves walking through 4 levels of page tables - that's 5 memory accesses just to do 1 memory access!

The Solution: TLBs cache recent translations, achieving >99% hit rates in practice.

Interactive Virtual Memory Explorer

Visualize the complete translation process - from TLB hit to page fault:

[Interactive visualization: a six-step walkthrough of an access to virtual address 0x10021A3C, showing the x86-64 address breakdown (PML4, PDPT, PD, and PT indices plus the 12-bit page offset), the TLB lookup, the 4-level page table walk with present/accessed/dirty bits and permissions, and the resulting physical memory frames.]

Virtual Memory Fundamentals

Key Concepts

  1. Virtual Address Space: Each process sees a large, contiguous address space (e.g., 48-bit = 256 TB)
  2. Physical Memory: Actual RAM divided into fixed-size frames (typically 4KB)
  3. Pages: Virtual memory divided into fixed-size blocks matching frame size
  4. Page Tables: Multi-level tree structure mapping virtual pages to physical frames
  5. TLB: Small, fast cache storing recent virtual → physical translations

Address Translation

Virtual addresses are split into components:

Virtual Address = VPN × Page Size + Offset

Where:

  • VPN (Virtual Page Number): Maps to physical frame via page tables
  • Offset: Position within the page (stays the same after translation)
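
To make the split concrete, here is a minimal C sketch that extracts the VPN and offset, assuming 4 KB pages (the address value matches the walkthrough above):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

#define PAGE_SHIFT 12                          /* 4 KB pages: 2^12 bytes */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)

int main(void) {
    uint64_t vaddr  = 0x10021A3C;              /* example address from the walkthrough */
    uint64_t vpn    = vaddr >> PAGE_SHIFT;     /* virtual page number: 0x10021 */
    uint64_t offset = vaddr & (PAGE_SIZE - 1); /* offset within the page: 0xA3C */
    printf("VPN = 0x%" PRIx64 ", offset = 0x%" PRIx64 "\n", vpn, offset);
    return 0;
}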

Multi-Level Page Tables

x86-64 uses 4-level page tables to save memory:

Address Breakdown (48-bit virtual address):

  • Bits 39-47: PML4 index (9 bits = 512 entries)
  • Bits 30-38: PDPT index (Page Directory Pointer Table)
  • Bits 21-29: PD index (Page Directory)
  • Bits 12-20: PT index (Page Table)
  • Bits 0-11: Offset (4KB pages = 12 bits)

Each level points to the next, forming a tree; tables are allocated only for regions of the address space that are actually in use.

Why 4 levels? A flat single-level table covering a 48-bit address space would need 2^36 entries × 8 bytes = 512 GB per process just for the page table! Multi-level tables allocate on demand.
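
As an illustration, each of the four indices is just a 9-bit field of the virtual address. A small C sketch that pulls them out with shifts and masks (the address itself is a hypothetical example):

#include <stdio.h>
#include <stdint.h>

/* Each level is indexed by 9 bits (512 entries per table);
   the low 12 bits are the page offset. */
int main(void) {
    uint64_t va = 0x00007F5A12345678ULL;   /* hypothetical user-space address */
    unsigned pml4 = (va >> 39) & 0x1FF;    /* bits 39-47 */
    unsigned pdpt = (va >> 30) & 0x1FF;    /* bits 30-38 */
    unsigned pd   = (va >> 21) & 0x1FF;    /* bits 21-29 */
    unsigned pt   = (va >> 12) & 0x1FF;    /* bits 12-20 */
    unsigned off  =  va        & 0xFFF;    /* bits 0-11  */
    printf("PML4=%u PDPT=%u PD=%u PT=%u offset=0x%x\n", pml4, pdpt, pd, pt, off);
    return 0;
}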

The Translation Process

Scenario 1: TLB Hit (~1-2 cycles)

  1. CPU requests virtual address
  2. TLB contains translation
  3. Physical address returned immediately
  4. Access physical memory

Performance: ~1 nanosecond total

Scenario 2: TLB Miss (~10-20 cycles)

  1. CPU requests virtual address
  2. TLB doesn't contain translation ✗
  3. Hardware page walker traverses 4 page table levels:
    • Read PML4 entry → get PDPT address
    • Read PDPT entry → get PD address
    • Read PD entry → get PT address
    • Read PT entry → get physical frame number
  4. Update TLB with new translation
  5. Access physical memory

Performance: ~10 nanoseconds (10× slower than hit)

Scenario 3: Page Fault (~1-10 million cycles!)

  1. CPU requests virtual address
  2. TLB miss → page table walk
  3. Present bit = 0 - page not in memory!
  4. CPU triggers page fault exception
  5. OS page fault handler:
    • Find free physical frame (or evict a page)
    • Load page from disk/swap
    • Update page table entry (set Present=1)
  6. Resume instruction, retry access
  7. Now TLB miss → page walk succeeds → TLB update
  8. Finally access physical memory

Performance: 1-10 milliseconds (1,000,000× slower than hit!)
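
To see how these three scenarios combine, here is a back-of-the-envelope effective-access-time calculation using the latency figures above; the hit and fault rates are illustrative assumptions:

#include <stdio.h>

/* Effective access time (EAT) as a weighted average of the three scenarios.
   All rates and latencies are illustrative, taken from the figures above. */
int main(void) {
    double hit_rate   = 0.99;   /* assumed TLB hit rate */
    double fault_rate = 1e-6;   /* assumed fraction of accesses that page-fault */
    double t_hit   = 1.0;       /* ns: TLB hit */
    double t_walk  = 10.0;      /* ns: page table walk on TLB miss */
    double t_fault = 1e6;       /* ns (1 ms): major page fault */

    double miss_rate = 1.0 - hit_rate - fault_rate;
    double eat = hit_rate * t_hit + miss_rate * t_walk + fault_rate * t_fault;
    printf("Effective access time: %.2f ns\n", eat);  /* ~2.09 ns */
    return 0;
}

Note that even a one-in-a-million page fault adds a full nanosecond to the average: rare events with million-fold costs dominate quickly.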

TLB: The Critical Performance Component

The TLB (Translation Lookaside Buffer) is a specialized cache that stores recent virtual → physical translations. Without it, every memory access would require 5 memory accesses (4 page table levels + 1 data access) - making programs 5-10× slower!

TLB Architecture

Modern CPUs have multiple TLB levels, similar to cache hierarchies:

L1 TLB (Fastest, Smallest):

  • I-TLB: 64-128 entries for instruction pages
  • D-TLB: 64-128 entries for data pages
  • Latency: 1 cycle (looked up in parallel with the L1 cache access)
  • Fully associative or 4-8 way set-associative

L2 TLB (Unified):

  • Size: 512-2048 entries (shared between instructions and data)
  • Latency: 5-7 cycles
  • 8-16 way set-associative

Page Size Support:

  • 4KB pages: Standard, most common
  • 2MB pages: Large pages (512× larger than 4KB)
  • 1GB pages: Huge pages (262,144× larger than 4KB)

Why TLB Hit Rates Are So High

Programs exhibit temporal locality (reuse recent pages) and spatial locality (access nearby addresses). Since each page is 4KB, accessing just a few variables can keep you within the same page for hundreds of instructions.

Typical hit rates: 98-99.9% for well-behaved programs

Interactive TLB Deep Dive

Explore TLB internals - from set-associative lookup to the dramatic impact of page sizes:

[Interactive visualization: a ten-step walkthrough translating virtual address 0x12345000 through a 2-way set-associative L1 TLB (4 entries, 1-cycle latency) and a 4-way L2 TLB (8 entries, 5-7 cycle latency), including promotion of L2 hits into the L1 TLB.]

💡 Set-Associative TLB Design

TLBs use a set-associative design: low-order bits of the VPN select a set, then all ways in that set are checked in parallel.

L1 TLB: fully-associative or 2-4 way (fast, small). L2 TLB: 4-16 way (larger, slower).

On L2 hit, entry promoted to L1 (temporal locality). On L2 miss, page table walk required.
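
A toy C model of the lookup may help; the sizes, struct fields, and indexing scheme here are illustrative, not taken from any real CPU:

#include <stdint.h>
#include <stdbool.h>

/* Toy set-associative TLB: the low VPN bits select a set, then every way
   in that set is compared (hardware does this in parallel; we loop). */
#define NUM_SETS 16
#define NUM_WAYS 4

struct tlb_entry { bool valid; uint64_t vpn; uint64_t pfn; };
static struct tlb_entry tlb[NUM_SETS][NUM_WAYS];

bool tlb_lookup(uint64_t vpn, uint64_t *pfn_out) {
    unsigned set = vpn & (NUM_SETS - 1);        /* index: low bits of the VPN */
    for (int way = 0; way < NUM_WAYS; way++) {
        struct tlb_entry *e = &tlb[set][way];
        if (e->valid && e->vpn == vpn) {
            *pfn_out = e->pfn;                  /* hit: translation cached */
            return true;
        }
    }
    return false;                               /* miss: walk the page tables */
}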

Page Size Impact on Performance

One of the most powerful TLB optimizations is using larger page sizes:

4KB Pages (Standard)

  • TLB coverage: 64 entries × 4KB = 256 KB
  • For 1 GB workload: Need 262,144 pages
  • TLB hit rate: <1% for random access across the working set (constant thrashing!)

2MB Pages (Large)

  • TLB coverage: 64 entries × 2MB = 128 MB
  • For 1 GB workload: Need 512 pages
  • TLB hit rate: ~12% for the same random-access pattern (512× improvement in coverage!)

1GB Pages (Huge)

  • TLB coverage: 64 entries × 1GB = 64 GB
  • For 1 GB workload: Need 1 page
  • TLB hit rate: >99% (essentially eliminates TLB misses!)

Trade-off: Larger pages = more internal fragmentation (wasted space within pages). Use 2MB for most large-memory applications, 1GB only for huge datasets (databases, ML).

TLB Management

ASID/PCID (Address Space IDs)

Without ASID, every context switch would flush the entire TLB! Modern CPUs tag TLB entries with process IDs:

  • Intel: PCID (Process Context ID)
  • ARM: ASID (Address Space ID)

This allows multiple processes' translations to coexist in the TLB simultaneously.

TLB Shootdown (Multicore Synchronization)

When one core modifies page tables, it must invalidate stale TLB entries on all other cores:

  1. Core 0 modifies page table
  2. Core 0 flushes its own TLB (INVLPG instruction)
  3. Core 0 sends Inter-Processor Interrupts (IPIs) to all other cores
  4. Other cores receive IPI, flush relevant TLB entries
  5. Other cores acknowledge completion
  6. Core 0 resumes (waits for all acks)

Cost: 1,000-5,000 cycles! This is why frequent page table modifications are expensive.

TLB Flushing

Full flush (reload CR3 register): Invalidates entire TLB - very expensive!

Single-page flush (INVLPG instruction): Invalidates one entry - much better.

Smart OSes: Batch invalidations to minimize shootdown cost.
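
For reference, single-page invalidation is a privileged instruction. A kernel might wrap it in inline assembly roughly like this (a sketch in GCC syntax, similar to what Linux uses internally; it only runs in kernel mode):

/* Invalidate the TLB entry covering one page. Privileged: kernel mode only. */
static inline void invlpg(void *addr) {
    __asm__ volatile("invlpg (%0)" : : "r"(addr) : "memory");
}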

Page Replacement Algorithms

When physical memory is full and a page fault occurs, the OS must evict a page:

LRU (Least Recently Used)

  • Evict page unused the longest
  • Good performance but expensive to track perfectly
  • Common approximation: Clock algorithm (Second Chance)

Clock Algorithm (Second Chance)

  • Each page has "referenced" bit (set by hardware on access)
  • Circular list, hand sweeps through pages
  • If referenced=1: clear bit, move to next
  • If referenced=0: evict this page
  • Efficient LRU approximation used in practice (see the sketch below)
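
A minimal C sketch of that sweep, with an illustrative fixed frame count and field names:

#include <stdbool.h>

/* Clock (second-chance) eviction over a circular array of frames.
   `hand` remembers where the previous sweep stopped. */
#define NUM_FRAMES 64

struct frame { bool referenced; int page; };
static struct frame frames[NUM_FRAMES];
static int hand = 0;

/* Returns the index of the frame chosen for eviction. */
int clock_evict(void) {
    for (;;) {
        struct frame *f = &frames[hand];
        if (f->referenced) {
            f->referenced = false;           /* give it a second chance */
            hand = (hand + 1) % NUM_FRAMES;
        } else {
            int victim = hand;               /* untouched since the last sweep */
            hand = (hand + 1) % NUM_FRAMES;
            return victim;
        }
    }
}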

FIFO (First In, First Out)

  • Simple but suboptimal - can evict frequently-used pages
  • Rarely used in modern systems

Memory Mapping

Anonymous Mapping

  • Not backed by file (heap, stack)
  • Zero-initialized on first access
  • Written to swap space if evicted

File-Backed Mapping (mmap)

  • Maps file contents directly into virtual memory
  • Changes written back to file (if MAP_SHARED)
  • Enables efficient file I/O and shared libraries
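
A small C example of a file-backed mapping (the filename data.bin is a placeholder):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("data.bin", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    fstat(fd, &st);

    /* Map the file read-only; pages are loaded lazily on first access. */
    char *p = mmap(NULL, st.st_size, PROT_READ, MAP_PRIVATE, fd, 0);
    if (p == MAP_FAILED) { perror("mmap"); return 1; }

    printf("first byte: %d\n", p[0]);  /* this access page-faults the page in */

    munmap(p, st.st_size);
    close(fd);
    return 0;
}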

Shared Memory

  • Multiple processes map same physical pages
  • Efficient inter-process communication
  • Used for databases, browsers (tab processes), shared libraries

Copy-on-Write (COW)

Optimization for fork() system call:

  1. Parent and child initially share all pages (marked read-only)
  2. On write attempt: page fault!
  3. OS allocates new physical page, copies data
  4. Both processes now have private copies
  5. Only modified pages are actually copied

Benefits: Fast process creation, memory efficient (only copy what's modified)
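
A short C demonstration of the effect; the buffer size is arbitrary, and the child's single write copies only one page:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void) {
    size_t size = 16 * 1024 * 1024;      /* 16 MB = 4096 pages of 4 KB */
    char *buf = malloc(size);
    memset(buf, 'A', size);

    pid_t pid = fork();                  /* cheap: pages shared, marked read-only */
    if (pid == 0) {
        buf[0] = 'B';                    /* COW fault: only this 4 KB page is copied */
        printf("child sees: %c\n", buf[0]);
        _exit(0);
    }
    wait(NULL);
    printf("parent still sees: %c\n", buf[0]);  /* 'A': the copies are private */
    free(buf);
    return 0;
}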

Performance Optimizations

1. Use Huge Pages

Enable in Linux:

# Transparent huge pages (automatic)
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Explicit huge pages
echo 1024 > /proc/sys/vm/nr_hugepages

When to use:

  • Large memory databases (PostgreSQL, Redis)
  • Machine learning training (PyTorch, TensorFlow)
  • Scientific computing (HPC workloads)
  • Any workload with >1GB working set
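
Huge pages can also be requested explicitly from C via mmap with MAP_HUGETLB; this sketch assumes huge pages were reserved with the nr_hugepages setting above:

#define _GNU_SOURCE
#include <stdio.h>
#include <sys/mman.h>

int main(void) {
    size_t size = 2UL * 1024 * 1024;     /* one 2 MB huge page */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED) { perror("mmap(MAP_HUGETLB)"); return 1; }
    printf("huge page mapped at %p\n", p);
    munmap(p, size);
    return 0;
}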

2. Improve Locality

Good: Sequential access keeps you in same pages

for (i = 0; i < N; i++)
    sum += array[i];         // TLB-friendly: sequential access

Bad: Random access thrashes TLB

for (i = 0; i < N; i++)
    sum += array[random[i]]; // TLB-unfriendly: random access

3. NUMA-Aware Allocation

On multi-socket systems, allocate memory on same NUMA node as accessing CPU:

numactl --cpunodebind=0 --membind=0 ./program

4. Prefaulting

Pre-allocate pages before they're needed (avoid page faults in critical sections):

mmap(addr, size, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_POPULATE, fd, 0);

Security Features

ASLR (Address Space Layout Randomization)

  • Randomizes stack, heap, library locations
  • Makes exploits much harder (can't predict addresses)
  • Minimal performance cost

NX/DEP (No Execute)

  • Mark pages as non-executable (page table NX bit)
  • Prevents code injection attacks
  • Hardware-enforced via page table permissions

Guard Pages

  • Unmapped pages around stacks
  • Catch buffer overflows (immediate page fault/crash)
  • Better than silent memory corruption
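
A guard page can be created manually with mprotect; this sketch deliberately crashes on the overflowing write to show the mechanism:

#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    long pagesz = sysconf(_SC_PAGESIZE);
    char *region = mmap(NULL, 2 * pagesz, PROT_READ | PROT_WRITE,
                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (region == MAP_FAILED) { perror("mmap"); return 1; }

    mprotect(region + pagesz, pagesz, PROT_NONE);  /* second page: no access */

    region[pagesz - 1] = 'x';  /* last valid byte: fine */
    region[pagesz]     = 'x';  /* first guard byte: SIGSEGV here, by design */
    return 0;
}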

Common Issues and Solutions

TLB Thrashing

Symptom: High TLB miss rate, poor performance
Solution: Use huge pages, improve locality, reduce working set

Context Switch Overhead

Symptom: Performance drops with many processes
Solution: Enable PCID/ASID, use process affinity, reduce context switches

Page Thrashing

Symptom: Constant disk I/O, system nearly unresponsive
Solution: Add more RAM, reduce working set, kill memory-hungry processes

NUMA Effects

Symptom: Inconsistent performance across runs
Solution: NUMA-aware allocation, process/memory pinning

Monitoring Performance

Linux perf

# Monitor TLB misses
perf stat -e dTLB-load-misses,iTLB-load-misses ./program

# Monitor page faults
perf stat -e page-faults,minor-faults,major-faults ./program

# Detailed profile
perf record -e dTLB-load-misses ./program
perf report

Key Metrics

  • TLB Miss Rate: Should be <1% for good performance
  • Page Fault Rate: Major faults (disk I/O) should be rare
  • Huge Page Utilization: More is better for large workloads
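
Fault counts can also be read programmatically via getrusage; a minimal example:

#include <stdio.h>
#include <sys/resource.h>

/* Minor faults are satisfied from memory; major faults required disk I/O. */
int main(void) {
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    printf("minor faults: %ld, major faults: %ld\n",
           ru.ru_minflt, ru.ru_majflt);
    return 0;
}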

Real-World TLB Sizes

CPU                    L1 I-TLB   L1 D-TLB   L2 TLB   Large Pages
Intel Core i9-14900K   128        64         2048     2MB, 1GB
AMD Ryzen 9 7950X      64         72         3072     2MB, 1GB
Apple M3               192        128        3072     16KB, 2MB
ARM Cortex-A78         48         48         1280     4KB-1GB

(TLB sizes are entry counts.)

Best Practices

  1. Minimize Page Faults: Keep working set in memory, use mlock() for critical pages
  2. Use Huge Pages: For large memory allocations (databases, ML, HPC)
  3. NUMA-Aware Allocation: Place data near processing cores
  4. Prefault Critical Pages: Avoid faults in hot paths (MAP_POPULATE)
  5. Monitor TLB Misses: High miss rates indicate poor locality or need for huge pages
  6. Batch Page Table Modifications: Minimize TLB shootdown overhead

Conclusion

Virtual memory and TLBs are cornerstones of modern computing, enabling the process isolation and memory protection we rely on daily. The TLB is what makes virtual memory practical - without it, the 5× memory access overhead would be unbearable.

Understanding the orders-of-magnitude gaps between TLB hits (~1 ns), page table walks (~10 ns), and page faults (~1 ms) is crucial for system programming. The combination of multi-level page tables (memory efficiency) and multi-level TLBs (speed) provides both the illusion of infinite memory and the reality of practical performance.

Key takeaway: Virtual memory gives us the abstraction. TLBs give us the performance. Together, they're fundamental to everything we do in computing.
