Linux Memory Management: Virtual Memory, Paging, and Beyond

22 min

Explore Linux memory management through interactive visualizations. Understand virtual memory, page tables, TLB, swapping, and memory allocation.

Best viewed on desktop for optimal interactive experience

The Magic of Virtual Memory

Imagine if every program had to manage physical memory directly - chaos would ensue! Programs would overwrite each other, security would be impossible, and memory fragmentation would cripple your system. Enter virtual memory - Linux's elegant solution that gives every process its own private universe of memory.

Virtual memory is like a massive hotel where each guest (process) believes they have the entire building to themselves. The kernel, acting as the hotel manager, secretly maps their room numbers (virtual addresses) to actual rooms (physical addresses). This illusion is so perfect that processes never know they're sharing.

Let's dive into this fascinating world where addresses lie, memory can be larger than RAM, and the CPU's MMU performs millions of translations per second.

Interactive Memory Management

Explore virtual memory translation, page tables, TLB cache, and swapping in action:

[Interactive demo: virtual-to-physical address translation, showing the TLB lookup for each virtual address plus live page fault counts and TLB hit rate]

Virtual Memory Architecture

Address Spaces

Every process gets its own 48-bit virtual address space (on x86_64):

// Virtual address space layout (x86_64 Linux)
// 0x0000000000000000 - 0x00007FFFFFFFFFFF   User space   (128 TB)
// 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF   Kernel space (128 TB)

# View a process's memory map
cat /proc/self/maps

# Example output:
# 00400000-00401000 r-xp ... /usr/bin/cat    # Code segment
# 00601000-00602000 rw-p ... /usr/bin/cat    # Data segment
# 7fff12345000-7fff12366000 rw-p [stack]     # Stack
# 7fff12366000-7fff12368000 r-xp [vdso]      # Virtual syscalls

Memory Regions

struct mm_struct {
    struct vm_area_struct *mmap;         // List of memory regions (VMAs)
    pgd_t *pgd;                          // Page Global Directory
    unsigned long start_code, end_code;  // Code segment
    unsigned long start_data, end_data;  // Data segment
    unsigned long start_brk, brk;        // Heap
    unsigned long start_stack;           // Stack
    unsigned long total_vm;              // Total pages mapped
    unsigned long locked_vm;             // Pages locked in RAM
    unsigned long pinned_vm;             // Pages pinned
    unsigned long data_vm;               // Data pages
    unsigned long exec_vm;               // Executable pages
    unsigned long stack_vm;              // Stack pages
};

Page Tables: The Address Translation

Multi-Level Page Tables

Linux uses 4-level page tables on x86_64 (a fifth level, the P4D, is enabled on CPUs that support 57-bit addressing):

// Virtual address breakdown (48-bit addressing)
// Bits 47-39: PGD index (Page Global Directory)
// Bits 38-30: PUD index (Page Upper Directory)
// Bits 29-21: PMD index (Page Middle Directory)
// Bits 20-12: PTE index (Page Table Entry)
// Bits 11-0:  Offset within the 4 KB page

// Address translation pseudocode -- each level's entry holds the
// physical frame of the next level's table
physical_address translate(virtual_address va) {
    pgd_entry = pgd_table[PGD_INDEX(va)];
    if (!pgd_entry.present) page_fault();

    pud_table = table_at(pgd_entry.frame);
    pud_entry = pud_table[PUD_INDEX(va)];
    if (!pud_entry.present) page_fault();

    pmd_table = table_at(pud_entry.frame);
    pmd_entry = pmd_table[PMD_INDEX(va)];
    if (!pmd_entry.present) page_fault();

    page_table = table_at(pmd_entry.frame);
    pte_entry = page_table[PTE_INDEX(va)];
    if (!pte_entry.present) page_fault();

    return (pte_entry.frame << 12) | PAGE_OFFSET(va);
}

Page Table Entry Format

// Page Table Entry (PTE) structure (x86_64)
typedef struct {
    unsigned long present   : 1;   // Page in memory?
    unsigned long rw        : 1;   // Read/write permission
    unsigned long user      : 1;   // User accessible?
    unsigned long pwt       : 1;   // Page write-through
    unsigned long pcd       : 1;   // Page cache disabled
    unsigned long accessed  : 1;   // Recently accessed?
    unsigned long dirty     : 1;   // Modified?
    unsigned long pat       : 1;   // Page attribute table
    unsigned long global    : 1;   // Global page
    unsigned long available : 3;   // Free for OS use
    unsigned long frame     : 40;  // Physical frame number
    unsigned long reserved  : 11;  // Reserved
    unsigned long nx        : 1;   // No-execute
} pte_t;

TLB: The Speed Cache

The Translation Lookaside Buffer caches recent address translations:

# View TLB statistics
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses command

# Example output:
#   1,234,567  dTLB-loads
#      12,345  dTLB-load-misses   # 1.0% miss rate
#     987,654  iTLB-loads
#       1,234  iTLB-load-misses   # 0.1% miss rate

TLB Management

// Flush a single TLB entry
static inline void flush_tlb_page(unsigned long addr)
{
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}

// Flush the entire TLB by reloading CR3
static inline void flush_tlb_all(void)
{
    unsigned long cr3;
    asm volatile("mov %%cr3, %0" : "=r" (cr3));
    asm volatile("mov %0, %%cr3" : : "r" (cr3));
}

// Context-switch TLB handling (simplified)
void switch_mm(struct mm_struct *prev, struct mm_struct *next)
{
    if (cpu_has_pcid) {
        // PCID/ASID tags TLB entries per address space,
        // avoiding a full flush on every context switch
        write_cr3(__pa(next->pgd) | next->context.ctx_id);
    } else {
        load_cr3(next->pgd);  // Implicit full TLB flush
    }
}

Memory Allocation

Kernel Memory Allocation

// kmalloc - physically contiguous memory
void *buf = kmalloc(size, GFP_KERNEL);
kfree(buf);

// vmalloc - virtually contiguous (may not be physically contiguous)
void *vbuf = vmalloc(size);
vfree(vbuf);

// Page allocation (2^order contiguous pages)
struct page *page = alloc_pages(GFP_KERNEL, order);
__free_pages(page, order);

// Slab allocator for fixed-size objects
struct kmem_cache *cache = kmem_cache_create("my_cache",
                                sizeof(struct my_struct), 0,
                                SLAB_HWCACHE_ALIGN, NULL);
void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
kmem_cache_free(cache, obj);

User Space Allocation

// brk/sbrk - extend the heap
void *old_brk = sbrk(0);   // Current program break
sbrk(4096);                // Extend by 4 KB
void *new_brk = sbrk(0);   // New break position

// mmap - map memory
void *addr = mmap(NULL,    // Kernel chooses address
                  length,  // Size
                  PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS,
                  -1,      // No file
                  0);      // Offset
munmap(addr, length);

// Memory locking
mlock(addr, length);       // Keep resident (no swap)
munlock(addr, length);
mlockall(MCL_CURRENT | MCL_FUTURE);  // Lock all current and future pages

Demand Paging & Page Faults

Page Fault Handler

// Simplified page fault handler
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();  // Faulting address
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    pte_t *pte;

    // Find the VMA containing the address
    vma = find_vma(mm, address);
    if (!vma || vma->vm_start > address)
        goto bad_area;                   // Not mapped at all

    // Check permissions
    if (error_code & PF_WRITE) {
        if (!(vma->vm_flags & VM_WRITE))
            goto bad_area;
    }

    // Walk the page tables to the PTE (helper name illustrative),
    // then handle the fault based on its type
    pte = lookup_pte(mm, address);
    if (!pte_present(*pte)) {
        if (pte_swapped(*pte))
            do_swap_page(vma, address, pte);       // Bring back from swap
        else
            do_anonymous_page(vma, address, pte);  // First touch: new page
    } else if (!pte_write(*pte)) {
        do_wp_page(vma, address, pte);             // Copy-on-write
    }
    return;

bad_area:
    send_sig(SIGSEGV, current, 0);       // Segmentation fault
}

Types of Page Faults

# Monitor page faults
perf stat -e page-faults,minor-faults,major-faults command

# Types:
#   Minor fault   - page already in memory, just needs a mapping
#   Major fault   - page must be loaded from disk
#   Invalid fault - no valid mapping: segmentation fault

# View page fault stats
grep pgfault /proc/vmstat

Swapping & Page Replacement

Swap Management

# View swap usage
free -h
swapon --show

# Create a swap file
dd if=/dev/zero of=/swapfile bs=1G count=4
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Swappiness (0-100, default 60)
cat /proc/sys/vm/swappiness
echo 10 > /proc/sys/vm/swappiness   # Prefer reclaiming page cache over swapping

# View swap statistics
vmstat 1
# si - swap in  (KB/s)
# so - swap out (KB/s)

Page Replacement Algorithms

// Linux uses a variant of LRU (Least Recently Used)
// with separate active and inactive lists
struct lruvec {
    struct list_head lists[NR_LRU_LISTS];
    // LRU_INACTIVE_ANON - Inactive anonymous pages
    // LRU_ACTIVE_ANON   - Active anonymous pages
    // LRU_INACTIVE_FILE - Inactive file pages
    // LRU_ACTIVE_FILE   - Active file pages
    // LRU_UNEVICTABLE   - Cannot be evicted
};

// Page reclaim algorithm (simplified)
void shrink_page_list(struct list_head *page_list)
{
    list_for_each_entry(page, page_list, lru) {
        if (PageActive(page))
            continue;                    // Skip active pages
        if (PageDirty(page)) {
            if (pageout(page) == PAGE_SUCCESS)
                continue;                // Written to swap/disk
        }
        if (page_mapped(page))
            try_to_unmap(page);          // Remove mappings
        if (page_count(page) == 0)
            free_page(page);             // Reclaim page
    }
}

Memory Zones

Linux divides physical memory into zones:

enum zone_type {
    ZONE_DMA,      // 0-16 MB (ISA DMA)
    ZONE_DMA32,    // 0-4 GB (32-bit DMA)
    ZONE_NORMAL,   // Normal memory
    ZONE_HIGHMEM,  // >896 MB on 32-bit
    ZONE_MOVABLE,  // Movable pages
    ZONE_DEVICE,   // Device memory
};

// Zone watermarks
struct zone {
    unsigned long watermark[NR_WMARK];
    // WMARK_MIN  - Minimum free pages
    // WMARK_LOW  - Start background reclaim (kswapd)
    // WMARK_HIGH - Stop background reclaim
};

# View zones
cat /proc/zoneinfo

Copy-on-Write (COW)

COW delays copying until necessary:

// fork() creates COW mappings:
// parent and child share all pages, marked read-only;
// the actual copy happens on the first write

// COW page fault handler (simplified)
void do_wp_page(struct vm_area_struct *vma,
                unsigned long address, pte_t *pte)
{
    struct page *old_page = pte_page(*pte);
    struct page *new_page;

    // Check if we're the only user
    if (page_count(old_page) == 1) {
        *pte = pte_mkwrite(*pte);   // We own it - just make it writable
        return;
    }

    // Otherwise copy the page
    new_page = alloc_page(GFP_KERNEL);
    copy_page(new_page, old_page);

    // Point the PTE at the new, writable copy
    *pte = mk_pte(new_page, vma->vm_page_prot);
    *pte = pte_mkwrite(*pte);
}

Huge Pages

Reduce TLB pressure with larger pages:

# Transparent Huge Pages (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Static huge pages
echo 100 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt/hugepages

# View huge page info
grep Huge /proc/meminfo

// Explicitly allocate huge pages
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                  -1, 0);

# Check whether a process uses huge pages
grep AnonHugePages /proc/[pid]/smaps

OOM Killer

When memory is exhausted:

// OOM score calculation (simplified)
long oom_badness(struct task_struct *p)
{
    long points;

    // Base score on resident and swapped-out memory
    points = get_mm_rss(p->mm) +
             get_mm_counter(p->mm, MM_SWAPENTS);

    // Adjust for oom_score_adj (-1000 to 1000)
    points *= 1000 + p->signal->oom_score_adj;
    points /= 1000;
    return points;
}

# Tune the OOM killer per process
echo -1000 > /proc/[pid]/oom_score_adj   # Exempt from the OOM killer
echo 1000  > /proc/[pid]/oom_score_adj   # First to die

Memory Debugging

Tools and Techniques

# Valgrind - memory leak detection
valgrind --leak-check=full ./program

# AddressSanitizer
gcc -fsanitize=address -g program.c
./a.out

# System memory info
cat /proc/meminfo
vmstat -s
smem -tk                       # Per-process memory

# Memory maps
pmap -x [pid]
cat /proc/[pid]/smaps_rollup

Common Issues

// Use-after-free
char *ptr = malloc(100);
free(ptr);
*ptr = 'x';               // BUG! ptr no longer owns that memory

// Double free
free(ptr);
free(ptr);                // BUG! undefined behavior

// Memory leak
while (1) {
    malloc(100);          // Never freed
}

// Stack overflow
void recursive(void) {
    char large[1000000];  // Too big for the stack
    recursive();
}

Best Practices

  1. Use appropriate allocators - kmalloc for small, vmalloc for large
  2. Lock critical pages - Prevent swapping of sensitive data
  3. Monitor memory pressure - Watch for excessive swapping
  4. Use huge pages - For large memory applications
  5. Handle OOM gracefully - Set appropriate oom_score_adj
  6. Profile memory usage - Find and fix leaks early
  7. Understand NUMA - Optimize for memory locality
  8. Tune vm parameters - Adjust swappiness, dirty ratios

Conclusion

Linux memory management is a masterpiece of engineering, seamlessly juggling virtual addresses, physical pages, and storage to create the illusion of infinite, private memory for each process. From the lightning-fast TLB to the clever COW optimization, every component works in harmony.

The interactive visualizations showed how addresses translate, pages swap, and memory pressure triggers the delicate dance of page replacement. Understanding these mechanisms helps you write more efficient programs and debug the trickiest memory issues.

Remember: in Linux, memory is virtual, pages are lazy, and the OOM killer is always watching.
