Linux Memory Management: Virtual Memory, Paging, and Beyond

22 min

Explore Linux memory management through interactive visualizations. Understand virtual memory, page tables, TLB, swapping, and memory allocation.

Best viewed on desktop for optimal interactive experience

The Magic of Virtual Memory

Imagine if every program had to manage physical memory directly - chaos would ensue! Programs would overwrite each other, security would be impossible, and memory fragmentation would cripple your system. Enter virtual memory - Linux's elegant solution that gives every process its own private universe of memory.

Virtual memory is like a massive hotel where each guest (process) believes they have the entire building to themselves. The kernel, acting as the hotel manager, secretly maps their room numbers (virtual addresses) to actual rooms (physical addresses). This illusion is so perfect that processes never know they're sharing.

Let's dive into this fascinating world where addresses lie, memory can be larger than RAM, and the CPU's MMU performs millions of translations per second.

Interactive Memory Management

Explore virtual memory translation, page tables, TLB cache, and swapping in action:

[Interactive demo: virtual-to-physical address translation, showing the TLB lookup for each virtual address plus live page fault counts and TLB hit rate]

Virtual Memory Architecture

Address Spaces

Every process gets its own 48-bit virtual address space (on x86_64):

// Virtual address space layout (x86_64 Linux)
// 0x0000000000000000 - 0x00007FFFFFFFFFFF   User space   (128 TB)
// 0xFFFF800000000000 - 0xFFFFFFFFFFFFFFFF   Kernel space (128 TB)

# View a process's memory map
cat /proc/self/maps

# Example output:
# 00400000-00401000 r-xp ... /usr/bin/cat    # Code segment
# 00601000-00602000 rw-p ... /usr/bin/cat    # Data segment
# 7fff12345000-7fff12366000 rw-p [stack]     # Stack
# 7fff12366000-7fff12368000 r-xp [vdso]      # Virtual syscalls

Memory Regions

struct mm_struct {
    struct vm_area_struct *mmap;         // List of memory regions (VMAs)
    pgd_t *pgd;                          // Page Global Directory
    unsigned long start_code, end_code;  // Code segment
    unsigned long start_data, end_data;  // Data segment
    unsigned long start_brk, brk;        // Heap
    unsigned long start_stack;           // Stack
    unsigned long total_vm;              // Total pages mapped
    unsigned long locked_vm;             // Pages locked in RAM
    unsigned long pinned_vm;             // Pages pinned
    unsigned long data_vm;               // Data pages
    unsigned long exec_vm;               // Executable pages
    unsigned long stack_vm;              // Stack pages
};

Page Tables: The Address Translation

Multi-Level Page Tables

Linux uses 4-level page tables on x86_64 (a fifth level, the P4D, is enabled on CPUs that support 57-bit addressing):

// Virtual address breakdown (48-bit addressing)
// Bits 47-39: PGD index (Page Global Directory)
// Bits 38-30: PUD index (Page Upper Directory)
// Bits 29-21: PMD index (Page Middle Directory)
// Bits 20-12: PTE index (Page Table Entry)
// Bits 11-0:  Offset within the 4 KB page

// Address translation pseudocode -- each level's entry holds the
// physical frame of the next level's table
physical_address translate(virtual_address va) {
    pgd_entry = pgd_table[PGD_INDEX(va)];
    if (!pgd_entry.present) page_fault();

    pud_table = table_at(pgd_entry.frame);
    pud_entry = pud_table[PUD_INDEX(va)];
    if (!pud_entry.present) page_fault();

    pmd_table = table_at(pud_entry.frame);
    pmd_entry = pmd_table[PMD_INDEX(va)];
    if (!pmd_entry.present) page_fault();

    page_table = table_at(pmd_entry.frame);
    pte_entry = page_table[PTE_INDEX(va)];
    if (!pte_entry.present) page_fault();

    return (pte_entry.frame << 12) | PAGE_OFFSET(va);
}

Page Table Entry Format

// Page Table Entry (PTE) structure (x86_64)
typedef struct {
    unsigned long present   : 1;   // Page in memory?
    unsigned long rw        : 1;   // Read/write permission
    unsigned long user      : 1;   // User accessible?
    unsigned long pwt       : 1;   // Page write-through
    unsigned long pcd       : 1;   // Page cache disabled
    unsigned long accessed  : 1;   // Recently accessed?
    unsigned long dirty     : 1;   // Modified?
    unsigned long pat       : 1;   // Page attribute table
    unsigned long global    : 1;   // Global page
    unsigned long available : 3;   // Free for OS use
    unsigned long frame     : 40;  // Physical frame number
    unsigned long reserved  : 11;  // Reserved
    unsigned long nx        : 1;   // No-execute
} pte_t;

TLB: The Speed Cache

The Translation Lookaside Buffer caches recent address translations:

# View TLB statistics
perf stat -e dTLB-loads,dTLB-load-misses,iTLB-loads,iTLB-load-misses command

# Example output:
#   1,234,567  dTLB-loads
#      12,345  dTLB-load-misses   # 1.0% miss rate
#     987,654  iTLB-loads
#       1,234  iTLB-load-misses   # 0.1% miss rate

TLB Management

// Flush a single TLB entry
static inline void flush_tlb_page(unsigned long addr)
{
    asm volatile("invlpg (%0)" ::"r" (addr) : "memory");
}

// Flush the entire TLB by reloading CR3
static inline void flush_tlb_all(void)
{
    unsigned long cr3;
    asm volatile("mov %%cr3, %0" : "=r" (cr3));
    asm volatile("mov %0, %%cr3" : : "r" (cr3));
}

// Context-switch TLB handling (simplified)
void switch_mm(struct mm_struct *prev, struct mm_struct *next)
{
    if (cpu_has_pcid) {
        // PCID/ASID tags TLB entries per address space,
        // avoiding a full flush on every context switch
        write_cr3(__pa(next->pgd) | next->context.ctx_id);
    } else {
        load_cr3(next->pgd);  // Implicit full TLB flush
    }
}

Memory Allocation

Kernel Memory Allocation

// kmalloc - physically contiguous memory
void *buf = kmalloc(size, GFP_KERNEL);
kfree(buf);

// vmalloc - virtually contiguous (may not be physically contiguous)
void *vbuf = vmalloc(size);
vfree(vbuf);

// Page allocation (2^order contiguous pages)
struct page *page = alloc_pages(GFP_KERNEL, order);
__free_pages(page, order);

// Slab allocator for fixed-size objects
struct kmem_cache *cache = kmem_cache_create("my_cache",
                                sizeof(struct my_struct), 0,
                                SLAB_HWCACHE_ALIGN, NULL);
void *obj = kmem_cache_alloc(cache, GFP_KERNEL);
kmem_cache_free(cache, obj);

User Space Allocation

// brk/sbrk - extend the heap
void *old_brk = sbrk(0);   // Current program break
sbrk(4096);                // Extend by 4 KB
void *new_brk = sbrk(0);   // New break position

// mmap - map memory
void *addr = mmap(NULL,    // Kernel chooses address
                  length,  // Size
                  PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS,
                  -1,      // No file
                  0);      // Offset
munmap(addr, length);

// Memory locking
mlock(addr, length);       // Keep resident (no swap)
munlock(addr, length);
mlockall(MCL_CURRENT | MCL_FUTURE);  // Lock all current and future pages

Demand Paging & Page Faults

Page Fault Handler

// Simplified page fault handler
void do_page_fault(struct pt_regs *regs, unsigned long error_code)
{
    unsigned long address = read_cr2();  // Faulting address
    struct mm_struct *mm = current->mm;
    struct vm_area_struct *vma;
    pte_t *pte;

    // Find the VMA containing the address
    vma = find_vma(mm, address);
    if (!vma || vma->vm_start > address)
        goto bad_area;                   // Not mapped at all

    // Check permissions
    if (error_code & PF_WRITE) {
        if (!(vma->vm_flags & VM_WRITE))
            goto bad_area;
    }

    // Walk the page tables to the PTE (helper name illustrative),
    // then handle the fault based on its type
    pte = lookup_pte(mm, address);
    if (!pte_present(*pte)) {
        if (pte_swapped(*pte))
            do_swap_page(vma, address, pte);       // Bring back from swap
        else
            do_anonymous_page(vma, address, pte);  // First touch: new page
    } else if (!pte_write(*pte)) {
        do_wp_page(vma, address, pte);             // Copy-on-write
    }
    return;

bad_area:
    send_sig(SIGSEGV, current, 0);       // Segmentation fault
}

Types of Page Faults

# Monitor page faults
perf stat -e page-faults,minor-faults,major-faults command

# Types:
#   Minor fault   - page already in memory, just needs a mapping
#   Major fault   - page must be loaded from disk
#   Invalid fault - no valid mapping: segmentation fault

# View page fault stats
grep pgfault /proc/vmstat

Swapping & Page Replacement

Swap Management

# View swap usage
free -h
swapon --show

# Create a swap file
dd if=/dev/zero of=/swapfile bs=1G count=4
chmod 600 /swapfile
mkswap /swapfile
swapon /swapfile

# Swappiness (0-100, default 60)
cat /proc/sys/vm/swappiness
echo 10 > /proc/sys/vm/swappiness   # Prefer reclaiming page cache over swapping

# View swap statistics
vmstat 1
# si - swap in  (KB/s)
# so - swap out (KB/s)

Page Replacement Algorithms

// Linux uses a variant of LRU (Least Recently Used)
// with separate active and inactive lists
struct lruvec {
    struct list_head lists[NR_LRU_LISTS];
    // LRU_INACTIVE_ANON - Inactive anonymous pages
    // LRU_ACTIVE_ANON   - Active anonymous pages
    // LRU_INACTIVE_FILE - Inactive file pages
    // LRU_ACTIVE_FILE   - Active file pages
    // LRU_UNEVICTABLE   - Cannot be evicted
};

// Page reclaim algorithm (simplified)
void shrink_page_list(struct list_head *page_list)
{
    list_for_each_entry(page, page_list, lru) {
        if (PageActive(page))
            continue;                    // Skip active pages
        if (PageDirty(page)) {
            if (pageout(page) == PAGE_SUCCESS)
                continue;                // Written to swap/disk
        }
        if (page_mapped(page))
            try_to_unmap(page);          // Remove mappings
        if (page_count(page) == 0)
            free_page(page);             // Reclaim page
    }
}

Memory Zones

Linux divides physical memory into zones:

enum zone_type {
    ZONE_DMA,      // 0-16 MB (ISA DMA)
    ZONE_DMA32,    // 0-4 GB (32-bit DMA)
    ZONE_NORMAL,   // Normal memory
    ZONE_HIGHMEM,  // >896 MB on 32-bit
    ZONE_MOVABLE,  // Movable pages
    ZONE_DEVICE,   // Device memory
};

// Zone watermarks
struct zone {
    unsigned long watermark[NR_WMARK];
    // WMARK_MIN  - Minimum free pages
    // WMARK_LOW  - Start background reclaim (kswapd)
    // WMARK_HIGH - Stop background reclaim
};

# View zones
cat /proc/zoneinfo

Copy-on-Write (COW)

COW delays copying until necessary:

// fork() creates COW mappings:
// parent and child share all pages, marked read-only;
// the actual copy happens on the first write

// COW page fault handler (simplified)
void do_wp_page(struct vm_area_struct *vma,
                unsigned long address, pte_t *pte)
{
    struct page *old_page = pte_page(*pte);
    struct page *new_page;

    // Check if we're the only user
    if (page_count(old_page) == 1) {
        *pte = pte_mkwrite(*pte);   // We own it - just make it writable
        return;
    }

    // Otherwise copy the page
    new_page = alloc_page(GFP_KERNEL);
    copy_page(new_page, old_page);

    // Point the PTE at the new, writable copy
    *pte = mk_pte(new_page, vma->vm_page_prot);
    *pte = pte_mkwrite(*pte);
}

Huge Pages

Reduce TLB pressure with larger pages:

# Transparent Huge Pages (THP)
cat /sys/kernel/mm/transparent_hugepage/enabled
echo always > /sys/kernel/mm/transparent_hugepage/enabled

# Static huge pages
echo 100 > /proc/sys/vm/nr_hugepages
mount -t hugetlbfs none /mnt/hugepages

# View huge page info
grep Huge /proc/meminfo

// Explicitly allocate huge pages
void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                  -1, 0);

# Check whether a process uses huge pages
grep AnonHugePages /proc/[pid]/smaps

OOM Killer

When memory is exhausted:

// OOM score calculation (simplified)
long oom_badness(struct task_struct *p)
{
    long points;

    // Base score on resident and swapped-out memory
    points = get_mm_rss(p->mm) +
             get_mm_counter(p->mm, MM_SWAPENTS);

    // Adjust for oom_score_adj (-1000 to 1000)
    points *= 1000 + p->signal->oom_score_adj;
    points /= 1000;
    return points;
}

# Tune the OOM killer per process
echo -1000 > /proc/[pid]/oom_score_adj   # Exempt from the OOM killer
echo 1000  > /proc/[pid]/oom_score_adj   # First to die

Memory Debugging

Tools and Techniques

# Valgrind - memory leak detection
valgrind --leak-check=full ./program

# AddressSanitizer
gcc -fsanitize=address -g program.c
./a.out

# System memory info
cat /proc/meminfo
vmstat -s
smem -tk                       # Per-process memory

# Memory maps
pmap -x [pid]
cat /proc/[pid]/smaps_rollup

Common Issues

// Use-after-free
char *ptr = malloc(100);
free(ptr);
*ptr = 'x';               // BUG! ptr no longer owns that memory

// Double free
free(ptr);
free(ptr);                // BUG! undefined behavior

// Memory leak
while (1) {
    malloc(100);          // Never freed
}

// Stack overflow
void recursive(void) {
    char large[1000000];  // Too big for the stack
    recursive();
}

Best Practices

  1. Use appropriate allocators - kmalloc for small, vmalloc for large
  2. Lock critical pages - Prevent swapping of sensitive data
  3. Monitor memory pressure - Watch for excessive swapping
  4. Use huge pages - For large memory applications
  5. Handle OOM gracefully - Set appropriate oom_score_adj
  6. Profile memory usage - Find and fix leaks early
  7. Understand NUMA - Optimize for memory locality
  8. Tune vm parameters - Adjust swappiness, dirty ratios

Conclusion

Linux memory management is a masterpiece of engineering, seamlessly juggling virtual addresses, physical pages, and storage to create the illusion of infinite, private memory for each process. From the lightning-fast TLB to the clever COW optimization, every component works in harmony.

The interactive visualizations showed how addresses translate, pages swap, and memory pressure triggers the delicate dance of page replacement. Understanding these mechanisms helps you write more efficient programs and debug the trickiest memory issues.

Remember: in Linux, memory is virtual, pages are lazy, and the OOM killer is always watching.
