Page Migration & Fault Handling

Understanding virtual memory page migration, fault handling, and TLB management in CPU-GPU systems

Page migration is the process of moving memory pages between different memory regions (CPU ↔ GPU) in response to access patterns, enabling efficient memory utilization in heterogeneous systems.
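
With CUDA managed (unified) memory, for instance, the driver performs this migration automatically on first touch from either side. The sketch below is illustrative only: the kernel, sizes, and launch configuration are arbitrary, and error checking is omitted.

// Illustrative sketch: CUDA managed memory, where pages migrate on demand.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                     // First GPU touch faults the page in
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // One pointer valid on CPU and GPU

    for (int i = 0; i < n; i++) data[i] = 1.0f;     // CPU touch: pages resident in host memory

    scale<<<(n + 255) / 256, 256>>>(data, n);       // GPU touch: pages fault and migrate to the device
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);              // CPU touch again: the page migrates back
    cudaFree(data);
    return 0;
}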

Migration Pipeline

0. Idle: system operating normally
1. Page Fault: the accessing processor (CPU or GPU) touches an unmapped page
2. Fault Handler: the OS/driver handles the page fault
3. Page Allocation: a page is allocated in the target memory
4. Data Transfer: the page data is copied over PCIe or the memory bus
5. TLB Update: the Translation Lookaside Buffer is updated
6. Resume: execution resumes with the mapped page

Migration Timing

Page Size: 4 KB
Transfer Time: 0.13 μs
PCIe Bandwidth: 32 GB/s
Total Latency: ~1.1 μs
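
These numbers follow from a simple model: transfer time is the page size divided by the interconnect bandwidth, plus a roughly fixed fault-handling overhead. A small sketch of that arithmetic (the ~1 μs overhead constant is an illustrative assumption, consistent with the overhead used later in this article):

// Rough migration-latency model: transfer time plus fixed fault-handling overhead.
#include <cstdio>

int main() {
    const double page_bytes = 4096.0;      // 4 KB page
    const double pcie_bw    = 32e9;        // 32 GB/s PCIe bandwidth
    const double overhead_s = 1e-6;        // ~1 us of fault handling + TLB update (assumed)

    const double transfer_s = page_bytes / pcie_bw;   // 4096 / 32e9 ≈ 0.13 us
    printf("transfer %.2f us, total %.2f us\n",
           transfer_s * 1e6, (transfer_s + overhead_s) * 1e6);
    // Prints approximately: transfer 0.13 us, total 1.13 us
    return 0;
}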

Fault Types

Minor Fault

The page is already resident in physical memory but not yet mapped for the accessor. Resolved quickly by updating the page tables.

Major Fault

The page is not resident in memory. Resolution requires allocating a page and transferring its data from the backing store (or migrating it from the other memory).

Page Fault Handler Implementation

// GPU page fault handler
void handle_gpu_page_fault(uint64_t fault_addr, int gpu_id) {
    // 1. Identify faulting page
    uint64_t page_num = fault_addr >> PAGE_SHIFT;
    
    // 2. Check if minor or major fault
    if (page_exists_in_cpu_memory(page_num)) {
        // Minor fault - page exists, just map it
        map_page_to_gpu(page_num, gpu_id);
    } else {
        // Major fault - allocate and transfer
        void* cpu_page = allocate_cpu_page();
        void* gpu_page = allocate_gpu_page(gpu_id);
        
        // 3. Transfer data (cpu_page must first be populated from the
        //    backing store before it is copied to the GPU page)
        dma_transfer(cpu_page, gpu_page, PAGE_SIZE);
        
        // 4. Update page tables
        update_gpu_page_table(page_num, gpu_page);
    }
    
    // 5. Update TLB
    invalidate_tlb_entry(fault_addr);
    
    // 6. Resume GPU execution
    resume_gpu_warp(gpu_id);
}

Page Fault Fundamentals

Types of Page Faults

Minor Page Fault

// Minor fault - page exists but not mapped
void handle_minor_fault(uint64_t virtual_addr) {
    // Page is in memory, just not in page table
    struct page *page = find_page_in_memory(virtual_addr);

    if (page) {
        // Simply update page table entry
        update_page_table(virtual_addr, page->physical_addr);

        // Update access bits
        page->accessed = 1;

        // Return quickly - minimal overhead
        return;  // ~100-500 ns
    }
}

Major Page Fault

// Major fault - page not in memory
void handle_major_fault(uint64_t virtual_addr) {
    // Allocate new page
    struct page *new_page = allocate_page();

    // Load from backing store or migrate
    if (has_backing_store(virtual_addr)) {
        // Read from disk/SSD
        read_from_disk(virtual_addr, new_page);  // ~100 μs - 10 ms
    } else {
        // Migrate from other memory
        migrate_page(virtual_addr, new_page);    // ~1-10 μs
    }

    // Update page table
    update_page_table(virtual_addr, new_page->physical_addr);
}

Page Fault Handler Architecture

Hardware Detection

// CPU Page Fault Exception Handler
void page_fault_handler(struct pt_regs *regs) {
    uint64_t fault_addr = read_cr2();        // Faulting address
    uint64_t error_code = regs->error_code;

    // Decode fault type
    bool present  = error_code & 0x1;    // Page present?
    bool write    = error_code & 0x2;    // Write access?
    bool user     = error_code & 0x4;    // User mode?
    bool reserved = error_code & 0x8;    // Reserved bit?
    bool ifetch   = error_code & 0x10;   // Instruction fetch?

    if (!present) {
        // Page not present - handle fault
        if (is_valid_address(fault_addr)) {
            handle_valid_fault(fault_addr, write);
        } else {
            // Segmentation fault
            send_sigsegv(current_task);
        }
    } else if (write && !is_writable(fault_addr)) {
        // Copy-on-write
        handle_cow_fault(fault_addr);
    }
}

GPU Page Fault Handling

// NVIDIA GPU page fault handler (simplified)
class GPUPageFaultHandler {
private:
    struct FaultInfo {
        uint64_t address;
        uint32_t gpu_id;
        uint32_t sm_id;
        uint32_t warp_id;
        bool is_write;
        bool is_atomic;
    };

public:
    void handle_fault(FaultInfo fault) {
        // 1. Suspend faulting warp
        suspend_warp(fault.gpu_id, fault.sm_id, fault.warp_id);

        // 2. Check fault type
        if (is_migration_candidate(fault.address)) {
            // Migrate page to GPU
            migrate_to_gpu(fault.address);
        } else if (is_remote_access(fault.address)) {
            // Setup remote mapping
            setup_peer_mapping(fault.address);
        } else {
            // Invalid access
            raise_gpu_exception(fault);
        }

        // 3. Resume warp
        resume_warp(fault.gpu_id, fault.sm_id, fault.warp_id);
    }

    void migrate_to_gpu(uint64_t addr) {
        // Get page info
        size_t page_size = get_page_size(addr);
        void* cpu_addr = get_cpu_address(addr);

        // Allocate GPU page
        void* gpu_addr = allocate_gpu_page(page_size);

        // DMA transfer
        cudaMemcpyAsync(gpu_addr, cpu_addr, page_size,
                        cudaMemcpyHostToDevice, migration_stream);

        // Update page tables
        update_gpu_page_table(addr, gpu_addr);
        invalidate_cpu_mapping(addr);
    }
};

Page Migration Mechanisms

DMA Transfer Engine

// DMA engine for page migration
class DMAEngine {
private:
    struct DMADescriptor {
        uint64_t src_addr;
        uint64_t dst_addr;
        uint32_t size;
        uint32_t flags;
    };

    queue<DMADescriptor> dma_queue;

public:
    void migrate_page(void* src, void* dst, size_t size) {
        DMADescriptor desc = {
            .src_addr = (uint64_t)src,
            .dst_addr = (uint64_t)dst,
            .size = size,
            .flags = DMA_MIGRATE | DMA_COHERENT
        };

        // Queue DMA transfer
        dma_queue.push(desc);

        // Start transfer
        start_dma_transfer();

        // Wait for completion (or async)
        wait_dma_complete();
    }

    void start_dma_transfer() {
        while (!dma_queue.empty()) {
            auto desc = dma_queue.front();
            dma_queue.pop();

            // Program DMA controller
            write_dma_src(desc.src_addr);
            write_dma_dst(desc.dst_addr);
            write_dma_size(desc.size);
            write_dma_control(DMA_START | desc.flags);
        }
    }
};

Migration Policies

class MigrationPolicy:
    def __init__(self):
        self.access_counters = {}
        self.migration_threshold = 3

    def should_migrate(self, page_id, accessor):
        """Determine if page should be migrated"""
        # Update access counter
        if page_id not in self.access_counters:
            self.access_counters[page_id] = {'cpu': 0, 'gpu': 0}
        self.access_counters[page_id][accessor] += 1

        # Check migration criteria
        counts = self.access_counters[page_id]

        # Threshold-based migration
        if counts[accessor] >= self.migration_threshold:
            current_location = self.get_page_location(page_id)
            if current_location != accessor:
                return True, accessor

        # Ratio-based migration
        total = counts['cpu'] + counts['gpu']
        if total > 10:
            gpu_ratio = counts['gpu'] / total
            if gpu_ratio > 0.8 and self.get_page_location(page_id) == 'cpu':
                return True, 'gpu'
            elif gpu_ratio < 0.2 and self.get_page_location(page_id) == 'gpu':
                return True, 'cpu'

        return False, None

TLB Management

Translation Lookaside Buffer

// TLB structure and management
struct TLBEntry {
    uint64_t virtual_page;
    uint64_t physical_page;
    uint8_t  asid;          // Address Space ID
    uint8_t  permissions;   // R/W/X
    bool     valid;
    bool     global;
    uint32_t age;           // For LRU replacement
};

class TLBManager {
private:
    static const int TLB_SIZE = 1024;
    TLBEntry tlb[TLB_SIZE];

public:
    // TLB lookup
    bool lookup(uint64_t vaddr, uint64_t& paddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        // Direct-mapped lookup
        if (tlb[index].valid && tlb[index].virtual_page == vpn) {
            paddr = (tlb[index].physical_page << PAGE_SHIFT) | (vaddr & PAGE_MASK);
            tlb[index].age = 0;  // Reset age
            return true;         // TLB hit
        }
        return false;            // TLB miss
    }

    // TLB insert
    void insert(uint64_t vaddr, uint64_t paddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        uint64_t ppn = paddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        tlb[index] = {
            .virtual_page = vpn,
            .physical_page = ppn,
            .valid = true,
            .age = 0
        };
    }

    // TLB invalidation
    void invalidate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        if (tlb[index].virtual_page == vpn) {
            tlb[index].valid = false;
        }
    }

    // Flush entire TLB
    void flush() {
        for (int i = 0; i < TLB_SIZE; i++) {
            tlb[i].valid = false;
        }
    }

    // Selective TLB shootdown
    void shootdown(uint64_t vaddr, cpu_set_t cpus) {
        // Send IPI to other CPUs
        for_each_cpu(cpu, cpus) {
            send_ipi(cpu, IPI_TLB_FLUSH, vaddr);
        }
        // Invalidate local TLB
        invalidate(vaddr);
    }
};

TLB Coherency

// TLB shootdown protocol
void tlb_shootdown_handler(void* info) {
    struct tlb_flush_info* flush = (struct tlb_flush_info*)info;

    if (flush->flush_all) {
        // Flush entire TLB
        __flush_tlb_all();
    } else {
        // Flush specific entries
        for (uint64_t addr = flush->start; addr < flush->end; addr += PAGE_SIZE) {
            __flush_tlb_one(addr);
        }
    }
}

// Initiate TLB shootdown
void flush_tlb_range(uint64_t start, uint64_t end) {
    struct tlb_flush_info info = {
        .start = start,
        .end = end,
        .flush_all = false
    };

    // Flush local TLB
    for (uint64_t addr = start; addr < end; addr += PAGE_SIZE) {
        __flush_tlb_one(addr);
    }

    // Send IPI to other CPUs
    on_each_cpu(tlb_shootdown_handler, &info, 1);
}

Page Size Considerations

Multiple Page Sizes

class PageSizeManager:
    def __init__(self):
        self.page_sizes = {
            'small': 4 * 1024,            # 4 KB
            'medium': 64 * 1024,          # 64 KB
            'large': 2 * 1024 * 1024,     # 2 MB
            'huge': 1024 * 1024 * 1024    # 1 GB
        }

    def select_page_size(self, allocation_size, access_pattern):
        """Select optimal page size based on allocation"""
        # For large allocations, use huge pages
        if allocation_size >= 1024 * 1024 * 1024:
            return 'huge'

        # For sequential access, prefer larger pages
        if access_pattern == 'sequential':
            if allocation_size >= 2 * 1024 * 1024:
                return 'large'
            elif allocation_size >= 64 * 1024:
                return 'medium'

        # For random access, use smaller pages
        if access_pattern == 'random':
            return 'small'

        # Default based on size
        if allocation_size >= 2 * 1024 * 1024:
            return 'large'
        elif allocation_size >= 64 * 1024:
            return 'medium'
        else:
            return 'small'

    def calculate_migration_cost(self, page_size, bandwidth):
        """Calculate migration time for different page sizes"""
        size_bytes = self.page_sizes[page_size]

        # PCIe transfer time
        transfer_time = size_bytes / bandwidth

        # Add fixed overhead
        overhead = 1e-6  # 1 microsecond

        return {
            'page_size': page_size,
            'size_bytes': size_bytes,
            'transfer_time_us': transfer_time * 1e6,
            'total_time_us': (transfer_time + overhead) * 1e6
        }

Performance Optimization

Batched Migration

// Batch multiple page migrations
class BatchedMigration {
private:
    struct MigrationBatch {
        vector<void*> src_pages;
        vector<void*> dst_pages;
        vector<size_t> sizes;
    };

public:
    void batch_migrate(MigrationBatch& batch) {
        // Use multiple streams for parallel transfers
        const int num_streams = 4;
        cudaStream_t streams[num_streams];

        for (int i = 0; i < num_streams; i++) {
            cudaStreamCreate(&streams[i]);
        }

        // Distribute pages across streams
        for (size_t i = 0; i < batch.src_pages.size(); i++) {
            int stream_id = i % num_streams;
            cudaMemcpyAsync(batch.dst_pages[i], batch.src_pages[i],
                            batch.sizes[i], cudaMemcpyHostToDevice,
                            streams[stream_id]);
        }

        // Synchronize all streams
        for (int i = 0; i < num_streams; i++) {
            cudaStreamSynchronize(streams[i]);
            cudaStreamDestroy(streams[i]);
        }
    }

    void prefetch_pages(vector<void*> pages, int device) {
        // Prefetch pages before they're needed
        for (void* page : pages) {
            cudaMemPrefetchAsync(page, PAGE_SIZE, device, 0);
        }
    }
};

Fault Prediction

class FaultPredictor:
    def __init__(self):
        self.access_history = []
        self.stride_detector = StrideDetector()

    def predict_next_fault(self, current_fault_addr):
        """Predict next likely page fault"""
        self.access_history.append(current_fault_addr)

        # Detect stride pattern
        stride = self.stride_detector.detect(self.access_history)

        if stride:
            # Stride-based prediction
            next_addr = current_fault_addr + stride
            return next_addr >> PAGE_SHIFT

        # Sequential prediction
        return (current_fault_addr >> PAGE_SHIFT) + 1

    def prefetch_predicted(self, predicted_pages):
        """Prefetch predicted pages"""
        for page in predicted_pages:
            if not self.is_present(page):
                self.async_fetch(page)


class StrideDetector:
    def detect(self, addresses):
        """Detect stride access pattern"""
        if len(addresses) < 3:
            return None

        # Calculate differences
        diffs = [addresses[i+1] - addresses[i] for i in range(len(addresses)-1)]

        # Check for consistent stride
        if len(set(diffs[-3:])) == 1:
            return diffs[-1]

        return None

Monitoring and Profiling

Page Fault Statistics

// Page fault statistics collection
struct PageFaultStats {
    atomic_long minor_faults;
    atomic_long major_faults;
    atomic_long cow_faults;
    atomic_long migrations;
    atomic_long tlb_flushes;

    // Per-page size statistics
    atomic_long faults_4k;
    atomic_long faults_2m;
    atomic_long faults_1g;

    // Timing statistics
    uint64_t total_fault_time;
    uint64_t max_fault_time;
    uint64_t min_fault_time;
};

void update_fault_stats(PageFaultStats* stats, FaultType type, uint64_t duration) {
    switch (type) {
        case MINOR_FAULT:
            atomic_long_inc(&stats->minor_faults);
            break;
        case MAJOR_FAULT:
            atomic_long_inc(&stats->major_faults);
            break;
        case COW_FAULT:
            atomic_long_inc(&stats->cow_faults);
            break;
    }

    // Update timing
    stats->total_fault_time += duration;
    stats->max_fault_time = max(stats->max_fault_time, duration);
    stats->min_fault_time = min(stats->min_fault_time, duration);
}

Best Practices

1. Minimize Page Faults

  • Use memory prefetching (see the sketch after this list)
  • Align data structures to page boundaries
  • Group related data together
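
A minimal sketch of the first two points, assuming CUDA unified memory is in use; the 64 MB working set, the 4 KB alignment, and the variable names are illustrative choices rather than a prescribed recipe.

// Illustrative sketch: page-aligned host data plus explicit prefetch of managed
// memory, so the first kernel launch does not start with a burst of demand faults.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 64UL << 20;               // 64 MB working set (illustrative)
    int device = 0;
    cudaGetDevice(&device);

    // Page-aligned host-side structure (4 KB boundary)
    void* host_meta = aligned_alloc(4096, 4096);

    // Managed allocation moved to the GPU ahead of time instead of on first fault
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);
    cudaMemPrefetchAsync(data, bytes, device, 0);  // Prefetch pages to the GPU up front
    cudaDeviceSynchronize();

    // ... launch kernels that access data without an initial storm of page faults ...

    cudaFree(data);
    free(host_meta);
    return 0;
}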

2. Optimize Page Size

  • Use huge pages for large allocations (see the sketch after this list)
  • Consider TLB coverage
  • Balance between internal fragmentation and TLB misses
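
A minimal, Linux-specific sketch of the huge-page point: madvise(MADV_HUGEPAGE) asks the kernel to back a large anonymous mapping with 2 MB transparent huge pages. The 1 GB region and the TLB-coverage figures in the comments are illustrative.

// Illustrative sketch (Linux): request transparent huge pages for a large mapping.
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 1UL << 30;   // 1 GB anonymous region (illustrative)
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    // Ask the kernel to back the range with 2 MB transparent huge pages.
    // TLB coverage intuition: 1024 entries reach 4 MB with 4 KB pages,
    // but 2 GB with 2 MB pages.
    madvise(buf, len, MADV_HUGEPAGE);

    memset(buf, 0, len);            // Touch the range so pages are actually populated
    munmap(buf, len);
    return 0;
}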

3. Efficient Migration

  • Batch multiple migrations
  • Use asynchronous transfers
  • Implement smart migration policies

4. TLB Management

  • Minimize TLB shootdowns
  • Use PCID/ASID for context switches
  • Consider TLB-friendly data layouts (see the sketch after this list)
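
A minimal sketch of the data-layout point: the same reduction performed in a page-friendly and a page-hostile traversal order. The matrix dimensions are illustrative; what matters is the access stride relative to the page size.

// Illustrative sketch: traversal order and TLB behaviour over a large matrix.
#include <vector>
#include <cstddef>

int main() {
    const size_t rows = 4096, cols = 4096;          // ~128 MB of doubles (illustrative)
    std::vector<double> m(rows * cols, 1.0);
    double sum = 0.0;

    // TLB-friendly: consecutive addresses; one 4 KB page serves ~512 accesses
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];

    // TLB-hostile: 32 KB stride; nearly every access lands on a different page
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];

    return sum > 0.0 ? 0 : 1;
}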

Common Issues and Solutions

Issue 1: Migration Thrashing

Problem: Pages constantly migrate back and forth between CPU and GPU.
Solution: Implement hysteresis in the migration policy (sketched below).
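
A minimal sketch of one such hysteresis scheme, based on a per-page streak counter and a cooldown window; the class name, thresholds, and tick-based timestamps are illustrative assumptions, not any particular driver's policy.

// Illustrative sketch: hysteresis to damp migration thrashing. A page migrates
// only after several consecutive remote accesses, and a cooldown blocks an
// immediate move back.
#include <cstdint>
#include <unordered_map>

enum class Loc { CPU, GPU };

struct PageState {
    Loc      location = Loc::CPU;
    int      remote_streak = 0;      // Consecutive accesses from the other side
    uint64_t last_migration = 0;     // Timestamp of the last move
};

class HysteresisPolicy {
    std::unordered_map<uint64_t, PageState> pages_;
    static constexpr int      kStreakThreshold = 4;       // Require sustained remote use
    static constexpr uint64_t kCooldownTicks   = 10000;   // Minimum time between moves

public:
    bool should_migrate(uint64_t page_id, Loc accessor, uint64_t now) {
        PageState& p = pages_[page_id];
        if (accessor == p.location) {        // Local access resets the streak
            p.remote_streak = 0;
            return false;
        }
        p.remote_streak++;
        bool cooled_down = (now - p.last_migration) > kCooldownTicks;
        if (p.remote_streak >= kStreakThreshold && cooled_down) {
            p.location = accessor;           // Commit the move
            p.remote_streak = 0;
            p.last_migration = now;
            return true;
        }
        return false;
    }
};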

Issue 2: TLB Pressure

Problem: Too many TLB misses.
Solution: Use huge pages and improve data locality.

Issue 3: Fault Storms

Problem: Many simultaneous page faults.
Solution: Batch fault handling and prefetch predicted pages.

Conclusion

Page migration and fault handling are critical components of modern heterogeneous memory systems. Understanding the mechanisms—from hardware fault detection through TLB management to DMA transfers—enables developers to optimize applications for minimal overhead and maximum performance in CPU-GPU unified memory architectures.
