Page Migration & Fault Handling

Understanding virtual memory page migration, fault handling, and TLB management in CPU-GPU systems

Page migration is the process of moving memory pages between different memory regions (CPU ↔ GPU) in response to access patterns, enabling efficient memory utilization in heterogeneous systems.
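
With CUDA managed (unified) memory, for instance, the driver performs this migration automatically on first touch from either side. The sketch below is illustrative only: the kernel, sizes, and launch configuration are arbitrary, and error checking is omitted.

// Illustrative sketch: CUDA managed memory, where pages migrate on demand.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void scale(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;                     // First GPU touch faults the page in
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));    // One pointer valid on CPU and GPU

    for (int i = 0; i < n; i++) data[i] = 1.0f;     // CPU touch: pages resident in host memory

    scale<<<(n + 255) / 256, 256>>>(data, n);       // GPU touch: pages fault and migrate to the device
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);              // CPU touch again: the page migrates back
    cudaFree(data);
    return 0;
}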

Migration Pipeline

0. Idle: system operating normally
1. Page Fault: the accessing processor (CPU or GPU) touches an unmapped page
2. Fault Handler: the OS/driver handles the page fault
3. Page Allocation: a page is allocated in the target memory
4. Data Transfer: the page data is copied over PCIe or the memory bus
5. TLB Update: the Translation Lookaside Buffer is updated
6. Resume: execution resumes with the mapped page

Migration Timing

Page Size: 4 KB
Transfer Time: 0.13 μs
PCIe Bandwidth: 32 GB/s
Total Latency: ~1.1 μs
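
These numbers follow from a simple model: transfer time is the page size divided by the interconnect bandwidth, plus a roughly fixed fault-handling overhead. A small sketch of that arithmetic (the ~1 μs overhead constant is an illustrative assumption, consistent with the overhead used later in this article):

// Rough migration-latency model: transfer time plus fixed fault-handling overhead.
#include <cstdio>

int main() {
    const double page_bytes = 4096.0;      // 4 KB page
    const double pcie_bw    = 32e9;        // 32 GB/s PCIe bandwidth
    const double overhead_s = 1e-6;        // ~1 us of fault handling + TLB update (assumed)

    const double transfer_s = page_bytes / pcie_bw;   // 4096 / 32e9 ≈ 0.13 us
    printf("transfer %.2f us, total %.2f us\n",
           transfer_s * 1e6, (transfer_s + overhead_s) * 1e6);
    // Prints approximately: transfer 0.13 us, total 1.13 us
    return 0;
}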

Fault Types

Minor Fault

The page is already resident in physical memory but not yet mapped for the accessor. Resolved quickly by updating the page tables.

Major Fault

The page is not resident in memory. Resolution requires allocating a page and transferring its data from the backing store (or migrating it from the other memory).

Page Fault Handler Implementation

// GPU page fault handler
void handle_gpu_page_fault(uint64_t fault_addr, int gpu_id) {
    // 1. Identify faulting page
    uint64_t page_num = fault_addr >> PAGE_SHIFT;
    
    // 2. Check if minor or major fault
    if (page_exists_in_cpu_memory(page_num)) {
        // Minor fault - page exists, just map it
        map_page_to_gpu(page_num, gpu_id);
    } else {
        // Major fault - allocate and transfer
        void* cpu_page = allocate_cpu_page();
        void* gpu_page = allocate_gpu_page(gpu_id);
        
        // 3. Transfer data (cpu_page must first be populated from the
        //    backing store before it is copied to the GPU page)
        dma_transfer(cpu_page, gpu_page, PAGE_SIZE);
        
        // 4. Update page tables
        update_gpu_page_table(page_num, gpu_page);
    }
    
    // 5. Update TLB
    invalidate_tlb_entry(fault_addr);
    
    // 6. Resume GPU execution
    resume_gpu_warp(gpu_id);
}

Page Fault Fundamentals

Types of Page Faults

Minor Page Fault

// Minor fault - page exists but not mapped
void handle_minor_fault(uint64_t virtual_addr) {
    // Page is in memory, just not in page table
    struct page *page = find_page_in_memory(virtual_addr);

    if (page) {
        // Simply update page table entry
        update_page_table(virtual_addr, page->physical_addr);

        // Update access bits
        page->accessed = 1;

        // Return quickly - minimal overhead
        return;  // ~100-500 ns
    }
}

Major Page Fault

// Major fault - page not in memory
void handle_major_fault(uint64_t virtual_addr) {
    // Allocate new page
    struct page *new_page = allocate_page();

    // Load from backing store or migrate
    if (has_backing_store(virtual_addr)) {
        // Read from disk/SSD
        read_from_disk(virtual_addr, new_page);  // ~100 μs - 10 ms
    } else {
        // Migrate from other memory
        migrate_page(virtual_addr, new_page);    // ~1-10 μs
    }

    // Update page table
    update_page_table(virtual_addr, new_page->physical_addr);
}

Page Fault Handler Architecture

Hardware Detection

// CPU Page Fault Exception Handler
void page_fault_handler(struct pt_regs *regs) {
    uint64_t fault_addr = read_cr2();        // Faulting address
    uint64_t error_code = regs->error_code;

    // Decode fault type
    bool present  = error_code & 0x1;    // Page present?
    bool write    = error_code & 0x2;    // Write access?
    bool user     = error_code & 0x4;    // User mode?
    bool reserved = error_code & 0x8;    // Reserved bit?
    bool ifetch   = error_code & 0x10;   // Instruction fetch?

    if (!present) {
        // Page not present - handle fault
        if (is_valid_address(fault_addr)) {
            handle_valid_fault(fault_addr, write);
        } else {
            // Segmentation fault
            send_sigsegv(current_task);
        }
    } else if (write && !is_writable(fault_addr)) {
        // Copy-on-write
        handle_cow_fault(fault_addr);
    }
}

GPU Page Fault Handling

// NVIDIA GPU page fault handler (simplified)
class GPUPageFaultHandler {
private:
    struct FaultInfo {
        uint64_t address;
        uint32_t gpu_id;
        uint32_t sm_id;
        uint32_t warp_id;
        bool is_write;
        bool is_atomic;
    };

public:
    void handle_fault(FaultInfo fault) {
        // 1. Suspend faulting warp
        suspend_warp(fault.gpu_id, fault.sm_id, fault.warp_id);

        // 2. Check fault type
        if (is_migration_candidate(fault.address)) {
            // Migrate page to GPU
            migrate_to_gpu(fault.address);
        } else if (is_remote_access(fault.address)) {
            // Setup remote mapping
            setup_peer_mapping(fault.address);
        } else {
            // Invalid access
            raise_gpu_exception(fault);
        }

        // 3. Resume warp
        resume_warp(fault.gpu_id, fault.sm_id, fault.warp_id);
    }

    void migrate_to_gpu(uint64_t addr) {
        // Get page info
        size_t page_size = get_page_size(addr);
        void* cpu_addr = get_cpu_address(addr);

        // Allocate GPU page
        void* gpu_addr = allocate_gpu_page(page_size);

        // DMA transfer
        cudaMemcpyAsync(gpu_addr, cpu_addr, page_size,
                        cudaMemcpyHostToDevice, migration_stream);

        // Update page tables
        update_gpu_page_table(addr, gpu_addr);
        invalidate_cpu_mapping(addr);
    }
};

Page Migration Mechanisms

DMA Transfer Engine

// DMA engine for page migration
class DMAEngine {
private:
    struct DMADescriptor {
        uint64_t src_addr;
        uint64_t dst_addr;
        uint32_t size;
        uint32_t flags;
    };

    queue<DMADescriptor> dma_queue;

public:
    void migrate_page(void* src, void* dst, size_t size) {
        DMADescriptor desc = {
            .src_addr = (uint64_t)src,
            .dst_addr = (uint64_t)dst,
            .size = size,
            .flags = DMA_MIGRATE | DMA_COHERENT
        };

        // Queue DMA transfer
        dma_queue.push(desc);

        // Start transfer
        start_dma_transfer();

        // Wait for completion (or async)
        wait_dma_complete();
    }

    void start_dma_transfer() {
        while (!dma_queue.empty()) {
            auto desc = dma_queue.front();
            dma_queue.pop();

            // Program DMA controller
            write_dma_src(desc.src_addr);
            write_dma_dst(desc.dst_addr);
            write_dma_size(desc.size);
            write_dma_control(DMA_START | desc.flags);
        }
    }
};

Migration Policies

class MigrationPolicy:
    def __init__(self):
        self.access_counters = {}
        self.migration_threshold = 3

    def should_migrate(self, page_id, accessor):
        """Determine if page should be migrated"""
        # Update access counter
        if page_id not in self.access_counters:
            self.access_counters[page_id] = {'cpu': 0, 'gpu': 0}
        self.access_counters[page_id][accessor] += 1

        # Check migration criteria
        counts = self.access_counters[page_id]

        # Threshold-based migration
        if counts[accessor] >= self.migration_threshold:
            current_location = self.get_page_location(page_id)
            if current_location != accessor:
                return True, accessor

        # Ratio-based migration
        total = counts['cpu'] + counts['gpu']
        if total > 10:
            gpu_ratio = counts['gpu'] / total
            if gpu_ratio > 0.8 and self.get_page_location(page_id) == 'cpu':
                return True, 'gpu'
            elif gpu_ratio < 0.2 and self.get_page_location(page_id) == 'gpu':
                return True, 'cpu'

        return False, None

TLB Management

Translation Lookaside Buffer

// TLB structure and management
struct TLBEntry {
    uint64_t virtual_page;
    uint64_t physical_page;
    uint8_t  asid;          // Address Space ID
    uint8_t  permissions;   // R/W/X
    bool     valid;
    bool     global;
    uint32_t age;           // For LRU replacement
};

class TLBManager {
private:
    static const int TLB_SIZE = 1024;
    TLBEntry tlb[TLB_SIZE];

public:
    // TLB lookup
    bool lookup(uint64_t vaddr, uint64_t& paddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        // Direct-mapped lookup
        if (tlb[index].valid && tlb[index].virtual_page == vpn) {
            paddr = (tlb[index].physical_page << PAGE_SHIFT) | (vaddr & PAGE_MASK);
            tlb[index].age = 0;  // Reset age
            return true;         // TLB hit
        }
        return false;            // TLB miss
    }

    // TLB insert
    void insert(uint64_t vaddr, uint64_t paddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        uint64_t ppn = paddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        tlb[index] = {
            .virtual_page = vpn,
            .physical_page = ppn,
            .valid = true,
            .age = 0
        };
    }

    // TLB invalidation
    void invalidate(uint64_t vaddr) {
        uint64_t vpn = vaddr >> PAGE_SHIFT;
        int index = hash(vpn) % TLB_SIZE;

        if (tlb[index].virtual_page == vpn) {
            tlb[index].valid = false;
        }
    }

    // Flush entire TLB
    void flush() {
        for (int i = 0; i < TLB_SIZE; i++) {
            tlb[i].valid = false;
        }
    }

    // Selective TLB shootdown
    void shootdown(uint64_t vaddr, cpu_set_t cpus) {
        // Send IPI to other CPUs
        for_each_cpu(cpu, cpus) {
            send_ipi(cpu, IPI_TLB_FLUSH, vaddr);
        }
        // Invalidate local TLB
        invalidate(vaddr);
    }
};

TLB Coherency

// TLB shootdown protocol
void tlb_shootdown_handler(void* info) {
    struct tlb_flush_info* flush = (struct tlb_flush_info*)info;

    if (flush->flush_all) {
        // Flush entire TLB
        __flush_tlb_all();
    } else {
        // Flush specific entries
        for (uint64_t addr = flush->start; addr < flush->end; addr += PAGE_SIZE) {
            __flush_tlb_one(addr);
        }
    }
}

// Initiate TLB shootdown
void flush_tlb_range(uint64_t start, uint64_t end) {
    struct tlb_flush_info info = {
        .start = start,
        .end = end,
        .flush_all = false
    };

    // Flush local TLB
    for (uint64_t addr = start; addr < end; addr += PAGE_SIZE) {
        __flush_tlb_one(addr);
    }

    // Send IPI to other CPUs
    on_each_cpu(tlb_shootdown_handler, &info, 1);
}

Page Size Considerations

Multiple Page Sizes

class PageSizeManager:
    def __init__(self):
        self.page_sizes = {
            'small': 4 * 1024,            # 4 KB
            'medium': 64 * 1024,          # 64 KB
            'large': 2 * 1024 * 1024,     # 2 MB
            'huge': 1024 * 1024 * 1024    # 1 GB
        }

    def select_page_size(self, allocation_size, access_pattern):
        """Select optimal page size based on allocation"""
        # For large allocations, use huge pages
        if allocation_size >= 1024 * 1024 * 1024:
            return 'huge'

        # For sequential access, prefer larger pages
        if access_pattern == 'sequential':
            if allocation_size >= 2 * 1024 * 1024:
                return 'large'
            elif allocation_size >= 64 * 1024:
                return 'medium'

        # For random access, use smaller pages
        if access_pattern == 'random':
            return 'small'

        # Default based on size
        if allocation_size >= 2 * 1024 * 1024:
            return 'large'
        elif allocation_size >= 64 * 1024:
            return 'medium'
        else:
            return 'small'

    def calculate_migration_cost(self, page_size, bandwidth):
        """Calculate migration time for different page sizes"""
        size_bytes = self.page_sizes[page_size]

        # PCIe transfer time
        transfer_time = size_bytes / bandwidth

        # Add fixed overhead
        overhead = 1e-6  # 1 microsecond

        return {
            'page_size': page_size,
            'size_bytes': size_bytes,
            'transfer_time_us': transfer_time * 1e6,
            'total_time_us': (transfer_time + overhead) * 1e6
        }

Performance Optimization

Batched Migration

// Batch multiple page migrations
class BatchedMigration {
private:
    struct MigrationBatch {
        vector<void*> src_pages;
        vector<void*> dst_pages;
        vector<size_t> sizes;
    };

public:
    void batch_migrate(MigrationBatch& batch) {
        // Use multiple streams for parallel transfers
        const int num_streams = 4;
        cudaStream_t streams[num_streams];

        for (int i = 0; i < num_streams; i++) {
            cudaStreamCreate(&streams[i]);
        }

        // Distribute pages across streams
        for (size_t i = 0; i < batch.src_pages.size(); i++) {
            int stream_id = i % num_streams;
            cudaMemcpyAsync(batch.dst_pages[i], batch.src_pages[i],
                            batch.sizes[i], cudaMemcpyHostToDevice,
                            streams[stream_id]);
        }

        // Synchronize all streams
        for (int i = 0; i < num_streams; i++) {
            cudaStreamSynchronize(streams[i]);
            cudaStreamDestroy(streams[i]);
        }
    }

    void prefetch_pages(vector<void*> pages, int device) {
        // Prefetch pages before they're needed
        for (void* page : pages) {
            cudaMemPrefetchAsync(page, PAGE_SIZE, device, 0);
        }
    }
};

Fault Prediction

class FaultPredictor:
    def __init__(self):
        self.access_history = []
        self.stride_detector = StrideDetector()

    def predict_next_fault(self, current_fault_addr):
        """Predict next likely page fault"""
        self.access_history.append(current_fault_addr)

        # Detect stride pattern
        stride = self.stride_detector.detect(self.access_history)

        if stride:
            # Stride-based prediction
            next_addr = current_fault_addr + stride
            return next_addr >> PAGE_SHIFT

        # Sequential prediction
        return (current_fault_addr >> PAGE_SHIFT) + 1

    def prefetch_predicted(self, predicted_pages):
        """Prefetch predicted pages"""
        for page in predicted_pages:
            if not self.is_present(page):
                self.async_fetch(page)


class StrideDetector:
    def detect(self, addresses):
        """Detect stride access pattern"""
        if len(addresses) < 3:
            return None

        # Calculate differences
        diffs = [addresses[i+1] - addresses[i] for i in range(len(addresses)-1)]

        # Check for consistent stride
        if len(set(diffs[-3:])) == 1:
            return diffs[-1]

        return None

Monitoring and Profiling

Page Fault Statistics

// Page fault statistics collection
struct PageFaultStats {
    atomic_long minor_faults;
    atomic_long major_faults;
    atomic_long cow_faults;
    atomic_long migrations;
    atomic_long tlb_flushes;

    // Per-page size statistics
    atomic_long faults_4k;
    atomic_long faults_2m;
    atomic_long faults_1g;

    // Timing statistics
    uint64_t total_fault_time;
    uint64_t max_fault_time;
    uint64_t min_fault_time;
};

void update_fault_stats(PageFaultStats* stats, FaultType type, uint64_t duration) {
    switch (type) {
        case MINOR_FAULT:
            atomic_long_inc(&stats->minor_faults);
            break;
        case MAJOR_FAULT:
            atomic_long_inc(&stats->major_faults);
            break;
        case COW_FAULT:
            atomic_long_inc(&stats->cow_faults);
            break;
    }

    // Update timing
    stats->total_fault_time += duration;
    stats->max_fault_time = max(stats->max_fault_time, duration);
    stats->min_fault_time = min(stats->min_fault_time, duration);
}

Best Practices

1. Minimize Page Faults

  • Use memory prefetching (see the sketch after this list)
  • Align data structures to page boundaries
  • Group related data together
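
A minimal sketch of the first two points, assuming CUDA unified memory is in use; the 64 MB working set, the 4 KB alignment, and the variable names are illustrative choices rather than a prescribed recipe.

// Illustrative sketch: page-aligned host data plus explicit prefetch of managed
// memory, so the first kernel launch does not start with a burst of demand faults.
#include <cuda_runtime.h>
#include <cstdlib>

int main() {
    const size_t bytes = 64UL << 20;               // 64 MB working set (illustrative)
    int device = 0;
    cudaGetDevice(&device);

    // Page-aligned host-side structure (4 KB boundary)
    void* host_meta = aligned_alloc(4096, 4096);

    // Managed allocation moved to the GPU ahead of time instead of on first fault
    float* data = nullptr;
    cudaMallocManaged(&data, bytes);
    cudaMemPrefetchAsync(data, bytes, device, 0);  // Prefetch pages to the GPU up front
    cudaDeviceSynchronize();

    // ... launch kernels that access data without an initial storm of page faults ...

    cudaFree(data);
    free(host_meta);
    return 0;
}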

2. Optimize Page Size

  • Use huge pages for large allocations (see the sketch after this list)
  • Consider TLB coverage
  • Balance between internal fragmentation and TLB misses
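
A minimal, Linux-specific sketch of the huge-page point: madvise(MADV_HUGEPAGE) asks the kernel to back a large anonymous mapping with 2 MB transparent huge pages. The 1 GB region and the TLB-coverage figures in the comments are illustrative.

// Illustrative sketch (Linux): request transparent huge pages for a large mapping.
#include <sys/mman.h>
#include <cstdio>
#include <cstring>

int main() {
    const size_t len = 1UL << 30;   // 1 GB anonymous region (illustrative)
    void* buf = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED) { perror("mmap"); return 1; }

    // Ask the kernel to back the range with 2 MB transparent huge pages.
    // TLB coverage intuition: 1024 entries reach 4 MB with 4 KB pages,
    // but 2 GB with 2 MB pages.
    madvise(buf, len, MADV_HUGEPAGE);

    memset(buf, 0, len);            // Touch the range so pages are actually populated
    munmap(buf, len);
    return 0;
}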

3. Efficient Migration

  • Batch multiple migrations
  • Use asynchronous transfers
  • Implement smart migration policies

4. TLB Management

  • Minimize TLB shootdowns
  • Use PCID/ASID for context switches
  • Consider TLB-friendly data layouts (see the sketch after this list)
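
A minimal sketch of the data-layout point: the same reduction performed in a page-friendly and a page-hostile traversal order. The matrix dimensions are illustrative; what matters is the access stride relative to the page size.

// Illustrative sketch: traversal order and TLB behaviour over a large matrix.
#include <vector>
#include <cstddef>

int main() {
    const size_t rows = 4096, cols = 4096;          // ~128 MB of doubles (illustrative)
    std::vector<double> m(rows * cols, 1.0);
    double sum = 0.0;

    // TLB-friendly: consecutive addresses; one 4 KB page serves ~512 accesses
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += m[r * cols + c];

    // TLB-hostile: 32 KB stride; nearly every access lands on a different page
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += m[r * cols + c];

    return sum > 0.0 ? 0 : 1;
}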

Common Issues and Solutions

Issue 1: Migration Thrashing

Problem: Pages constantly migrate back and forth between CPU and GPU.
Solution: Implement hysteresis in the migration policy (sketched below).
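
A minimal sketch of one such hysteresis scheme, based on a per-page streak counter and a cooldown window; the class name, thresholds, and tick-based timestamps are illustrative assumptions, not any particular driver's policy.

// Illustrative sketch: hysteresis to damp migration thrashing. A page migrates
// only after several consecutive remote accesses, and a cooldown blocks an
// immediate move back.
#include <cstdint>
#include <unordered_map>

enum class Loc { CPU, GPU };

struct PageState {
    Loc      location = Loc::CPU;
    int      remote_streak = 0;      // Consecutive accesses from the other side
    uint64_t last_migration = 0;     // Timestamp of the last move
};

class HysteresisPolicy {
    std::unordered_map<uint64_t, PageState> pages_;
    static constexpr int      kStreakThreshold = 4;       // Require sustained remote use
    static constexpr uint64_t kCooldownTicks   = 10000;   // Minimum time between moves

public:
    bool should_migrate(uint64_t page_id, Loc accessor, uint64_t now) {
        PageState& p = pages_[page_id];
        if (accessor == p.location) {        // Local access resets the streak
            p.remote_streak = 0;
            return false;
        }
        p.remote_streak++;
        bool cooled_down = (now - p.last_migration) > kCooldownTicks;
        if (p.remote_streak >= kStreakThreshold && cooled_down) {
            p.location = accessor;           // Commit the move
            p.remote_streak = 0;
            p.last_migration = now;
            return true;
        }
        return false;
    }
};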

Issue 2: TLB Pressure

Problem: Too many TLB misses.
Solution: Use huge pages and improve data locality.

Issue 3: Fault Storms

Problem: Many simultaneous page faults.
Solution: Batch fault handling and prefetch predicted pages.

Conclusion

Page migration and fault handling are critical components of modern heterogeneous memory systems. Understanding the mechanisms—from hardware fault detection through TLB management to DMA transfers—enables developers to optimize applications for minimal overhead and maximum performance in CPU-GPU unified memory architectures.
