CUDA Unified Memory
Unified virtual address space enabling seamless CPU-GPU memory sharing with automatic page migration
CUDA Unified Memory Architecture
Unified Memory provides a single virtual address space accessible from both the CPU and the GPU, simplifying memory management and enabling automatic data migration between processors.
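For a concrete sense of what this looks like end to end, here is a minimal sketch (ours, not one of the original examples below): one managed allocation that the CPU writes, a small hypothetical `increment` kernel updates, and the CPU reads back through the same pointer.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element through the shared pointer.
__global__ void increment(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one allocation, one pointer

    for (int i = 0; i < n; i++) data[i] = float(i);  // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n);    // GPU uses the same pointer
    cudaDeviceSynchronize();                         // required before the CPU reads results

    printf("data[0] = %f\n", data[0]);               // CPU reads results directly
    cudaFree(data);
    return 0;
}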
Interactive visualization: the unified virtual address space shared by CPU and GPU, showing per-page status and access-pattern details (e.g., CPU accessing CPU-resident memory); pages migrate between processors on fault.
Unified Memory Code Examples
Basic Allocation
// Unified Memory allocation
float* data;
cudaMallocManaged(&data, size);
// Both CPU and GPU can access
data[0] = 1.0f; // CPU write
kernel<<<grid,block>>>(data); // GPU access
cudaDeviceSynchronize();      // wait before any further CPU access
cudaFree(data);
Prefetching & Hints
// Prefetch to GPU
cudaMemPrefetchAsync(data, size,
deviceId, stream);
// Advise preferred location
cudaMemAdvise(data, size,
cudaMemAdviseSetPreferredLocation,
deviceId);
Understanding Unified Memory
Traditional vs Unified Memory Model
Traditional CUDA Memory Model
// Traditional explicit memory management
float *h_data = (float*)malloc(size);
float *d_data;
cudaMalloc(&d_data, size);

// Initialize on CPU
for (int i = 0; i < n; i++) {
    h_data[i] = i;
}

// Explicit copy to GPU
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

// Launch kernel
kernel<<<grid, block>>>(d_data, n);

// Explicit copy back
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

// Cleanup
free(h_data);
cudaFree(d_data);
Unified Memory Model
// Unified Memory - single allocation
float *data;
cudaMallocManaged(&data, size);

// Initialize on CPU - no explicit copy needed
for (int i = 0; i < n; i++) {
    data[i] = i;
}

// Launch kernel - automatic migration
kernel<<<grid, block>>>(data, n);
cudaDeviceSynchronize();

// CPU can access results directly
printf("Result: %f\n", data[0]);

// Single free
cudaFree(data);
Page Migration Mechanism
Page Fault Handling
class PageFaultHandler:
    """Conceptual model of GPU page-fault handling for Unified Memory."""

    def __init__(self):
        self.page_size = 4096                   # 4KB pages
        self.migration_granularity = 64 * 1024  # 64KB chunks
        self.page_table = {}                    # page_id -> page metadata
        self.gpu_page_table = {}
        self.cpu_page_table = {}

    def handle_gpu_page_fault(self, virtual_addr, gpu_id, warp_id):
        """Handle page fault when GPU accesses CPU-resident memory"""
        # 1. Identify faulting page
        page_id = virtual_addr >> 12  # 4KB pages

        # 2. Check page location
        if self.page_table[page_id].location == 'CPU':
            # 3. Suspend accessing warp
            self.suspend_warp(gpu_id, warp_id)
            # 4. Initiate DMA transfer
            self.migrate_page_to_gpu(page_id)
            # 5. Update page tables
            self.update_page_tables(page_id, 'GPU')
            # 6. Resume warp execution
            self.resume_warp(gpu_id, warp_id)

    def migrate_page_to_gpu(self, page_id):
        """Migrate page from CPU to GPU memory"""
        # Calculate addresses
        cpu_addr = self.get_cpu_physical_addr(page_id)
        gpu_addr = self.allocate_gpu_page()

        # DMA transfer over PCIe
        self.dma_transfer(cpu_addr, gpu_addr, self.page_size)

        # Update mappings
        self.gpu_page_table[page_id] = gpu_addr
        self.cpu_page_table[page_id] = None  # Invalidate CPU mapping
Migration Granularity
// Different migration granularities
struct MigrationConfig {
    enum Granularity {
        PAGE_4KB   = 4096,
        CHUNK_64KB = 65536,
        BLOCK_2MB  = 2097152
    };

    size_t get_migration_size(void* addr, size_t access_size) {
        // Determine optimal migration granularity
        if (access_size < PAGE_4KB) {
            return PAGE_4KB;    // Minimum migration
        } else if (access_size < CHUNK_64KB) {
            return CHUNK_64KB;  // Common case
        } else {
            return BLOCK_2MB;   // Large transfers
        }
    }
};
Page Table Architecture
Multi-Level Page Tables
// x86-64 style 4-level page table for GPU
struct PageTableEntry {
    uint64_t present       : 1;   // Page in memory?
    uint64_t writable      : 1;   // Write permission
    uint64_t user          : 1;   // User mode access
    uint64_t write_through : 1;   // Cache policy
    uint64_t cache_disable : 1;   // Cache policy
    uint64_t accessed      : 1;   // Recently accessed
    uint64_t dirty         : 1;   // Modified
    uint64_t large_page    : 1;   // 2MB/1GB page
    uint64_t global        : 1;   // Global page
    uint64_t available     : 3;   // OS-specific
    uint64_t physical_addr : 40;  // Physical page number
    uint64_t reserved      : 11;  // Reserved
    uint64_t nx            : 1;   // No execute
};

// Virtual address translation
uint64_t translate_address(uint64_t virtual_addr) {
    // Extract indices for each level
    uint64_t pml4_idx = (virtual_addr >> 39) & 0x1FF;
    uint64_t pdpt_idx = (virtual_addr >> 30) & 0x1FF;
    uint64_t pd_idx   = (virtual_addr >> 21) & 0x1FF;
    uint64_t pt_idx   = (virtual_addr >> 12) & 0x1FF;
    uint64_t offset   = virtual_addr & 0xFFF;

    // Walk page tables
    PageTableEntry* pml4 = get_pml4_base();
    if (!pml4[pml4_idx].present) {
        trigger_page_fault(virtual_addr);
    }
    PageTableEntry* pdpt = (PageTableEntry*)(pml4[pml4_idx].physical_addr << 12);
    PageTableEntry* pd   = (PageTableEntry*)(pdpt[pdpt_idx].physical_addr << 12);
    PageTableEntry* pt   = (PageTableEntry*)(pd[pd_idx].physical_addr << 12);

    return (pt[pt_idx].physical_addr << 12) | offset;
}
Memory Coherency Models
GPU Memory Coherency
class MemoryCoherency {
public:
    enum Scope {
        CTA,    // Thread block coherent
        GPU,    // Device coherent
        SYSTEM  // System-wide coherent
    };

    // Memory fence operations
    __device__ void memory_fence(Scope scope) {
        switch (scope) {
            case CTA:
                __syncthreads();         // Block-level synchronization
                break;
            case GPU:
                __threadfence();         // Device-level fence
                break;
            case SYSTEM:
                __threadfence_system();  // System-wide fence
                break;
        }
    }

    // Atomic operations with scope
    __device__ int atomic_add_scoped(int* addr, int val, Scope scope) {
        switch (scope) {
            case CTA:
                return atomicAdd_block(addr, val);
            case GPU:
                return atomicAdd(addr, val);
            case SYSTEM:
                return atomicAdd_system(addr, val);
        }
        return 0;  // Unreachable; keeps all control paths returning a value
    }
};
Cache Coherency Protocol
class CacheCoherencyProtocol:
    """MOESI-style coherency protocol for CPU-GPU"""

    def __init__(self):
        self.states = {
            'M': 'Modified',   # Exclusive, dirty
            'O': 'Owned',      # Shared, dirty
            'E': 'Exclusive',  # Exclusive, clean
            'S': 'Shared',     # Shared, clean
            'I': 'Invalid'     # Invalid
        }

    def handle_read_request(self, cache_line, requester):
        current_state = cache_line.state
        if current_state == 'I':
            # Miss - fetch from memory
            self.fetch_from_memory(cache_line)
            cache_line.state = 'E' if self.is_exclusive() else 'S'
        elif current_state in ['M', 'O']:
            # Another cache has a dirty copy
            self.write_back(cache_line)
            cache_line.state = 'S'
        return cache_line.data

    def handle_write_request(self, cache_line, requester):
        # Invalidate other copies
        self.broadcast_invalidate(cache_line.address)
        # Update state to Modified
        cache_line.state = 'M'
        cache_line.dirty = True
Optimization Techniques
Memory Prefetching
// Prefetching strategies
struct AccessPattern {
    enum Type { SEQUENTIAL, STRIDED, RANDOM } type;
    size_t stride;  // Used only for STRIDED access
};

class PrefetchManager {
public:
    // Explicit prefetching
    void prefetch_to_gpu(void* ptr, size_t size, int device) {
        cudaMemPrefetchAsync(ptr, size, device, 0);
    }

    // Hint-based optimization
    void set_access_hints(void* ptr, size_t size) {
        // Advise read-mostly data
        cudaMemAdvise(ptr, size, cudaMemAdviseSetReadMostly, 0);

        // Set preferred location
        cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);

        // Mark as accessed by GPU
        cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, 0);  // GPU 0
    }

    // Adaptive prefetching based on access patterns
    void adaptive_prefetch(void* ptr, size_t size, AccessPattern pattern) {
        switch (pattern.type) {
            case AccessPattern::SEQUENTIAL:
                // Prefetch next pages
                prefetch_ahead(ptr, size, 4);  // 4 pages ahead
                break;
            case AccessPattern::STRIDED:
                // Prefetch based on stride
                prefetch_strided(ptr, size, pattern.stride);
                break;
            case AccessPattern::RANDOM:
                // No prefetching for random access
                break;
        }
    }
};
Migration Policies
class MigrationPolicy:
    def __init__(self):
        self.threshold = 1e-4  # Max acceptable migration cost (seconds); tunable
        self.policies = {
            'first_touch': self.first_touch,
            'preferred_location': self.preferred_location,
            'access_counting': self.access_counting,
            'bandwidth_aware': self.bandwidth_aware
        }

    def first_touch(self, page, accessor):
        """Allocate page where first touched"""
        if page.state == 'UNALLOCATED':
            if accessor == 'CPU':
                return self.allocate_cpu_page(page)
            else:
                return self.allocate_gpu_page(page)
        return page.location

    def preferred_location(self, page, accessor):
        """Keep the page at its advised location; other processors access it remotely"""
        return page.preferred_location

    def access_counting(self, page, accessor):
        """Migrate based on access frequency"""
        page.access_count[accessor] += 1

        # Calculate access ratio
        total = sum(page.access_count.values())
        ratio = page.access_count[accessor] / total

        # Migrate if accessor dominates
        if ratio > 0.8 and page.location != accessor:
            return self.migrate_page(page, accessor)
        return page.location

    def bandwidth_aware(self, page, accessor):
        """Consider available bandwidth"""
        available_bw = self.get_pcie_bandwidth()
        migration_cost = page.size / available_bw

        # Only migrate if cost is acceptable
        if migration_cost < self.threshold:
            return self.migrate_page(page, accessor)
        return page.location
Performance Analysis
Migration Overhead
def calculate_migration_overhead(page_size, pcie_gen, num_migrations):
    """Calculate time spent in page migrations"""
    # PCIe bandwidth (bytes/sec, x16 link)
    pcie_bandwidth = {
        3: 15.75e9,  # Gen3 x16
        4: 31.5e9,   # Gen4 x16
        5: 63e9      # Gen5 x16
    }
    bandwidth = pcie_bandwidth[pcie_gen]

    # Migration time per page
    latency = 1e-6  # 1 microsecond latency
    transfer_time = page_size / bandwidth
    total_time_per_migration = latency + transfer_time

    # Total overhead
    total_overhead = num_migrations * total_time_per_migration

    return {
        'per_migration_us': total_time_per_migration * 1e6,
        'total_overhead_ms': total_overhead * 1e3,
        'bandwidth_utilized': page_size * num_migrations / total_overhead
    }

# Example: 1000 4KB page migrations over PCIe Gen4
overhead = calculate_migration_overhead(4096, 4, 1000)
print(f"Per migration: {overhead['per_migration_us']:.2f} μs")
print(f"Total overhead: {overhead['total_overhead_ms']:.2f} ms")
Memory Access Patterns
// Optimizing for Unified Memory access patterns
template<typename T>
__global__ void optimized_kernel(T* data, int n) {
    // Coalesced access pattern
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread
    for (int i = tid; i < n; i += stride) {
        // Ensure coalesced memory access
        T val = data[i];
        // Computation
        val = process(val);
        // Coalesced write back
        data[i] = val;
    }
}

// Launch configuration for optimal memory access
void launch_optimized(float* data, int n) {
    // Calculate optimal block and grid sizes
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    // Limit grid size to avoid oversubscription
    numBlocks = min(numBlocks, 65536);

    // Prefetch data to GPU
    cudaMemPrefetchAsync(data, n * sizeof(float), 0);

    // Launch kernel
    optimized_kernel<<<numBlocks, blockSize>>>(data, n);

    // Prefetch results back to CPU if needed
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
}
Advanced Features
Memory Oversubscription
// Allocating more memory than physically available
class OversubscriptionManager {
private:
    size_t physical_memory;
    size_t virtual_allocated;

public:
    void* allocate_oversubscribed(size_t size) {
        void* ptr;

        // Can allocate beyond physical memory
        cudaMallocManaged(&ptr, size);
        virtual_allocated += size;

        // System will page in/out as needed
        if (virtual_allocated > physical_memory) {
            printf("Oversubscribed by %.2f%%\n",
                   100.0 * (virtual_allocated - physical_memory) / physical_memory);
        }
        return ptr;
    }

    void optimize_for_oversubscription(void* ptr, size_t size) {
        // Use memory advise for better paging
        cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);  // Keep in system memory

        // Allow GPU access without migration
        cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, 0);  // GPU 0
    }
};
Multi-GPU Unified Memory
// Unified Memory across multiple GPUs
class MultiGPUUnifiedMemory {
public:
    void setup_multi_gpu(void* ptr, size_t size, int num_gpus) {
        // Enable peer access between GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            for (int j = 0; j < num_gpus; j++) {
                if (i != j) {
                    cudaDeviceEnablePeerAccess(j, 0);
                }
            }
        }

        // Set memory as accessible by all GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, i);
        }
    }

    void distribute_work(void* data, size_t size, int num_gpus) {
        size_t chunk_size = size / num_gpus;

        // Launch kernels on multiple GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            void* chunk_ptr = (char*)data + i * chunk_size;

            // Prefetch chunk to this GPU
            cudaMemPrefetchAsync(chunk_ptr, chunk_size, i);

            // Launch kernel on this GPU
            // (grid and block are assumed to be configured for process_kernel's workload)
            process_kernel<<<grid, block>>>(chunk_ptr, chunk_size);
        }

        // Synchronize all GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            cudaDeviceSynchronize();
        }
    }
};
Best Practices
1. Access Pattern Optimization
- Use coalesced memory access patterns
- Minimize random access that triggers migrations
- Group accesses by processor (CPU or GPU)
2. Prefetching Strategy
- Prefetch data before kernel launch
- Use memory advise for access hints
- Consider async prefetching with streams (see the sketch below)
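For the last point, here is a hedged sketch of stream-ordered prefetching (the `compute` kernel and launch sizes are illustrative): the prefetch is queued on the same stream as the kernel, so it is asynchronous with respect to the host but completes before the kernel begins.

// Hypothetical kernel operating on the managed buffer.
__global__ void compute(float* data, int n);

void prefetch_on_stream(float* data, int n, int device) {
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stream-ordered prefetch: asynchronous w.r.t. the host,
    // but guaranteed to finish before the kernel queued behind it.
    cudaMemPrefetchAsync(data, bytes, device, stream);
    compute<<<(n + 255) / 256, 256, 0, stream>>>(data, n);

    // Prefetch results back toward the CPU while the host does other work.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}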
3. Migration Control
- Set preferred locations for frequently accessed data
- Use read-mostly hints for shared read data
- Control migration granularity based on access patterns
4. Performance Monitoring
// Profile Unified Memory behavior
cudaError_t profile_unified_memory() {
    // Enable profiling
    cudaProfilerStart();

    // Your Unified Memory code here

    // Stop profiling
    cudaProfilerStop();

    // Use nvprof or Nsight Systems to analyze:
    // - Page fault frequency
    // - Migration counts
    // - Transfer sizes
    // - Memory throughput
    return cudaSuccess;
}
Common Pitfalls and Solutions
1. Thrashing
Problem: The same pages migrate back and forth between CPU and GPU as both processors repeatedly touch them.
Solution: Use prefetching and access hints to keep data resident near its dominant accessor (see the sketch below).
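As an illustration of the fix (a sketch under our own naming; `update_on_gpu` is a hypothetical kernel), the version below migrates the data once before an all-GPU phase and once after, instead of letting every iteration fault pages back and forth:

// Hypothetical kernel that updates the managed buffer each iteration.
__global__ void update_on_gpu(float* data, int n);

void run_without_thrashing(float* data, int n, int iters, int device) {
    size_t bytes = n * sizeof(float);
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    // Pin the data near the GPU and migrate it once, up front.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, bytes, device, 0);

    // All-GPU phase: no CPU touches, so no pages migrate back.
    for (int iter = 0; iter < iters; iter++) {
        update_on_gpu<<<blocks, threads>>>(data, n);
    }
    cudaDeviceSynchronize();

    // Bring the results back to the CPU in one bulk transfer.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
}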
2. False Sharing
Problem: CPU and GPU access different data that happens to live on the same page, so the page migrates even though no bytes are actually shared.
Solution: Align and pad data structures to page boundaries so each processor's data lives on its own pages (see the sketch below).
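One way to realize the page-boundary alignment (an illustrative sketch; `kPageAlign`, `SharedBuffers`, and the 64KB granule are our assumptions, chosen to match the migration granularity discussed earlier):

#include <cstdint>

// Assumed migration granule; matches the 64KB chunk size discussed above.
constexpr size_t kPageAlign = 64 * 1024;

// Illustrative container: one region touched only by the CPU, one only by the GPU.
struct SharedBuffers {
    float* cpu_region;
    float* gpu_region;
};

SharedBuffers allocate_padded(size_t cpu_bytes, size_t gpu_bytes) {
    SharedBuffers buf{};
    char* base = nullptr;
    cudaMallocManaged(&base, cpu_bytes + gpu_bytes + 2 * kPageAlign);

    // Start the GPU-owned region on the next granule boundary after the
    // CPU-owned region, so the two regions never share a page.
    uintptr_t gpu_start =
        (reinterpret_cast<uintptr_t>(base) + cpu_bytes + kPageAlign - 1)
        & ~(uintptr_t)(kPageAlign - 1);
    buf.cpu_region = reinterpret_cast<float*>(base);
    buf.gpu_region = reinterpret_cast<float*>(gpu_start);

    // Optional hints: keep each region near its dominant accessor.
    cudaMemAdvise(buf.cpu_region, cpu_bytes,
                  cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(buf.gpu_region, gpu_bytes,
                  cudaMemAdviseSetPreferredLocation, 0);  // assumes GPU 0
    return buf;
}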
3. Unexpected Migrations
Problem: Hidden CPU access to managed data (for example, a host-side printf or validation check) silently triggers migrations back to the host.
Solution: Be aware of implicit CPU access and re-prefetch data to the GPU before the next kernel (see the sketch below).
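A small sketch of the effect and the countermeasure (the `step_kernel` name and launch sizes are illustrative): the host-side printf faults one page back to the CPU, so the data is re-prefetched before the next kernel launch.

// Hypothetical kernel representing one step of a GPU pipeline.
__global__ void step_kernel(float* data, int n);

void debug_then_continue(float* data, int n, int device) {
    size_t bytes = n * sizeof(float);

    step_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Implicit CPU access: this host-side read faults the page holding
    // data[0] back to system memory.
    printf("checkpoint: data[0] = %f\n", data[0]);

    // Restore GPU residency so the next kernel does not start with a
    // page fault and an on-demand migration.
    cudaMemPrefetchAsync(data, bytes, device, 0);
    step_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
}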
Conclusion
CUDA Unified Memory simplifies GPU programming by providing a single address space while maintaining performance through intelligent page migration and caching. Understanding the underlying mechanisms—page tables, migration policies, and coherency protocols—enables developers to write efficient code that leverages automatic memory management without sacrificing performance.
The key to success with Unified Memory is understanding access patterns and using the available hints and prefetching mechanisms to guide the runtime's migration decisions.