NVIDIA Unified Virtual Memory
Automatic memory management between CPU and GPU through page faulting and on-demand migration
Understanding NVIDIA Unified Virtual Memory
NVIDIA Unified Virtual Memory (UVM) is a memory management system that provides a single unified address space accessible from both CPU and GPU. Instead of manually copying data between host and device memory, UVM automatically migrates pages on-demand via page faulting. The system transparently handles memory oversubscription, allowing GPU memory to exceed physical VRAM by evicting pages to system RAM. This dramatically simplifies CUDA programming while enabling workloads larger than GPU memory.
The Manual Memory Management Problem
Traditional CUDA programming requires explicit memory management. Programmers must manually allocate separate memory spaces on CPU and GPU, then explicitly copy data between them. This creates substantial complexity and opportunities for error.
Code Comparison
Traditional CUDA requires verbose, error-prone code with explicit memory copies:
```cpp
// Traditional CUDA: Manual Memory Management

// Allocate on host
float *h_data;
h_data = (float*)malloc(N * sizeof(float));

// Allocate on device
float *d_data;
cudaMalloc(&d_data, N * sizeof(float));

// Initialize on host
for (int i = 0; i < N; i++) h_data[i] = i;

// Copy to device
cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice);

// Launch kernel
kernel<<<blocks, threads>>>(d_data);
cudaDeviceSynchronize();

// Copy back to host
cudaMemcpy(h_data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);

// Cleanup
free(h_data);
cudaFree(d_data);
```
15+ lines of boilerplate, with manual tracking of two separate pointers and explicit synchronization.
Unified Memory simplifies this dramatically:
```cpp
// Unified Memory: Automatic Management

// Single allocation
float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize (on CPU or GPU)
for (int i = 0; i < N; i++) data[i] = i;

// Launch kernel - UVM migrates automatically!
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

// Access results (on CPU or GPU)
printf("Result: %f\n", data[0]);

// Cleanup
cudaFree(data);
```
7 lines, no explicit copies! UVM handles migration transparently based on access patterns.
How Unified Virtual Memory Works
Page Faulting and On-Demand Migration
UVM operates on the principle of demand paging, similar to OS virtual memory. When the CPU or GPU accesses a page that is not currently resident in its local physical memory, a page fault occurs. The UVM driver intercepts the fault and migrates the required page to the faulting processor.
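To see demand paging in action, the sketch below times the same kernel twice (the `touch` kernel, buffer size, and launch geometry are illustrative placeholders): the first launch pays for page faults and migration because the pages were last touched by the CPU, while the second launch finds them already resident in VRAM.

```cpp
// Sketch: the first kernel launch faults managed pages in from CPU memory;
// the second launch finds them already resident in VRAM.
// Kernel name, buffer size, and launch geometry are illustrative placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *data, size_t n) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const size_t N = 1 << 26;                         // 64M floats = 256 MB
    float *data;
    cudaMallocManaged(&data, N * sizeof(float));
    for (size_t i = 0; i < N; i++) data[i] = 0.0f;    // pages are now resident on the CPU

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // First launch: every GPU access to a non-resident page faults and migrates it
    cudaEventRecord(start);
    touch<<<(N + 255) / 256, 256>>>(data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float cold;
    cudaEventElapsedTime(&cold, start, stop);

    // Second launch: the pages are already in VRAM, so no faults occur
    cudaEventRecord(start);
    touch<<<(N + 255) / 256, 256>>>(data, N);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float warm;
    cudaEventElapsedTime(&warm, start, stop);

    printf("cold (faulting) launch: %.2f ms, warm launch: %.2f ms\n", cold, warm);
    cudaFree(data);
    return 0;
}
```

On a PCIe-attached GPU the first launch is typically dominated by migration time, while the second reflects pure kernel execution.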
The nvidia-uvm Kernel Module
Unified Virtual Memory is implemented by the nvidia-uvm.ko kernel module, which works in conjunction with the main NVIDIA driver (nvidia.ko). The UVM module maintains page tables that track where each page currently resides—CPU memory, GPU memory, or both.
```bash
# Check if nvidia-uvm module is loaded
$ lsmod | grep nvidia
nvidia_uvm           1327104  0
nvidia_drm             69632  4
nvidia_modeset       1302528  5 nvidia_drm
nvidia              56934400  360 nvidia_uvm,nvidia_modeset

# UVM creates device files for user-space interaction
$ ls -l /dev/nvidia-uvm*
crw-rw-rw- 1 root root 243, 0 Nov  2 10:00 /dev/nvidia-uvm
crw-rw-rw- 1 root root 243, 1 Nov  2 10:00 /dev/nvidia-uvm-tools

# Check UVM statistics (requires root)
$ sudo cat /proc/driver/nvidia/uvm/stats
Fault Stats:
  Non-replayable Faults:  245632
  Replayable Faults:      1829456
  Thrashing:              23
Migration Stats:
  CPU to GPU:  156.2 GB
  GPU to CPU:  89.4 GB
  GPU to GPU:  12.1 GB
```
Page Migration Performance
The cost of page faults depends on page size, bandwidth, and whether pages can be migrated in bulk:
| Page Size | Transfer Time (PCIe Gen4) | Fault Overhead | Total Latency |
|---|---|---|---|
| 4 KB (standard) | ~0.25 μs | ~10-20 μs | ~10-20 μs |
| 64 KB | ~4 μs | ~10-20 μs | ~14-24 μs |
| 2 MB (huge page) | ~125 μs | ~10-20 μs | ~135-145 μs |
| 1 GB (bulk) | ~62 ms | ~10-20 μs | ~62 ms |
Why Page Faults Aren't Always Bad: While page faults add latency, UVM is smart about batching migrations. If a kernel accesses many pages in sequence, UVM can migrate them in bulk rather than one-by-one. Modern GPUs also support replayable faults, where the GPU continues executing other warps while waiting for page migration, hiding much of the latency.
Memory Oversubscription
One of UVM's most powerful features is memory oversubscription—the ability to allocate more managed memory than physical GPU VRAM. When GPU memory fills up, UVM evicts less-recently-used pages to system RAM, similar to OS page swapping.
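A minimal oversubscription sketch follows, assuming the system has enough RAM to back the allocation; the 1.5x factor and the `scale` kernel are illustrative. It allocates roughly 1.5x the currently free VRAM with cudaMallocManaged and still runs a kernel over the whole buffer, relying on UVM to evict cold pages to system RAM as it goes.

```cpp
// Sketch: oversubscribing GPU memory with a managed allocation.
// The 1.5x factor and the scale kernel are illustrative; this assumes the
// system has enough RAM to back the full allocation.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *data, size_t n, float factor) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= factor;
}

int main() {
    size_t freeBytes, totalBytes;
    cudaMemGetInfo(&freeBytes, &totalBytes);

    // Allocate ~1.5x the free VRAM: impossible with plain cudaMalloc,
    // but fine for managed memory, which can spill to system RAM.
    size_t n = (freeBytes + freeBytes / 2) / sizeof(float);
    float *data;
    if (cudaMallocManaged(&data, n * sizeof(float)) != cudaSuccess) {
        printf("managed allocation failed\n");
        return 1;
    }

    for (size_t i = 0; i < n; i++) data[i] = 1.0f;    // populate on the CPU

    // The kernel touches more data than fits in VRAM; UVM migrates pages in
    // on demand and evicts least-recently-used pages back to system RAM.
    scale<<<(n + 255) / 256, 256>>>(data, n, 2.0f);
    cudaDeviceSynchronize();

    printf("data[0] = %.1f, data[n-1] = %.1f\n", data[0], data[n - 1]);
    cudaFree(data);
    return 0;
}
```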
Eviction Policies
UVM uses a least-recently-used (LRU) eviction policy. When GPU VRAM is full and a new page must be migrated in, UVM selects the coldest page (least recently accessed) to evict to system RAM. The evicted page remains accessible—accessing it later triggers another page fault to bring it back.
Eviction Algorithm
- GPU memory fills up: All VRAM occupied by managed pages
- New page fault occurs: Kernel accesses page not in VRAM
- UVM selects victim page: Choose least-recently-used page in VRAM
- Evict victim to system RAM: DMA transfer VRAM → RAM (if page was modified)
- Fetch faulting page: DMA transfer RAM → VRAM
- Update page tables: Mark victim invalid in VRAM, faulting page valid in VRAM
Thrashing: When Oversubscription Goes Wrong
If a workload repeatedly accesses more data than fits in GPU memory, UVM enters a thrashing state where it constantly evicts and fetches pages. This destroys performance.
Detecting Thrashing
```bash
# Monitor UVM statistics while a kernel runs
$ watch -n 1 sudo cat /proc/driver/nvidia/uvm/stats
Fault Stats:
  Replayable Faults:  18294560   ← Rapidly increasing = thrashing!
Migration Stats:
  CPU to GPU:  256.8 GB          ← Much larger than VRAM = excessive migration
  GPU to CPU:  245.2 GB

# If faults/migrations grow rapidly, the workload is thrashing.
# Solution: reduce the working set size or use explicit memory management.
```
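One way to act on that advice is to bound the working set explicitly: process the buffer in chunks that fit comfortably in VRAM and stage each chunk with cudaMemPrefetchAsync (covered in the next section). A rough sketch, with the chunk size and the `process()` kernel as illustrative placeholders:

```cpp
// Sketch: bounding the working set by processing a large managed buffer in
// VRAM-sized chunks. The chunk size and the process() kernel are illustrative.
#include <algorithm>
#include <cuda_runtime.h>

__global__ void process(float *chunk, size_t count) {
    size_t i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) chunk[i] = chunk[i] * 2.0f + 1.0f;
}

void process_in_chunks(float *data, size_t n, int device) {
    const size_t chunkElems = (1ull << 30) / sizeof(float);   // 1 GB chunks

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t offset = 0; offset < n; offset += chunkElems) {
        size_t count = std::min(chunkElems, n - offset);

        // Migrate just this chunk to the GPU before the kernel touches it
        cudaMemPrefetchAsync(data + offset, count * sizeof(float), device, stream);
        process<<<(count + 255) / 256, 256, 0, stream>>>(data + offset, count);

        // Push the finished chunk back to system RAM so it can be evicted cheaply
        cudaMemPrefetchAsync(data + offset, count * sizeof(float), cudaCpuDeviceId, stream);
    }

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}
```

The chunks are processed serially here for clarity; overlapping prefetch and compute with multiple streams is a natural next step.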
Prefetching and Hints
To avoid page fault overhead, CUDA provides APIs to prefetch data and give UVM hints about memory access patterns.
cudaMemPrefetchAsync - Explicit Prefetching
```cpp
// Allocate managed memory
float *data;
cudaMallocManaged(&data, N * sizeof(float));

// Initialize on CPU
for (int i = 0; i < N; i++) data[i] = i;

// Prefetch to GPU before kernel launch
// This migrates data proactively, avoiding page faults in the kernel
int device = 0;
cudaMemPrefetchAsync(data, N * sizeof(float), device);

// Launch kernel - no page faults!
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

// Prefetch back to CPU for result processing
cudaMemPrefetchAsync(data, N * sizeof(float), cudaCpuDeviceId);
cudaDeviceSynchronize();   // wait for the prefetch to finish before CPU access

// Access on CPU - no page faults!
printf("Result: %f\n", data[0]);
```
Prefetching eliminates page fault overhead by migrating data before it's accessed. This is especially beneficial for:
- Large bulk transfers where fault overhead dominates
- Predictable access patterns (e.g., sequential processing)
- Performance-critical kernels where page fault latency is unacceptable
cudaMemAdvise - Memory Access Hints
The cudaMemAdvise API allows providing hints to UVM about how memory will be accessed, enabling optimizations:
```cpp
// Advise that data will be mostly read
cudaMemAdvise(data, size, cudaMemAdviseSetReadMostly, device);

// Advise a preferred location (reduces migrations)
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);

// Advise that data will be accessed by a specific device
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device);
```
| Advice Flag | Effect | Use Case |
|---|---|---|
| cudaMemAdviseSetReadMostly | Creates read-only replicas on multiple processors | Read-only data accessed by CPU and GPU simultaneously |
| cudaMemAdviseSetPreferredLocation | Sets preferred memory location, minimizing migrations | Data primarily accessed by one processor |
| cudaMemAdviseSetAccessedBy | Establishes a direct mapping, avoiding migrations | Fine-grained CPU-GPU sharing over fast interconnects |
| cudaMemAdviseUnsetReadMostly | Removes read-only replica status | Data becomes writable again |
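As a concrete example of the read-mostly hint, the sketch below (the table size and the naive reduction kernel are illustrative) marks a lookup table as read-mostly so that prefetching creates a GPU replica while the CPU copy stays valid, letting both processors read without migrating pages back and forth.

```cpp
// Sketch: cudaMemAdviseSetReadMostly for a lookup table read by both CPU and GPU.
// Table size and the reduction kernel are illustrative placeholders.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_table(const float *table, size_t n, float *out) {
    // Naive single-thread reduction, just to generate GPU reads
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        float s = 0.0f;
        for (size_t i = 0; i < n; i++) s += table[i];
        *out = s;
    }
}

int main() {
    const size_t n = 1 << 20;
    float *table, *out;
    cudaMallocManaged(&table, n * sizeof(float));
    cudaMallocManaged(&out, sizeof(float));
    for (size_t i = 0; i < n; i++) table[i] = 1.0f;   // CPU copy of the table

    // Reads now create per-processor read-only copies instead of migrating pages
    cudaMemAdvise(table, n * sizeof(float), cudaMemAdviseSetReadMostly, 0);

    // Prefetching read-mostly data creates a GPU replica; the CPU copy stays valid
    cudaMemPrefetchAsync(table, n * sizeof(float), 0);

    sum_table<<<1, 32>>>(table, n, out);              // GPU reads its replica

    // CPU reads its own copy with no faults or migrations.
    // (Doing this while the kernel runs requires concurrentManagedAccess,
    //  i.e. Pascal or newer on Linux; otherwise move it after the synchronize.)
    float cpu_sum = 0.0f;
    for (size_t i = 0; i < n; i++) cpu_sum += table[i];

    cudaDeviceSynchronize();
    printf("GPU sum %.0f, CPU sum %.0f\n", *out, cpu_sum);

    cudaFree(table);
    cudaFree(out);
    return 0;
}
```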
Performance Characteristics
When Unified Memory Performs Well
Ideal Use Cases for UVM:
- Development and prototyping: Simplifies code, allows rapid iteration without manual memory management complexity
- Compute-bound workloads: If computation time far exceeds memory transfer time, page fault overhead is negligible
- Working sets fit in GPU memory: After initial page faults (warmup), subsequent kernel launches have no overhead
- Sparse memory access: Only accessed pages are migrated, saving bandwidth for sparse datasets
- Memory oversubscription: Enable workloads larger than GPU memory (with performance trade-off)
When Unified Memory Underperforms
Situations Where UVM Adds Overhead:
- Memory-intensive kernels: Kernels with high memory bandwidth demands suffer from page fault latency
- Random access patterns: Unpredictable access prevents effective prefetching, causing many page faults
- Excessive oversubscription: Thrashing degrades performance catastrophically
- Latency-sensitive workloads: Page fault latency (10-50 μs) unacceptable for real-time applications
- Frequent CPU-GPU ping-pong: Repeatedly migrating data back and forth wastes bandwidth
Performance Comparison: Traditional vs. UVM
Traditional cudaMemcpy (explicit):
- Bandwidth: ~28 GB/s (PCIe Gen4 practical limit)
- Latency: ~35 ms for 1 GB
- Overhead: Launch overhead + synchronization
UVM (demand paging, cold start):
- First access: ~40 ms for 1 GB (includes page faults)
- Subsequent access: ~0 μs (data already in VRAM)
UVM (prefetched):
- Bandwidth: ~28 GB/s (same as explicit copy)
- Latency: ~35 ms for 1 GB
- Overhead: Minimal (async prefetch)
Takeaway: With proper prefetching, UVM performance matches explicit memory management. Without prefetching, first access pays page fault cost, but subsequent accesses are free. For iterative workloads (training loops), first-iteration overhead is amortized.
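A quick way to check these numbers on your own system is to time a 1 GB transfer both ways: an explicit cudaMemcpy from pinned host memory versus a bulk cudaMemPrefetchAsync of a managed buffer. A sketch, with the 1 GB buffer size as an illustrative choice:

```cpp
// Sketch: comparing an explicit 1 GB copy against a bulk managed-memory prefetch.
// The 1 GiB buffer size is an illustrative choice.
#include <cstdio>
#include <cstring>
#include <cuda_runtime.h>

int main() {
    const size_t bytes = 1ull << 30;          // 1 GiB
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    float ms;

    // Explicit management: pinned host buffer + cudaMemcpy
    float *h, *d;
    cudaMallocHost(&h, bytes);
    cudaMalloc(&d, bytes);
    cudaEventRecord(start);
    cudaMemcpy(d, h, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemcpy:           %.1f ms (%.1f GB/s)\n", ms, bytes / ms / 1e6);

    // Unified Memory: populate on the CPU, then bulk-prefetch to the GPU
    float *m;
    cudaMallocManaged(&m, bytes);
    memset(m, 0, bytes);                      // make pages resident on the CPU first
    cudaEventRecord(start);
    cudaMemPrefetchAsync(m, bytes, 0);        // migrate everything in one sweep
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(&ms, start, stop);
    printf("cudaMemPrefetchAsync: %.1f ms (%.1f GB/s)\n", ms, bytes / ms / 1e6);

    cudaFreeHost(h);
    cudaFree(d);
    cudaFree(m);
    return 0;
}
```

Both paths move the same bytes over the same interconnect, so with bulk prefetching the measured bandwidths should land in the same ballpark.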
Using Unified Memory in Practice
Basic Usage Pattern
```cpp
// 1. Allocate managed memory
float *data;
size_t size = N * sizeof(float);
cudaMallocManaged(&data, size);

// 2. Initialize (CPU or GPU)
for (int i = 0; i < N; i++) data[i] = init_value(i);

// 3. (Optional) Prefetch to GPU
cudaMemPrefetchAsync(data, size, 0);

// 4. Launch kernel
kernel<<<blocks, threads>>>(data);
cudaDeviceSynchronize();

// 5. (Optional) Prefetch back to CPU
cudaMemPrefetchAsync(data, size, cudaCpuDeviceId);

// 6. Use results on CPU
printf("Result: %f\n", data[0]);

// 7. Cleanup
cudaFree(data);
```
Multi-GPU with Unified Memory
```cpp
// Allocate managed memory
float *data;
cudaMallocManaged(&data, size);

// Set preferred locations and access patterns
int device0 = 0, device1 = 1;

// Split the data: first half prefers GPU 0, second half prefers GPU 1
size_t halfBytes = size / 2;
size_t halfElems = halfBytes / sizeof(float);
cudaMemAdvise(data, halfBytes, cudaMemAdviseSetPreferredLocation, device0);
cudaMemAdvise(data + halfElems, halfBytes, cudaMemAdviseSetPreferredLocation, device1);

// Allow both GPUs to access all of the data
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device0);
cudaMemAdvise(data, size, cudaMemAdviseSetAccessedBy, device1);

// Launch kernels on both GPUs, each over its half (element ranges)
cudaSetDevice(0);
kernel<<<blocks, threads>>>(data, 0, halfElems);
cudaSetDevice(1);
kernel<<<blocks, threads>>>(data, halfElems, 2 * halfElems);

// Synchronize both
cudaSetDevice(0);
cudaDeviceSynchronize();
cudaSetDevice(1);
cudaDeviceSynchronize();
```
Deep Learning Framework Integration
PyTorch and TensorFlow don't use Unified Memory by default—they use explicit memory management for predictable performance. However, you can allocate managed tensors manually:
```python
# PyTorch: create a tensor backed by managed memory (requires a C++ extension)
# Not directly supported: PyTorch uses explicit cudaMalloc by default.
# However, for custom CUDA operations:
import torch
from torch.utils.cpp_extension import load

# Compile a custom C++/CUDA extension that uses cudaMallocManaged
uvm_ops = load(
    name="uvm_ops",
    sources=["uvm_ops.cu"],
    extra_cuda_cflags=["-O3"],
)

# In uvm_ops.cu:
#   torch::Tensor create_managed_tensor(int64_t size) {
#       float *data;
#       cudaMallocManaged(&data, size * sizeof(float));
#       return torch::from_blob(data, {size});
#   }
```
Debugging and Troubleshooting
Checking UVM Support
```cpp
// Check whether the GPU supports managed memory
int device = 0;
cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, device);

if (prop.managedMemory) {
    printf("GPU supports Unified Memory\n");
    printf("Concurrent managed access: %d\n", prop.concurrentManagedAccess);
    printf("Pageable memory access:    %d\n", prop.pageableMemoryAccess);
} else {
    printf("GPU does NOT support Unified Memory\n");
}
```
GPU Requirements for UVM:
- Compute Capability 3.0+: Basic managed memory support (Pascal and newer recommended)
- Compute Capability 6.0+ (Pascal): Hardware page faulting, on-demand migration, memory oversubscription, and concurrent CPU/GPU access (reported via concurrentManagedAccess on supported platforms such as Linux)
- Compute Capability 7.0+ (Volta): Access counters for smarter migration decisions and coherent access over NVLink
Common Issues
Error: "cudaErrorIllegalAddress" with managed memory
Cause: Accessing managed memory from CPU while GPU kernel is running (pre-Pascal GPUs) or accessing out-of-bounds.
Solution:
```cpp
// Always synchronize before CPU access on older GPUs
kernel<<<...>>>(data);
cudaDeviceSynchronize();   // ← REQUIRED before CPU access
printf("%f\n", data[0]);

// On GPUs that report concurrentManagedAccess (Pascal and newer on Linux),
// concurrent access works:
kernel<<<...>>>(data);
printf("%f\n", data[0]);   // ← OK without sync
```
Performance Issue: Kernel running slower with UVM
Cause: Page faults during kernel execution.
Solution:
```cpp
// Profile with nvprof/Nsight to see page faults:
//   $ nvprof --print-gpu-trace ./app
// If you see many page faults:

// 1. Add prefetching
cudaMemPrefetchAsync(data, size, device);

// 2. Or use cudaMemAdvise
cudaMemAdvise(data, size, cudaMemAdviseSetPreferredLocation, device);

// 3. For best performance: switch to explicit management
cudaMalloc(&d_data, size);
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);
```
Conclusion
NVIDIA Unified Virtual Memory represents a fundamental shift in GPU programming—from manual, explicit memory management to automatic, demand-driven migration. By implementing a virtual memory system spanning CPU and GPU address spaces, UVM dramatically simplifies CUDA code while enabling memory oversubscription for workloads exceeding GPU capacity.
The key insight is page faulting with automatic migration. When GPU accesses unmapped memory, the nvidia-uvm kernel module intercepts the fault, identifies the required page's location, and initiates DMA transfer. This happens transparently—programmers write simple code accessing a unified pointer, and UVM handles the complexity behind the scenes.
However, UVM is not a silver bullet. Page faults add latency (~10-50 μs per fault), which can accumulate for memory-intensive kernels. Thrashing from excessive oversubscription can destroy performance. For production workloads demanding maximum throughput, explicit memory management with cudaMemcpy often remains the best choice.
The sweet spot for UVM is development, prototyping, and workloads with predictable access patterns. Combined with prefetching (cudaMemPrefetchAsync) and hints (cudaMemAdvise), UVM can match explicit management performance while maintaining code simplicity. For researchers iterating on algorithms or engineers building complex applications, UVM eliminates an entire class of memory management bugs, allowing focus on the actual computation rather than data movement logistics.
As GPU memory capacities grow and interconnect speeds increase (PCIe Gen5, CXL), the performance gap between UVM and explicit management narrows. Future GPUs will likely make Unified Memory the default, with explicit management reserved for specialized optimization. Understanding UVM today prepares you for this future while making current CUDA development substantially simpler.
