CUDA Unified Memory
Unified virtual address space enabling seamless CPU-GPU memory sharing with automatic page migration
CUDA Unified Memory Architecture
Unified Memory provides a single virtual address space accessible from both the CPU and the GPU, simplifying memory management and enabling automatic data migration between processors.
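For a concrete sense of what this looks like end to end, here is a minimal sketch (ours, not one of the original examples below): one managed allocation that the CPU writes, a small hypothetical `increment` kernel updates, and the CPU reads back through the same pointer.

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: increments every element through the shared pointer.
__global__ void increment(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    float* data = nullptr;
    cudaMallocManaged(&data, n * sizeof(float));   // one allocation, one pointer

    for (int i = 0; i < n; i++) data[i] = float(i);  // CPU writes directly

    increment<<<(n + 255) / 256, 256>>>(data, n);    // GPU uses the same pointer
    cudaDeviceSynchronize();                         // required before the CPU reads results

    printf("data[0] = %f\n", data[0]);               // CPU reads results directly
    cudaFree(data);
    return 0;
}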
Interactive visualization: the unified virtual address space shared by CPU and GPU, showing per-page status and access-pattern details (e.g., CPU accessing CPU-resident memory); pages migrate between processors on fault.
Unified Memory Code Examples
Basic Allocation
// Unified Memory allocation
float* data;
cudaMallocManaged(&data, size);
// Both CPU and GPU can access
data[0] = 1.0f; // CPU write
kernel<<<grid,block>>>(data); // GPU access
cudaDeviceSynchronize();      // wait before any further CPU access
cudaFree(data);
Prefetching & Hints
// Prefetch to GPU
cudaMemPrefetchAsync(data, size,
deviceId, stream);
// Advise preferred location
cudaMemAdvise(data, size,
cudaMemAdviseSetPreferredLocation,
deviceId);
Understanding Unified Memory
Traditional vs Unified Memory Model
Traditional CUDA Memory Model
// Traditional explicit memory management
float *h_data = (float*)malloc(size);
float *d_data;
cudaMalloc(&d_data, size);

// Initialize on CPU
for (int i = 0; i < n; i++) {
    h_data[i] = i;
}

// Explicit copy to GPU
cudaMemcpy(d_data, h_data, size, cudaMemcpyHostToDevice);

// Launch kernel
kernel<<<grid, block>>>(d_data, n);

// Explicit copy back
cudaMemcpy(h_data, d_data, size, cudaMemcpyDeviceToHost);

// Cleanup
free(h_data);
cudaFree(d_data);
Unified Memory Model
// Unified Memory - single allocation
float *data;
cudaMallocManaged(&data, size);

// Initialize on CPU - no explicit copy needed
for (int i = 0; i < n; i++) {
    data[i] = i;
}

// Launch kernel - automatic migration
kernel<<<grid, block>>>(data, n);
cudaDeviceSynchronize();

// CPU can access results directly
printf("Result: %f\n", data[0]);

// Single free
cudaFree(data);
Page Migration Mechanism
Page Fault Handling
class PageFaultHandler:
    """Conceptual model of GPU page-fault handling for Unified Memory."""

    def __init__(self):
        self.page_size = 4096                   # 4KB pages
        self.migration_granularity = 64 * 1024  # 64KB chunks
        self.page_table = {}                    # page_id -> page metadata
        self.gpu_page_table = {}
        self.cpu_page_table = {}

    def handle_gpu_page_fault(self, virtual_addr, gpu_id, warp_id):
        """Handle page fault when GPU accesses CPU-resident memory"""
        # 1. Identify faulting page
        page_id = virtual_addr >> 12  # 4KB pages

        # 2. Check page location
        if self.page_table[page_id].location == 'CPU':
            # 3. Suspend accessing warp
            self.suspend_warp(gpu_id, warp_id)
            # 4. Initiate DMA transfer
            self.migrate_page_to_gpu(page_id)
            # 5. Update page tables
            self.update_page_tables(page_id, 'GPU')
            # 6. Resume warp execution
            self.resume_warp(gpu_id, warp_id)

    def migrate_page_to_gpu(self, page_id):
        """Migrate page from CPU to GPU memory"""
        # Calculate addresses
        cpu_addr = self.get_cpu_physical_addr(page_id)
        gpu_addr = self.allocate_gpu_page()

        # DMA transfer over PCIe
        self.dma_transfer(cpu_addr, gpu_addr, self.page_size)

        # Update mappings
        self.gpu_page_table[page_id] = gpu_addr
        self.cpu_page_table[page_id] = None  # Invalidate CPU mapping
Migration Granularity
// Different migration granularities
struct MigrationConfig {
    enum Granularity {
        PAGE_4KB   = 4096,
        CHUNK_64KB = 65536,
        BLOCK_2MB  = 2097152
    };

    size_t get_migration_size(void* addr, size_t access_size) {
        // Determine optimal migration granularity
        if (access_size < PAGE_4KB) {
            return PAGE_4KB;    // Minimum migration
        } else if (access_size < CHUNK_64KB) {
            return CHUNK_64KB;  // Common case
        } else {
            return BLOCK_2MB;   // Large transfers
        }
    }
};
Page Table Architecture
Multi-Level Page Tables
// x86-64 style 4-level page table for GPU
struct PageTableEntry {
    uint64_t present       : 1;   // Page in memory?
    uint64_t writable      : 1;   // Write permission
    uint64_t user          : 1;   // User mode access
    uint64_t write_through : 1;   // Cache policy
    uint64_t cache_disable : 1;   // Cache policy
    uint64_t accessed      : 1;   // Recently accessed
    uint64_t dirty         : 1;   // Modified
    uint64_t large_page    : 1;   // 2MB/1GB page
    uint64_t global        : 1;   // Global page
    uint64_t available     : 3;   // OS-specific
    uint64_t physical_addr : 40;  // Physical page number
    uint64_t reserved      : 11;  // Reserved
    uint64_t nx            : 1;   // No execute
};

// Virtual address translation
uint64_t translate_address(uint64_t virtual_addr) {
    // Extract indices for each level
    uint64_t pml4_idx = (virtual_addr >> 39) & 0x1FF;
    uint64_t pdpt_idx = (virtual_addr >> 30) & 0x1FF;
    uint64_t pd_idx   = (virtual_addr >> 21) & 0x1FF;
    uint64_t pt_idx   = (virtual_addr >> 12) & 0x1FF;
    uint64_t offset   = virtual_addr & 0xFFF;

    // Walk page tables
    PageTableEntry* pml4 = get_pml4_base();
    if (!pml4[pml4_idx].present) {
        trigger_page_fault(virtual_addr);
    }
    PageTableEntry* pdpt = (PageTableEntry*)(pml4[pml4_idx].physical_addr << 12);
    PageTableEntry* pd   = (PageTableEntry*)(pdpt[pdpt_idx].physical_addr << 12);
    PageTableEntry* pt   = (PageTableEntry*)(pd[pd_idx].physical_addr << 12);

    return (pt[pt_idx].physical_addr << 12) | offset;
}
Memory Coherency Models
GPU Memory Coherency
class MemoryCoherency {
public:
    enum Scope {
        CTA,    // Thread block coherent
        GPU,    // Device coherent
        SYSTEM  // System-wide coherent
    };

    // Memory fence operations
    __device__ void memory_fence(Scope scope) {
        switch (scope) {
            case CTA:
                __syncthreads();         // Block-level synchronization
                break;
            case GPU:
                __threadfence();         // Device-level fence
                break;
            case SYSTEM:
                __threadfence_system();  // System-wide fence
                break;
        }
    }

    // Atomic operations with scope
    __device__ int atomic_add_scoped(int* addr, int val, Scope scope) {
        switch (scope) {
            case CTA:
                return atomicAdd_block(addr, val);
            case GPU:
                return atomicAdd(addr, val);
            case SYSTEM:
                return atomicAdd_system(addr, val);
        }
        return 0;  // Unreachable; keeps all control paths returning a value
    }
};
Cache Coherency Protocol
class CacheCoherencyProtocol:
    """MOESI-style coherency protocol for CPU-GPU"""

    def __init__(self):
        self.states = {
            'M': 'Modified',   # Exclusive, dirty
            'O': 'Owned',      # Shared, dirty
            'E': 'Exclusive',  # Exclusive, clean
            'S': 'Shared',     # Shared, clean
            'I': 'Invalid'     # Invalid
        }

    def handle_read_request(self, cache_line, requester):
        current_state = cache_line.state
        if current_state == 'I':
            # Miss - fetch from memory
            self.fetch_from_memory(cache_line)
            cache_line.state = 'E' if self.is_exclusive() else 'S'
        elif current_state in ['M', 'O']:
            # Another cache has a dirty copy
            self.write_back(cache_line)
            cache_line.state = 'S'
        return cache_line.data

    def handle_write_request(self, cache_line, requester):
        # Invalidate other copies
        self.broadcast_invalidate(cache_line.address)
        # Update state to Modified
        cache_line.state = 'M'
        cache_line.dirty = True
Optimization Techniques
Memory Prefetching
// Prefetching strategies
struct AccessPattern {
    enum Type { SEQUENTIAL, STRIDED, RANDOM } type;
    size_t stride;  // Used only for STRIDED access
};

class PrefetchManager {
public:
    // Explicit prefetching
    void prefetch_to_gpu(void* ptr, size_t size, int device) {
        cudaMemPrefetchAsync(ptr, size, device, 0);
    }

    // Hint-based optimization
    void set_access_hints(void* ptr, size_t size) {
        // Advise read-mostly data
        cudaMemAdvise(ptr, size, cudaMemAdviseSetReadMostly, 0);

        // Set preferred location
        cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);

        // Mark as accessed by GPU
        cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, 0);  // GPU 0
    }

    // Adaptive prefetching based on access patterns
    void adaptive_prefetch(void* ptr, size_t size, AccessPattern pattern) {
        switch (pattern.type) {
            case AccessPattern::SEQUENTIAL:
                // Prefetch next pages
                prefetch_ahead(ptr, size, 4);  // 4 pages ahead
                break;
            case AccessPattern::STRIDED:
                // Prefetch based on stride
                prefetch_strided(ptr, size, pattern.stride);
                break;
            case AccessPattern::RANDOM:
                // No prefetching for random access
                break;
        }
    }
};
Migration Policies
class MigrationPolicy:
    def __init__(self):
        self.threshold = 1e-4  # Max acceptable migration cost (seconds); tunable
        self.policies = {
            'first_touch': self.first_touch,
            'preferred_location': self.preferred_location,
            'access_counting': self.access_counting,
            'bandwidth_aware': self.bandwidth_aware
        }

    def first_touch(self, page, accessor):
        """Allocate page where first touched"""
        if page.state == 'UNALLOCATED':
            if accessor == 'CPU':
                return self.allocate_cpu_page(page)
            else:
                return self.allocate_gpu_page(page)
        return page.location

    def preferred_location(self, page, accessor):
        """Keep the page at its advised location; other processors access it remotely"""
        return page.preferred_location

    def access_counting(self, page, accessor):
        """Migrate based on access frequency"""
        page.access_count[accessor] += 1

        # Calculate access ratio
        total = sum(page.access_count.values())
        ratio = page.access_count[accessor] / total

        # Migrate if accessor dominates
        if ratio > 0.8 and page.location != accessor:
            return self.migrate_page(page, accessor)
        return page.location

    def bandwidth_aware(self, page, accessor):
        """Consider available bandwidth"""
        available_bw = self.get_pcie_bandwidth()
        migration_cost = page.size / available_bw

        # Only migrate if cost is acceptable
        if migration_cost < self.threshold:
            return self.migrate_page(page, accessor)
        return page.location
Performance Analysis
Migration Overhead
def calculate_migration_overhead(page_size, pcie_gen, num_migrations):
    """Calculate time spent in page migrations"""
    # PCIe bandwidth (bytes/sec, x16 link)
    pcie_bandwidth = {
        3: 15.75e9,  # Gen3 x16
        4: 31.5e9,   # Gen4 x16
        5: 63e9      # Gen5 x16
    }
    bandwidth = pcie_bandwidth[pcie_gen]

    # Migration time per page
    latency = 1e-6  # 1 microsecond latency
    transfer_time = page_size / bandwidth
    total_time_per_migration = latency + transfer_time

    # Total overhead
    total_overhead = num_migrations * total_time_per_migration

    return {
        'per_migration_us': total_time_per_migration * 1e6,
        'total_overhead_ms': total_overhead * 1e3,
        'bandwidth_utilized': page_size * num_migrations / total_overhead
    }

# Example: 1000 4KB page migrations over PCIe Gen4
overhead = calculate_migration_overhead(4096, 4, 1000)
print(f"Per migration: {overhead['per_migration_us']:.2f} μs")
print(f"Total overhead: {overhead['total_overhead_ms']:.2f} ms")
Memory Access Patterns
// Optimizing for Unified Memory access patterns
template<typename T>
__global__ void optimized_kernel(T* data, int n) {
    // Coalesced access pattern
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = gridDim.x * blockDim.x;

    // Process multiple elements per thread
    for (int i = tid; i < n; i += stride) {
        // Ensure coalesced memory access
        T val = data[i];
        // Computation
        val = process(val);
        // Coalesced write back
        data[i] = val;
    }
}

// Launch configuration for optimal memory access
void launch_optimized(float* data, int n) {
    // Calculate optimal block and grid sizes
    int blockSize = 256;
    int numBlocks = (n + blockSize - 1) / blockSize;

    // Limit grid size to avoid oversubscription
    numBlocks = min(numBlocks, 65536);

    // Prefetch data to GPU
    cudaMemPrefetchAsync(data, n * sizeof(float), 0);

    // Launch kernel
    optimized_kernel<<<numBlocks, blockSize>>>(data, n);

    // Prefetch results back to CPU if needed
    cudaMemPrefetchAsync(data, n * sizeof(float), cudaCpuDeviceId);
}
Advanced Features
Memory Oversubscription
// Allocating more memory than physically available
class OversubscriptionManager {
private:
    size_t physical_memory;
    size_t virtual_allocated;

public:
    void* allocate_oversubscribed(size_t size) {
        void* ptr;

        // Can allocate beyond physical memory
        cudaMallocManaged(&ptr, size);
        virtual_allocated += size;

        // System will page in/out as needed
        if (virtual_allocated > physical_memory) {
            printf("Oversubscribed by %.2f%%\n",
                   100.0 * (virtual_allocated - physical_memory) / physical_memory);
        }
        return ptr;
    }

    void optimize_for_oversubscription(void* ptr, size_t size) {
        // Use memory advise for better paging
        cudaMemAdvise(ptr, size, cudaMemAdviseSetPreferredLocation,
                      cudaCpuDeviceId);  // Keep in system memory

        // Allow GPU access without migration
        cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, 0);  // GPU 0
    }
};
Multi-GPU Unified Memory
// Unified Memory across multiple GPUs
class MultiGPUUnifiedMemory {
public:
    void setup_multi_gpu(void* ptr, size_t size, int num_gpus) {
        // Enable peer access between GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            for (int j = 0; j < num_gpus; j++) {
                if (i != j) {
                    cudaDeviceEnablePeerAccess(j, 0);
                }
            }
        }

        // Set memory as accessible by all GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaMemAdvise(ptr, size, cudaMemAdviseSetAccessedBy, i);
        }
    }

    void distribute_work(void* data, size_t size, int num_gpus) {
        size_t chunk_size = size / num_gpus;

        // Launch kernels on multiple GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            void* chunk_ptr = (char*)data + i * chunk_size;

            // Prefetch chunk to this GPU
            cudaMemPrefetchAsync(chunk_ptr, chunk_size, i);

            // Launch kernel on this GPU
            // (grid and block are assumed to be configured for process_kernel's workload)
            process_kernel<<<grid, block>>>(chunk_ptr, chunk_size);
        }

        // Synchronize all GPUs
        for (int i = 0; i < num_gpus; i++) {
            cudaSetDevice(i);
            cudaDeviceSynchronize();
        }
    }
};
Best Practices
1. Access Pattern Optimization
- Use coalesced memory access patterns
- Minimize random access that triggers migrations
- Group accesses by processor (CPU or GPU)
2. Prefetching Strategy
- Prefetch data before kernel launch
- Use memory advise for access hints
- Consider async prefetching with streams (see the sketch below)
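For the last point, here is a hedged sketch of stream-ordered prefetching (the `compute` kernel and launch sizes are illustrative): the prefetch is queued on the same stream as the kernel, so it is asynchronous with respect to the host but completes before the kernel begins.

// Hypothetical kernel operating on the managed buffer.
__global__ void compute(float* data, int n);

void prefetch_on_stream(float* data, int n, int device) {
    size_t bytes = n * sizeof(float);
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // Stream-ordered prefetch: asynchronous w.r.t. the host,
    // but guaranteed to finish before the kernel queued behind it.
    cudaMemPrefetchAsync(data, bytes, device, stream);
    compute<<<(n + 255) / 256, 256, 0, stream>>>(data, n);

    // Prefetch results back toward the CPU while the host does other work.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, stream);

    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}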
3. Migration Control
- Set preferred locations for frequently accessed data
- Use read-mostly hints for shared read data
- Control migration granularity based on access patterns
4. Performance Monitoring
// Profile Unified Memory behavior
cudaError_t profile_unified_memory() {
    // Enable profiling
    cudaProfilerStart();

    // Your Unified Memory code here

    // Stop profiling
    cudaProfilerStop();

    // Use nvprof or Nsight Systems to analyze:
    // - Page fault frequency
    // - Migration counts
    // - Transfer sizes
    // - Memory throughput
    return cudaSuccess;
}
Common Pitfalls and Solutions
1. Thrashing
Problem: The same pages migrate back and forth between CPU and GPU as both processors repeatedly touch them.
Solution: Use prefetching and access hints to keep data resident near its dominant accessor (see the sketch below).
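As an illustration of the fix (a sketch under our own naming; `update_on_gpu` is a hypothetical kernel), the version below migrates the data once before an all-GPU phase and once after, instead of letting every iteration fault pages back and forth:

// Hypothetical kernel that updates the managed buffer each iteration.
__global__ void update_on_gpu(float* data, int n);

void run_without_thrashing(float* data, int n, int iters, int device) {
    size_t bytes = n * sizeof(float);
    int threads = 256;
    int blocks  = (n + threads - 1) / threads;

    // Pin the data near the GPU and migrate it once, up front.
    cudaMemAdvise(data, bytes, cudaMemAdviseSetPreferredLocation, device);
    cudaMemPrefetchAsync(data, bytes, device, 0);

    // All-GPU phase: no CPU touches, so no pages migrate back.
    for (int iter = 0; iter < iters; iter++) {
        update_on_gpu<<<blocks, threads>>>(data, n);
    }
    cudaDeviceSynchronize();

    // Bring the results back to the CPU in one bulk transfer.
    cudaMemPrefetchAsync(data, bytes, cudaCpuDeviceId, 0);
    cudaDeviceSynchronize();
}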
2. False Sharing
Problem: CPU and GPU access different data that happens to live on the same page, so the page migrates even though no bytes are actually shared.
Solution: Align and pad data structures to page boundaries so each processor's data lives on its own pages (see the sketch below).
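One way to realize the page-boundary alignment (an illustrative sketch; `kPageAlign`, `SharedBuffers`, and the 64KB granule are our assumptions, chosen to match the migration granularity discussed earlier):

#include <cstdint>

// Assumed migration granule; matches the 64KB chunk size discussed above.
constexpr size_t kPageAlign = 64 * 1024;

// Illustrative container: one region touched only by the CPU, one only by the GPU.
struct SharedBuffers {
    float* cpu_region;
    float* gpu_region;
};

SharedBuffers allocate_padded(size_t cpu_bytes, size_t gpu_bytes) {
    SharedBuffers buf{};
    char* base = nullptr;
    cudaMallocManaged(&base, cpu_bytes + gpu_bytes + 2 * kPageAlign);

    // Start the GPU-owned region on the next granule boundary after the
    // CPU-owned region, so the two regions never share a page.
    uintptr_t gpu_start =
        (reinterpret_cast<uintptr_t>(base) + cpu_bytes + kPageAlign - 1)
        & ~(uintptr_t)(kPageAlign - 1);
    buf.cpu_region = reinterpret_cast<float*>(base);
    buf.gpu_region = reinterpret_cast<float*>(gpu_start);

    // Optional hints: keep each region near its dominant accessor.
    cudaMemAdvise(buf.cpu_region, cpu_bytes,
                  cudaMemAdviseSetPreferredLocation, cudaCpuDeviceId);
    cudaMemAdvise(buf.gpu_region, gpu_bytes,
                  cudaMemAdviseSetPreferredLocation, 0);  // assumes GPU 0
    return buf;
}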
3. Unexpected Migrations
Problem: Hidden CPU access to managed data (for example, a host-side printf or validation check) silently triggers migrations back to the host.
Solution: Be aware of implicit CPU access and re-prefetch data to the GPU before the next kernel (see the sketch below).
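A small sketch of the effect and the countermeasure (the `step_kernel` name and launch sizes are illustrative): the host-side printf faults one page back to the CPU, so the data is re-prefetched before the next kernel launch.

// Hypothetical kernel representing one step of a GPU pipeline.
__global__ void step_kernel(float* data, int n);

void debug_then_continue(float* data, int n, int device) {
    size_t bytes = n * sizeof(float);

    step_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    // Implicit CPU access: this host-side read faults the page holding
    // data[0] back to system memory.
    printf("checkpoint: data[0] = %f\n", data[0]);

    // Restore GPU residency so the next kernel does not start with a
    // page fault and an on-demand migration.
    cudaMemPrefetchAsync(data, bytes, device, 0);
    step_kernel<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();
}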
Conclusion
CUDA Unified Memory simplifies GPU programming by providing a single address space while maintaining performance through intelligent page migration and caching. Understanding the underlying mechanisms—page tables, migration policies, and coherency protocols—enables developers to write efficient code that leverages automatic memory management without sacrificing performance.
The key to success with Unified Memory is understanding access patterns and using the available hints and prefetching mechanisms to guide the runtime's migration decisions.