Understanding NVIDIA Persistence Daemon

Eliminating GPU initialization latency through nvidia-persistenced - a userspace daemon that maintains GPU driver state for optimal startup performance.


Overview

nvidia-persistenced is a userspace daemon that maintains GPU driver state when no client processes are connected. Without it, the NVIDIA kernel module tears down driver state after the last client closes its handle to the GPU, requiring expensive reinitialization when the next application starts. The persistence daemon solves this by keeping a minimal connection open to each GPU, preserving driver initialization state and drastically reducing startup latency for GPU workloads.

The Cold Start Problem

To understand why nvidia-persistenced exists, we must first understand what happens when a GPU application starts on a system without persistence mode enabled. The sequence involves substantial overhead that becomes problematic in production environments.

Measuring the Cold Start Tax

Let's quantify the initialization overhead with actual measurements:

# Disable persistence mode first
$ sudo nvidia-smi -pm 0
Disabled persistence mode for GPU 00000000:01:00.0.

# Ensure no GPU processes running
$ nvidia-smi --query-compute-apps=pid --format=csv,noheader
# (empty output)

# Measure cold start time
$ time nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01    Driver Version: 565.57.01    CUDA Version: 12.7     |
+-----------------------------------------------------------------------------+

real    0m3.247s    ← First call: 3.2 seconds!
user    0m0.012s
sys     0m0.028s

# Immediately run again while driver still loaded
$ time nvidia-smi
# (output omitted)

real    0m0.089s    ← Second call: 89ms (36x faster!)
user    0m0.008s
sys     0m0.012s

# Wait for driver to unload (no activity for ~10 seconds)
$ sleep 15

# Run again - cold start returns
$ time nvidia-smi

real    0m3.156s    ← Cold start overhead again

This 3+ second penalty occurs every time the GPU transitions from idle to active. For workloads that start and stop frequently—batch inference jobs, CI/CD testing, serverless functions, or Kubernetes pods—this overhead becomes unacceptable.

[Figure: system timeline without persistence mode. App 1 starts (python train.py) → ~3-5s of initialization (VBIOS read ~1-2s, memory controller init ~500ms, PCIe link negotiation ~300ms, power management setup ~200ms) → running → app exits → driver teardown, state lost → GPU idle → App 2 starts and pays the full re-initialization cost again. Production impact: 3-5s latency per job start; painful for batch workloads, serverless inference, and Kubernetes pod starts.]

Figure 1: GPU initialization without persistence mode

Architecture and Mechanism

The Persistence Daemon Approach

The nvidia-persistenced daemon solves the cold start problem through an elegant mechanism: it opens each GPU device file (/dev/nvidia0, etc.) and holds them open indefinitely. As long as at least one process has a GPU device file open, the NVIDIA kernel module maintains driver state. The daemon effectively acts as a "dummy client" that prevents the driver from tearing down its initialized state.

[Figure: architecture flow. In user space, nvidia-persistenced runs continuously, open()s all GPU device files and holds the file descriptors forever, preventing unload, while applications (python train.py, docker run ...) connect transiently. Both talk to /dev/nvidia0, /dev/nvidia1, /dev/nvidiactl, and /dev/nvidia-uvm. In kernel space, nvidia.ko stays loaded and GPU hardware state is maintained.]

Figure 2: System architecture with persistence daemon
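
The same trick can be approximated from a shell, which makes the mechanism easy to see. The sketch below is illustrative only: it assumes the /dev/nvidia* device nodes already exist and that you have permission to open them, and it omits everything else the real daemon does (device enumeration, hot-plug handling, talking to the driver).

# Illustrative only - not a replacement for nvidia-persistenced.
# Hold /dev/nvidia0 open on file descriptor 3; while any process keeps a
# GPU device file open, the kernel module keeps its initialized state.
$ exec 3< /dev/nvidia0

# ... driver state stays resident while this shell (and fd 3) lives ...

# Closing the descriptor releases the reference; if it was the last one,
# the driver is free to tear its state down again.
$ exec 3<&-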

What State Is Preserved?

When persistence mode is enabled, the driver maintains several critical states that would otherwise be lost:

  • GPU Firmware State: VBIOS data cached in memory, avoiding expensive reads from GPU ROM
  • Memory Controller Configuration: VRAM timing parameters, ECC settings, memory clock states
  • PCIe Link State: Negotiated link speed/width, power management configuration
  • Power Management: P-state configuration, thermal limits, clock gating settings
  • Display Engine: Display configuration (if displays connected)
  • Error Reporting: PCIe AER (Advanced Error Reporting) state

Note that persistence mode does not preserve:

  • Application-allocated GPU memory (freed on process exit as normal; see the quick check after this list)
  • CUDA contexts (destroyed when application exits)
  • Compute kernels in flight (completed or terminated)
  • GPU register state specific to applications
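
A quick way to confirm the first point on your own machine is sketched below; it is hedged in that ./your_gpu_app is just a placeholder for any CUDA program that allocates VRAM.

# Check that application VRAM is released on exit even with persistence on.
# "./your_gpu_app" is a placeholder for any CUDA program you have handy.
$ nvidia-smi --query-gpu=memory.used --format=csv,noheader   # idle baseline
$ ./your_gpu_app &                                           # allocates some VRAM
$ nvidia-smi --query-gpu=memory.used --format=csv,noheader   # higher while it runs
$ wait %1                                                    # let it finish
$ nvidia-smi --query-gpu=memory.used --format=csv,noheader   # back near baseline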

Daemon Implementation Details

The persistence daemon is a simple, lightweight process. Here's what it does:

// Simplified pseudocode of nvidia-persistenced logic.
// enumerate_gpus(), log_error()/log_info(), and NVIDIA_PERSIST_MODE are
// illustrative placeholders, not real driver symbols.
int main() {
    struct gpu_info {
        int fd;
        char path[256];
    } gpus[MAX_GPUS];

    int num_gpus = enumerate_gpus(gpus);

    // Open each GPU device and hold it forever
    for (int i = 0; i < num_gpus; i++) {
        gpus[i].fd = open(gpus[i].path, O_RDWR);
        if (gpus[i].fd < 0) {
            log_error("Failed to open %s", gpus[i].path);
            continue;
        }

        // Optionally ask the driver to set persistence mode
        ioctl(gpus[i].fd, NVIDIA_PERSIST_MODE, 1);
        log_info("Persistence enabled for %s", gpus[i].path);
    }

    // Sleep forever - just keep the file descriptors open
    while (1) {
        sleep(3600);
        // Daemon wakes periodically to check for new GPUs
        // (hot-plug support)
    }

    return 0;
}

Resource Footprint:

  • Memory: ~2-4 MB per GPU
  • CPU Usage: 0.0% (sleeps most of the time)
  • GPU VRAM: 0 MB (no VRAM allocated)
  • Power Draw: ~5W idle (minimal increase)

The daemon is remarkably lightweight—it merely holds file descriptors open and does nothing else. The cost is negligible, but the benefit is substantial.
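
If you want to verify the footprint on your own host, a quick check looks like the following (selecting the process via pgrep -f is an assumption about how the daemon was launched):

# Resident memory (RSS, in KB), CPU usage, and uptime of the daemon
$ ps -o pid,rss,pcpu,etime,args -p "$(pgrep -o -f nvidia-persistenced)"

# Number of NVIDIA device files it holds open (root needed to read its fds)
$ sudo ls -l /proc/"$(pgrep -o -f nvidia-persistenced)"/fd | grep -c nvidia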

Alternative: Per-GPU Persistence Mode

Instead of running the daemon, you can enable persistence mode directly via nvidia-smi. This approach has the driver keep itself loaded but doesn't require a separate daemon process:

# Enable persistence mode for all GPUs
$ sudo nvidia-smi -pm 1
Enabled persistence mode for GPU 00000000:01:00.0.
Enabled persistence mode for GPU 00000000:02:00.0.

# Check persistence status
$ nvidia-smi --query-gpu=persistence_mode --format=csv
persistence_mode
Enabled
Enabled

# Disable if needed
$ sudo nvidia-smi -pm 0
Disabled persistence mode for GPU 00000000:01:00.0.

The difference: nvidia-smi -pm 1 sets a flag in the kernel driver (what NVIDIA calls legacy persistence mode), whereas nvidia-persistenced maintains open file descriptors from user space. Both achieve the same result: persistent driver state. NVIDIA has deprecated the legacy flag in favor of the daemon, which is more robust across driver reloads and system events and is the preferred choice for production.
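
To see which mechanism is providing persistence on a given host, a small check might look like this (it assumes the systemd unit name used elsewhere in this article):

# Is the daemon running?
$ systemctl is-active nvidia-persistenced
active

# Is persistence reported as enabled, whichever mechanism set it?
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled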

Benefits and Use Cases

Performance Impact

Let's measure the performance difference with a real-world example: starting a CUDA application repeatedly.

Without Persistence:

  • Average startup time: 3.2s
  • First CUDA call: 3.4s
  • Variation: ±0.3s
  • Total for 100 jobs: 320s

With Persistence:

  • Average startup time: 0.09s
  • First CUDA call: 0.12s
  • Variation: ±0.01s
  • Total for 100 jobs: 9s
  • Improvement: 35x faster startup, 97% reduction

[Figure: application startup latency breakdown. Without persistence: GPU driver initialization 2720ms (85%) + CUDA runtime init 320ms + application init 160ms = 3200ms. With persistence: CUDA 320ms + app 160ms = 480ms, since driver initialization is already done. Net: 2720ms (85%) saved, a 6.7× speedup. Persistence mode matters most for batch workloads, Kubernetes, and CI/CD pipelines.]

Figure 3: Application startup latency comparison showing 6.7× speedup with persistence mode
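
To reproduce a comparison like this on your own hardware, a rough benchmarking sketch follows. It times nvidia-smi as a lightweight stand-in for a CUDA application, sleeps between runs so the driver can unload when persistence is off, and assumes bash and bc are available; exact numbers will vary with GPU and driver version.

#!/usr/bin/env bash
# Rough sketch: average "cold" start latency with persistence off vs. on.
set -euo pipefail

measure() {
    local label="$1" runs=5 total=0 start end
    for _ in $(seq "$runs"); do
        sleep 20    # give the driver time to tear down state if persistence is off
        start=$(date +%s.%N)
        nvidia-smi > /dev/null
        end=$(date +%s.%N)
        total=$(echo "$total + ($end - $start)" | bc -l)
    done
    echo "$label: average $(echo "scale=3; $total / $runs" | bc -l)s over $runs runs"
}

sudo nvidia-smi -pm 0 > /dev/null
measure "persistence OFF"

sudo nvidia-smi -pm 1 > /dev/null
measure "persistence ON"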

When Persistence Mode Is Essential

1. Batch Inference Workloads

Production inference systems process thousands of requests per day. If each request spawns a new process (serverless-style), without persistence mode you pay 3+ seconds per request. With persistence, requests start in less than 100ms.

# Example: Video processing pipeline
# Without persistence: 100 videos × 3s   = 300s overhead
# With persistence:    100 videos × 0.1s = 10s overhead
# Savings: 290 seconds = 96.7% reduction

2. Kubernetes GPU Pods

Kubernetes pods start and stop frequently—rolling updates, autoscaling, pod evictions. Each pod creation triggers GPU initialization. Persistence mode eliminates this latency, making GPU pods start as fast as CPU-only pods.

# Pod startup sequence without persistence
[0.0s] Container created
[0.2s] Container started
[0.3s] Application launching
[3.5s] ← GPU initialization (BLOCKING)
[3.8s] Application ready

# With persistence
[0.0s] Container created
[0.2s] Container started
[0.3s] Application launching
[0.4s] Application ready  ← 3s faster!
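
To put numbers on this for your own cluster, one hedged approach is to create a GPU pod however you normally do (for example, with the kubectl run command shown in the Containerized Environments section below) and time how long it takes to become Ready, once with persistence enabled on the node and once without:

# "gpu-test" is a placeholder pod name; create it first with your usual GPU pod spec.
$ time kubectl wait --for=condition=Ready pod/gpu-test --timeout=180s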

3. CI/CD Testing

Continuous integration pipelines run GPU tests frequently. Without persistence, each test suite pays the initialization tax. With persistence, test suites run faster and CI pipelines complete sooner.

4. Multi-Process Distributed Training

Distributed training frameworks (Horovod, PyTorch DDP) spawn multiple processes per GPU. Without persistence, when processes restart between training runs, each process initialization stalls. With persistence, process startup is near-instantaneous.

5. Interactive Development

Data scientists running Jupyter notebooks or frequent Python scripts benefit from fast iteration. Without persistence, every python train.py waits 3 seconds before showing output. With persistence, scripts run immediately.
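
A quick way to feel the difference during development is to time CUDA initialization directly; the snippet below assumes PyTorch is installed, but any CUDA-initializing one-liner works.

# Run once with persistence off and once with it on, then compare.
$ time python3 -c "import torch; torch.cuda.init()"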

When Persistence Mode Is Not Needed

Persistence mode provides minimal benefit in certain scenarios:

  • Long-running training jobs: If your model trains for hours or days without stopping, initialization happens once. The 3-second overhead is negligible compared to hours of training.
  • Single-process workloads: If you run one CUDA application that stays running continuously (e.g., a model server), persistence mode doesn't help—the application itself keeps the driver loaded.
  • Desktop graphics: If using the GPU primarily for display (running a GUI), the display server keeps the driver active. Persistence mode is redundant.
  • Low-frequency batch jobs: If you run GPU jobs only once per day, the 3-second startup overhead is acceptable.

Configuration and Management

Installation

The nvidia-persistenced daemon is typically included with NVIDIA driver packages but may require separate installation on some distributions:

# Ubuntu/Debian
$ sudo apt-get install nvidia-persistenced

# RHEL/CentOS/Rocky
$ sudo yum install nvidia-persistenced

# Check if installed
$ which nvidia-persistenced
/usr/bin/nvidia-persistenced

# Check version
$ nvidia-persistenced --version
nvidia-persistenced 565.57.01

Starting the Daemon

# Start daemon manually (foreground, for testing)
$ sudo nvidia-persistenced --verbose
nvidia-persistenced: started (565.57.01)
nvidia-persistenced: device 0000:01:00.0 - registered
nvidia-persistenced: device 0000:02:00.0 - registered
nvidia-persistenced: persistence mode enabled for all devices

# Start as systemd service (background)
$ sudo systemctl start nvidia-persistenced

# Check status
$ sudo systemctl status nvidia-persistenced
● nvidia-persistenced.service - NVIDIA Persistence Daemon
     Loaded: loaded (/lib/systemd/system/nvidia-persistenced.service; enabled)
     Active: active (running) since Fri 2024-11-01 10:00:00 UTC; 2h 30min ago
   Main PID: 1234 (nvidia-persiste)
      Tasks: 1 (limit: 9457)
     Memory: 2.1M
        CPU: 12ms
     CGroup: /system.slice/nvidia-persistenced.service
             └─1234 /usr/bin/nvidia-persistenced --user nvidia-persistenced

# Enable at boot
$ sudo systemctl enable nvidia-persistenced
Created symlink /etc/systemd/system/multi-user.target.wants/nvidia-persistenced.service

# Verify persistence mode is active
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled
Enabled

Containerized Environments

For Kubernetes or Docker environments, the persistence daemon should run on the host, not inside containers. Containers benefit from the host's persistent driver state:

# On host (VM/bare metal)
$ sudo systemctl enable --now nvidia-persistenced

# Verify from host
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled

# Containers automatically benefit
$ docker run --rm --gpus all nvidia/cuda:12.6.0-base-ubuntu22.04 \
    nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled

# Kubernetes pods also benefit
$ kubectl run gpu-test --rm -it --restart=Never \
    --image=nvidia/cuda:12.6.0-base-ubuntu22.04 \
    --limits=nvidia.com/gpu=1 \
    -- nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
Enabled

Monitoring and Verification

# Check if persistence daemon is running
$ ps aux | grep nvidia-persistenced
nvidia+   1234  0.0  0.0  12345  2048 ?  Ss  10:00  0:00 /usr/bin/nvidia-persistenced

# Check daemon logs
$ journalctl -u nvidia-persistenced -n 50
Nov 01 10:00:00 hostname nvidia-persistenced[1234]: nvidia-persistenced: started (565.57.01)
Nov 01 10:00:00 hostname nvidia-persistenced[1234]: device 0000:01:00.0 - registered
Nov 01 10:00:00 hostname nvidia-persistenced[1234]: device 0000:02:00.0 - registered

# Verify persistence mode from nvidia-smi
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 565.57.01    Driver Version: 565.57.01    CUDA Version: 12.7     |
|-------------------------------+----------------------+----------------------+
| GPU  Name    Persistence-M    | Bus-Id        Disp.A | Volatile Uncorr. ECC |
|   0  RTX 4090           On   | 00000000:01:00.0  On |                  Off |
|   1  RTX 4090           On   | 00000000:02:00.0 Off |                  Off |
+-------------------------------+----------------------+----------------------+

# Check file descriptors held by daemon
$ sudo lsof -p $(pgrep nvidia-persistenced) | grep /dev/nvidia
nvidia-pe  1234  nvidia+  3u  CHR  195,0    /dev/nvidia0
nvidia-pe  1234  nvidia+  4u  CHR  195,1    /dev/nvidia1
nvidia-pe  1234  nvidia+  5u  CHR  195,255  /dev/nvidiactl

Performance Benchmarks

Real-world benchmarks showing persistence mode impact across different workload types:

Workload Type               | Without Persistence               | With Persistence            | Speedup
----------------------------|-----------------------------------|-----------------------------|---------------
Single CUDA kernel launch   | 3.2s first call, 0.1s subsequent  | 0.09s consistently          | 35x first call
PyTorch model load          | 4.5s (includes initialization)    | 1.2s (model load only)      | 3.8x
TensorFlow session start    | 5.1s                              | 1.7s                        | 3x
Kubernetes pod ready        | 8.3s (container + GPU init)       | 5.1s (container only)       | 1.6x
100 batch inference jobs    | 320s (100 × 3.2s overhead)        | 9s (100 × 0.09s overhead)   | 35x
nvidia-smi first call       | 3.1s                              | 0.08s                       | 38x

Conclusion

The NVIDIA persistence daemon represents an elegant solution to a fundamental performance problem: the expensive cold-start initialization of GPU driver state. By maintaining minimal state—merely holding device file descriptors open—the daemon eliminates seconds of latency from every GPU application startup.

For production GPU infrastructure, enabling persistence mode is a no-brainer optimization. The resource cost is negligible: ~2-4MB of RAM per GPU and minimal power draw. The performance benefit is substantial: 35x faster startup for cold launches, transforming GPU workload scheduling from sluggish to snappy.

The daemon's simplicity is its strength. It doesn't perform complex operations or maintain large state—it simply prevents the driver from tearing down initialized state. This architectural choice makes it lightweight, stable, and compatible with all NVIDIA workloads from training to inference to graphics.

Whether you're running Kubernetes GPU pods, batch inference pipelines, or interactive development environments, persistence mode eliminates a significant source of latency. Enable it on your GPU servers, verify it's working, and enjoy the performance improvement. The three seconds saved per startup may seem small, but multiplied across thousands of job starts per day, it transforms infrastructure efficiency.

Quick Start for Production

# Install, enable, and forget
$ sudo apt-get install nvidia-persistenced
$ sudo systemctl enable --now nvidia-persistenced
$ nvidia-smi --query-gpu=persistence_mode --format=csv,noheader

# Verify fast startup
$ time nvidia-smi > /dev/null

real    0m0.089s  ← Success!

If you found this explanation helpful, consider sharing it with others.
