CUDA Multi-Process Service (MPS)

Understand NVIDIA CUDA Multi-Process Service (MPS), a client-server architecture that enables multiple CUDA processes to share a single GPU context for concurrent kernel execution and better utilization.

What is CUDA Multi-Process Service (MPS)?

CUDA Multi-Process Service (MPS) is a client-server architecture that enables multiple CUDA processes to share a single GPU context, allowing them to submit work concurrently to the GPU and achieve better utilization. Without MPS, CUDA contexts from different processes are time-sliced sequentially, leading to GPU underutilization when individual processes launch small kernels.

MPS eliminates this overhead by multiplexing work from multiple clients through a single server process that manages a shared GPU context.

The Problem: GPU Underutilization

Modern NVIDIA GPUs contain thousands of CUDA cores capable of executing work from multiple kernels simultaneously. However, the default CUDA execution model creates isolation between processes by giving each its own exclusive GPU context.

When multiple processes try to use the GPU, the driver time-slices these contexts—meaning only one process can submit work at a time, and context switches incur significant overhead.

Time-Slicing Issues

Consider a scenario where you have multiple small inference services running—each launches CUDA kernels that use only 20% of the GPU's streaming multiprocessors (SMs):

  • Process A runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead (~10-100 microseconds)
  • Process B runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead
  • Process C runs → more idle time

The GPU spends most of its time either idle or switching contexts. With MPS, all three processes submit work concurrently through a shared context, and the GPU scheduler assigns them to different SMs simultaneously—achieving 60% utilization instead of 20%.

MPS Architecture

[Figure: MPS Client-Server Architecture. Client processes (e.g., an inference service, a training job, and a batch processor) call into the MPS-aware libcuda.so, which forwards CUDA calls over named pipes in /tmp/nvidia-mps (control, server_XXX, log_XXX) to the nvidia-cuda-mps-control daemon; the daemon spawns nvidia-cuda-mps-server, which owns the shared GPU context, multiplexes client work, and submits it to the GPU's SMs. Without MPS: three separate contexts, time-sliced execution, high context-switch overhead. With MPS: one shared context, concurrent kernel execution, minimal scheduling overhead.]

MPS operates through a client-server model with three key components:

1. MPS Control Daemon

  • Binary: nvidia-cuda-mps-control
  • Role: Management interface
  • Functions:
    • Start/stop MPS servers
    • Configure per-device settings
    • Handle client connections
    • Manage pipe directories

2. MPS Server

  • Binary: nvidia-cuda-mps-server
  • Role: GPU context owner
  • Functions:
    • Create shared GPU context
    • Multiplex CUDA calls from multiple clients
    • Submit kernels to GPU
    • Manage device memory

3. Client Library

  • Library: libcuda.so (MPS-aware)
  • Role: Transparent interception
  • Functions:
    • Intercept CUDA API calls
    • Route to MPS server via named pipes
    • Handle synchronization
    • Manage client state

How MPS Works

When a CUDA application runs under MPS, the execution flow changes fundamentally:

1. Application Launch

The client process starts and loads the CUDA driver library (libcuda.so). If an MPS control daemon is reachable through the pipe directory (the default, or the path in CUDA_MPS_PIPE_DIRECTORY), the library enters MPS mode.

2. MPS Connection

The CUDA library connects to the MPS control daemon via named pipes in /tmp/nvidia-mps/ (or the path specified by CUDA_MPS_PIPE_DIRECTORY). The control daemon authenticates the client and hands it a connection to the appropriate MPS server.
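
A quick way to confirm this communication path is to look inside the pipe and log directories while MPS is running; the paths below use the defaults, and the exact file names can vary by driver version:

# The control pipe lives in CUDA_MPS_PIPE_DIRECTORY (default /tmp/nvidia-mps)
ls -l ${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}

# Client connections and server start-up events are recorded in the MPS logs
tail ${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}/control.log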

3. Context Initialization

Instead of creating its own GPU context, the client receives a handle to the shared context managed by the MPS server. This is transparent to the application—it still uses standard CUDA API calls.

4. Kernel Launch

When the application calls cudaLaunchKernel(), the CUDA library serializes the kernel parameters and sends them through the pipe to the MPS server. The server queues the work and submits it to the GPU using its shared context.

5. Concurrent Execution

The GPU's hardware scheduler receives kernels from multiple clients (via the single MPS server) and distributes them across available SMs. Kernels from different clients can execute simultaneously if resources permit.
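
A rough way to observe this is to launch several clients while MPS is running and watch aggregate SM utilization. In the sketch below, ./small_infer and its flags are placeholders for any small CUDA workload:

# Three independent CUDA client processes; under MPS their kernels are
# multiplexed through the shared context and can occupy different SMs at once.
./small_infer --model a.onnx &
./small_infer --model b.onnx &
./small_infer --model c.onnx &

# Watch SM utilization while they run; repeating the experiment with MPS
# stopped shows the effect of falling back to time-sliced contexts.
nvidia-smi dmon -s u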

6. Synchronization

When a client calls cudaDeviceSynchronize(), it waits on its own submitted work. The MPS server tracks which kernels belong to which client and signals completion appropriately.

Benefits of MPS

Reduced Overhead

Single GPU context eliminates context switch costs (on the order of 10-100 microseconds per switch)

Better GPU Utilization

Multiple small kernels can fill the GPU instead of leaving it mostly idle

Transparency

Applications require no code changes—MPS operates at the CUDA driver level

Simplified Management

Control daemon provides centralized administration

Improved Isolation (Volta+)

Hardware improvements on Volta and newer architectures provide better isolation between clients with Address Space Isolation (ASI)

Memory Management

Pre-Volta (Pascal and Earlier): Limited Isolation

On GPUs before Volta (e.g., GTX 1080, Tesla P100), MPS provides minimal process isolation:

  • All clients share the same virtual address space
  • No hardware memory protection
  • A buggy client can corrupt another client's GPU memory
  • Error propagation affects all clients

Recommendation: Only use pre-Volta MPS for trusted workloads or development environments, not multi-tenant production systems.

Volta and Later: Improved Isolation

Starting with the Volta architecture (Tesla V100) and carried forward in later generations (e.g., Turing RTX 2080), NVIDIA introduced hardware-level improvements:

  • Address Space Isolation (ASI): Each client gets its own GPU virtual address space
  • Memory Protection: Hardware prevents clients from accessing each other's memory
  • Fault Isolation: GPU faults in one client don't crash others (with caveats)
  • Better QoS: Improved scheduling fairness between clients
  • Compute Preemption: Long-running kernels can be preempted to improve responsiveness
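
A practical way to tell which side of this divide a GPU falls on is its compute capability: Volta corresponds to 7.0, so anything reporting 7.0 or higher gets the improved isolation. On reasonably recent drivers this can be queried directly:

# Prints e.g. "Tesla V100-SXM2-16GB, 7.0"
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader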

Starting MPS

Basic Setup

# Ensure the NVIDIA driver is loaded
nvidia-smi

# Set environment variables (optional, has defaults)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Verify MPS is running
ps aux | grep mps

# You should see nvidia-cuda-mps-control immediately;
# nvidia-cuda-mps-server appears once the first client connects.
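
NVIDIA's MPS documentation also commonly recommends putting the GPU into exclusive-process compute mode before starting the daemon, so client processes cannot bypass MPS by creating their own contexts. A minimal sketch:

# Restrict device 0 to a single compute process (the MPS server);
# run this before starting nvidia-cuda-mps-control.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS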

Configuration Options

Key Environment Variables:

  • CUDA_VISIBLE_DEVICES: Which GPUs are available to MPS
  • CUDA_MPS_PIPE_DIRECTORY: Location of communication pipes (default: /tmp/nvidia-mps)
  • CUDA_MPS_LOG_DIRECTORY: Location of log files (default: /var/log/nvidia-mps)
  • CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: Max % of device threads each client may use (0-100, default: 100; see the example after this list)
  • CUDA_DEVICE_MAX_CONNECTIONS: Number of hardware work queues between the host and each device (default: 8)
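
For example, the active-thread limit can be applied per client by setting the variable in that client's environment; the service names below are placeholders:

# Cap a latency-sensitive client at roughly 30% of the device's threads
# while a batch job keeps the default 100%.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 ./latency_sensitive_service &
./bulk_batch_job &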

Interactive Control

# Interactive mode
nvidia-cuda-mps-control
nvidia-mps> get_server_list
nvidia-mps> get_device_client_list 0
nvidia-mps> set_default_active_thread_percentage 50
nvidia-mps> quit

# Non-interactive mode
echo "get_server_list" | nvidia-cuda-mps-control

# Gracefully stop MPS
echo "quit" | nvidia-cuda-mps-control

MPS vs Time-Slicing vs MIG

Time-Slicing (Default)

  • Isolation: Full process isolation
  • Utilization: Poor for small workloads
  • Overhead: High context switch cost
  • Use Case: Single large process or strict isolation needs

MPS (Multi-Process Service)

  • Isolation: Limited on pre-Volta, good on Volta+
  • Utilization: Excellent for small concurrent workloads
  • Overhead: Minimal
  • Use Case: Multiple small inference services, trusted multi-tenant

MIG (Multi-Instance GPU)

  • Isolation: Hardware-enforced partitioning
  • Utilization: Good but partitioned
  • Overhead: None
  • Use Case: Strict multi-tenant isolation (Ampere/Hopper only)

When to Use MPS

Ideal Use Cases

  • Multiple small inference services sharing a GPU
  • MPI applications that place multiple ranks on each GPU
  • Microservices architectures with GPU workloads
  • Development environments with multiple users
  • Container orchestration (Kubernetes with the NVIDIA device plugin)

Poor Fits

  • A single large training job (MPS adds no benefit)
  • Untrusted multi-tenant workloads on pre-Volta GPUs
  • Applications requiring strict QoS guarantees (use MIG instead)
  • Workloads whose large memory allocations compete for device memory

Best Practices

  1. Use Volta+ GPUs for production multi-tenant scenarios
  2. Set resource limits via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to prevent monopolization
  3. Monitor via logs in CUDA_MPS_LOG_DIRECTORY
  4. Use a systemd service for production deployments (a minimal unit sketch follows this list)
  5. Test thoroughly before production—behavior varies by workload
  6. Consider MIG for strict isolation requirements on Ampere+
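
The systemd unit mentioned in item 4 could look roughly like the sketch below; the install path, environment values, and service options are assumptions to adapt for your distribution:

# Create a unit that starts the MPS control daemon at boot and stops it cleanly.
sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA CUDA MPS control daemon

[Service]
Type=forking
Environment=CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
Environment=CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps.service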

Summary

CUDA Multi-Process Service (MPS) is a powerful feature for improving GPU utilization when running multiple small CUDA processes. By sharing a single GPU context through a client-server architecture, MPS eliminates context switching overhead and enables concurrent kernel execution across processes.

While MPS significantly improves utilization, understanding its isolation characteristics—especially the differences between pre-Volta and Volta+ architectures—is crucial for production deployments.
