CUDA Multi-Process Service (MPS)
Understand NVIDIA CUDA Multi-Process Service (MPS), a client-server architecture that enables multiple CUDA processes to share a single GPU context for concurrent kernel execution and better utilization.
What is CUDA Multi-Process Service (MPS)?
CUDA Multi-Process Service (MPS) is a client-server architecture that enables multiple CUDA processes to share a single GPU context, allowing them to submit work concurrently to the GPU and achieve better utilization. Without MPS, CUDA contexts from different processes are time-sliced sequentially, leading to GPU underutilization when individual processes launch small kernels.
MPS eliminates this overhead by multiplexing work from multiple clients through a single server process that manages a shared GPU context.
The Problem: GPU Underutilization
Modern NVIDIA GPUs contain thousands of CUDA cores capable of executing work from multiple kernels simultaneously. However, the default CUDA execution model creates isolation between processes by giving each its own exclusive GPU context.
When multiple processes try to use the GPU, the driver time-slices these contexts—meaning only one process can submit work at a time, and context switches incur significant overhead.
Time-Slicing Issues
Consider a scenario where you have multiple small inference services running—each launches CUDA kernels that use only 20% of the GPU's streaming multiprocessors (SMs):
- Process A runs its kernel using 20% of the GPU → 80% of SMs idle
- Context switch overhead (~10-100 microseconds)
- Process B runs its kernel using 20% of the GPU → 80% of SMs idle
- Context switch overhead
- Process C runs → more idle time
The GPU spends most of its time either idle or switching contexts. With MPS, all three processes submit work concurrently through a shared context, and the GPU scheduler assigns them to different SMs simultaneously—achieving 60% utilization instead of 20%.
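To make the numbers concrete, here is a minimal CUDA sketch of the kind of under-sized launch described above. The kernel, the 20-block grid, and the SM counts are illustrative assumptions, not measurements:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Stand-in for a small inference kernel: each thread does trivial work.
__global__ void smallInferenceKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int blocks = 20, threads = 256;  // assumed small workload
    const int n = blocks * threads;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    // On a GPU with ~100 SMs, 20 blocks occupy roughly 20% of the device.
    // The remaining SMs stay idle unless another process can submit work
    // concurrently, which is exactly what MPS enables.
    smallInferenceKernel<<<blocks, threads>>>(d, n);
    cudaDeviceSynchronize();

    printf("device has %d SMs; launched %d blocks\n",
           prop.multiProcessorCount, blocks);
    cudaFree(d);
    return 0;
}
```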
MPS Architecture
MPS operates through a client-server model with three key components:
1. MPS Control Daemon
- Binary: nvidia-cuda-mps-control
- Role: Management interface
- Functions:
  - Start/stop MPS servers
  - Configure per-device settings
  - Handle client connections
  - Manage pipe directories
2. MPS Server
- Binary: nvidia-cuda-mps-server
- Role: GPU context owner
- Functions:
  - Create the shared GPU context
  - Multiplex CUDA calls from multiple clients
  - Submit kernels to the GPU
  - Manage device memory
3. Client Library
- Library: libcuda.so (MPS-aware)
- Role: Transparent interception
- Functions:
  - Intercept CUDA API calls
  - Route to MPS server via named pipes
  - Handle synchronization
  - Manage client state
How MPS Works
When a CUDA application runs under MPS, the execution flow changes fundamentally:
1. Application Launch
The client process starts and loads the CUDA driver library (libcuda.so). If the MPS environment variables are set, the library detects MPS mode.
2. MPS Connection
The CUDA library connects to the MPS control daemon via named pipes in /tmp/nvidia-mps/ (or the path specified by CUDA_MPS_PIPE_DIRECTORY). The control daemon authenticates the client and provides a connection to the appropriate MPS server.
3. Context Initialization
Instead of creating its own GPU context, the client receives a handle to the shared context managed by the MPS server. This is transparent to the application—it still uses standard CUDA API calls.
4. Kernel Launch
When the application calls cudaLaunchKernel(), the CUDA library serializes the kernel parameters and sends them through the pipe to the MPS server. The server queues the work and submits it to the GPU using its shared context.
5. Concurrent Execution
The GPU's hardware scheduler receives kernels from multiple clients (via the single MPS server) and distributes them across available SMs. Kernels from different clients can execute simultaneously if resources permit.
6. Synchronization
When a client calls cudaDeviceSynchronize(), it waits on its own submitted work. The MPS server tracks which kernels belong to which client and signals completion appropriately.
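Because the interception happens inside libcuda.so, a client needs no MPS-specific code. The sketch below is an ordinary CUDA program; the annotated calls correspond to steps 3, 4, and 6 above when an MPS daemon is running, and the same binary runs unchanged without MPS. The kernel and sizes are illustrative assumptions:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));           // under MPS: uses the shared context (step 3)
    cudaMemset(d, 0, n * sizeof(float));

    scale<<<(n + 255) / 256, 256>>>(d, 2.0f, n); // launch routed through the MPS server (step 4)
    cudaError_t err = cudaDeviceSynchronize();   // waits only on this client's work (step 6)

    printf("done: %s\n", cudaGetErrorString(err));
    cudaFree(d);
    return 0;
}
```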
Benefits of MPS
Reduced Overhead
A single GPU context eliminates context-switch costs (~10-100 microseconds per switch)
Better GPU Utilization
Multiple small kernels can fill the GPU instead of leaving it mostly idle
Transparency
Applications require no code changes—MPS operates at the CUDA driver level
Simplified Management
Control daemon provides centralized administration
Improved Isolation (Volta+)
Hardware improvements on Volta and newer architectures provide better isolation between clients with Address Space Isolation (ASI)
Memory Management
Pre-Volta (Pascal and Earlier): Limited Isolation
On GPUs before Volta (e.g., GTX 1080, Tesla P100), MPS provides minimal process isolation:
- All clients share the same virtual address space
 - No hardware memory protection
 - A buggy client can corrupt another client's GPU memory
 - Error propagation affects all clients
 
Recommendation: Only use pre-Volta MPS for trusted workloads or development environments, not multi-tenant production systems.
Volta and Later: Improved Isolation
Starting with the Volta architecture (e.g., Tesla V100) and continuing in later architectures such as Turing (e.g., RTX 2080), NVIDIA introduced hardware-level improvements:
- Address Space Isolation (ASI): Each client gets its own GPU virtual address space
- Memory Protection: Hardware prevents clients from accessing each other's memory
- Fault Isolation: GPU faults in one client don't crash others (with caveats)
- Better QoS: Improved scheduling fairness between clients
- Compute Preemption: Long-running kernels can be preempted to improve responsiveness
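One practical way to check which side of this divide a device falls on is its compute capability: Volta corresponds to compute capability 7.0. The following sketch uses that heuristic; treating major version 7 or above as "Volta+" is an assumption about deployment policy, not an MPS API:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // Compute capability 7.x and above (Volta, Turing, Ampere, ...) get
    // per-client GPU address spaces and memory protection under MPS.
    if (prop.major >= 7) {
        printf("%s: Volta+ MPS isolation available\n", prop.name);
    } else {
        printf("%s: pre-Volta MPS (shared address space); trusted workloads only\n",
               prop.name);
    }
    return 0;
}
```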
Starting MPS
Basic Setup
```bash
# Ensure the NVIDIA driver is loaded
nvidia-smi

# Set environment variables (optional; the pipe/log paths shown are the defaults)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Verify MPS is running
ps aux | grep mps
# You should see:
# nvidia-cuda-mps-control
# nvidia-cuda-mps-server (spawned once the first client connects)
```
Configuration Options
Key Environment Variables:
- CUDA_VISIBLE_DEVICES: Which GPUs are available to MPS
- CUDA_MPS_PIPE_DIRECTORY: Location of communication pipes (default: /tmp/nvidia-mps)
- CUDA_MPS_LOG_DIRECTORY: Location of log files (default: /var/log/nvidia-mps)
- CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: Max % of device threads per client (0-100, default: 100)
- CUDA_DEVICE_MAX_CONNECTIONS: Max concurrent streams per device (default: 8)
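CUDA_MPS_ACTIVE_THREAD_PERCENTAGE can also be tightened per client: NVIDIA's documentation notes that a client process may set it in its own environment before creating its CUDA context. A minimal sketch, assuming a POSIX environment, with the 25% figure chosen arbitrarily:

```cpp
#include <cstdlib>
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Must happen before this process's first CUDA call; under MPS it caps
    // this client at roughly a quarter of the device's threads.
    setenv("CUDA_MPS_ACTIVE_THREAD_PERCENTAGE", "25", 1);

    // The first CUDA call creates the context (and the MPS client connection).
    cudaFree(0);

    printf("client context created with a 25%% active-thread cap under MPS\n");
    return 0;
}
```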
Interactive Control
```bash
# Interactive mode
nvidia-cuda-mps-control
nvidia-mps> get_server_list
nvidia-mps> get_device_client_list 0
nvidia-mps> set_default_active_thread_percentage 50
nvidia-mps> quit

# Non-interactive mode
echo "get_server_list" | nvidia-cuda-mps-control

# Gracefully stop MPS (quit shuts down the daemon)
echo "quit" | nvidia-cuda-mps-control
```
MPS vs Time-Slicing vs MIG
Time-Slicing (Default)
- Isolation: Full process isolation
- Utilization: Poor for small workloads
- Overhead: High context switch cost
- Use Case: Single large process or strict isolation needs
MPS (Multi-Process Service)
- Isolation: Limited on pre-Volta, good on Volta+
- Utilization: Excellent for small concurrent workloads
- Overhead: Minimal
- Use Case: Multiple small inference services, trusted multi-tenant
MIG (Multi-Instance GPU)
- Isolation: Hardware-enforced partitioning
- Utilization: Good but partitioned
- Overhead: None
- Use Case: Strict multi-tenant isolation (Ampere and newer only)
When to Use MPS
Ideal Use Cases
✅ Multiple small inference services sharing a GPU
✅ MPI applications with multiple ranks sharing a GPU
✅ Microservices architecture with GPU workloads
✅ Development environments with multiple users
✅ Container orchestration (Kubernetes with NVIDIA device plugin)
Not Recommended For
❌ Single large training job (no benefit)
❌ Untrusted multi-tenant on pre-Volta GPUs
❌ Applications requiring strict QoS guarantees (use MIG instead)
❌ Workloads with large memory allocations competing for space
Best Practices
- Use Volta+ GPUs for production multi-tenant scenarios
- Set resource limits via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to prevent monopolization
- Monitor via logs in CUDA_MPS_LOG_DIRECTORY
- Use a systemd service for production deployments
- Test thoroughly before production; behavior varies by workload
- Consider MIG for strict isolation requirements on Ampere+
Summary
CUDA Multi-Process Service (MPS) is a powerful feature for improving GPU utilization when running multiple small CUDA processes. By sharing a single GPU context through a client-server architecture, MPS eliminates context switching overhead and enables concurrent kernel execution across processes.
While MPS significantly improves utilization, understanding its isolation characteristics—especially the differences between pre-Volta and Volta+ architectures—is crucial for production deployments.
