CUDA Multi-Process Service (MPS)

Understand NVIDIA CUDA Multi-Process Service (MPS), a client-server architecture that enables multiple CUDA processes to share a single GPU context for concurrent kernel execution and better utilization.

What is CUDA Multi-Process Service (MPS)?

CUDA Multi-Process Service (MPS) is a client-server architecture that enables multiple CUDA processes to share a single GPU context, allowing them to submit work concurrently to the GPU and achieve better utilization. Without MPS, CUDA contexts from different processes are time-sliced sequentially, leading to GPU underutilization when individual processes launch small kernels.

MPS eliminates this overhead by multiplexing work from multiple clients through a single server process that manages a shared GPU context.

The Problem: GPU Underutilization

Modern NVIDIA GPUs contain thousands of CUDA cores capable of executing work from multiple kernels simultaneously. However, the default CUDA execution model creates isolation between processes by giving each its own exclusive GPU context.

When multiple processes try to use the GPU, the driver time-slices these contexts—meaning only one process can submit work at a time, and context switches incur significant overhead.

Time-Slicing Issues

Consider a scenario where you have multiple small inference services running—each launches CUDA kernels that use only 20% of the GPU's streaming multiprocessors (SMs):

  • Process A runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead (~10-100 microseconds)
  • Process B runs its kernel using 20% of GPU → 80% of SMs idle
  • Context switch overhead
  • Process C runs → more idle time

The GPU spends most of its time either idle or switching contexts. With MPS, all three processes submit work concurrently through a shared context, and the GPU scheduler assigns them to different SMs simultaneously—achieving 60% utilization instead of 20%.

MPS Architecture

[Figure: MPS Client-Server Architecture. Client processes (e.g., an inference service, a training job, and a batch processor) call into the MPS-aware libcuda.so, which forwards CUDA calls over named pipes in /tmp/nvidia-mps (control, server_XXX, log_XXX) to the nvidia-cuda-mps-control daemon; the daemon spawns nvidia-cuda-mps-server, which owns the shared GPU context, multiplexes client work, and submits it to the GPU's SMs. Without MPS: three separate contexts, time-sliced execution, high context-switch overhead. With MPS: one shared context, concurrent kernel execution, minimal scheduling overhead.]

MPS operates through a client-server model with three key components:

1. MPS Control Daemon

  • Binary: nvidia-cuda-mps-control
  • Role: Management interface
  • Functions:
    • Start/stop MPS servers
    • Configure per-device settings
    • Handle client connections
    • Manage pipe directories

2. MPS Server

  • Binary: nvidia-cuda-mps-server
  • Role: GPU context owner
  • Functions:
    • Create shared GPU context
    • Multiplex CUDA calls from multiple clients
    • Submit kernels to GPU
    • Manage device memory

3. Client Library

  • Library: libcuda.so (MPS-aware)
  • Role: Transparent interception
  • Functions:
    • Intercept CUDA API calls
    • Route to MPS server via named pipes
    • Handle synchronization
    • Manage client state

How MPS Works

When a CUDA application runs under MPS, the execution flow changes fundamentally:

1. Application Launch

The client process starts and loads the CUDA driver library (libcuda.so). If an MPS control daemon is reachable through the pipe directory (the default, or the path in CUDA_MPS_PIPE_DIRECTORY), the library enters MPS mode.

2. MPS Connection

The CUDA library connects to the MPS control daemon via named pipes in /tmp/nvidia-mps/ (or the path specified by CUDA_MPS_PIPE_DIRECTORY). The control daemon authenticates the client and hands it a connection to the appropriate MPS server.
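
A quick way to confirm this communication path is to look inside the pipe and log directories while MPS is running; the paths below use the defaults, and the exact file names can vary by driver version:

# The control pipe lives in CUDA_MPS_PIPE_DIRECTORY (default /tmp/nvidia-mps)
ls -l ${CUDA_MPS_PIPE_DIRECTORY:-/tmp/nvidia-mps}

# Client connections and server start-up events are recorded in the MPS logs
tail ${CUDA_MPS_LOG_DIRECTORY:-/var/log/nvidia-mps}/control.log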

3. Context Initialization

Instead of creating its own GPU context, the client receives a handle to the shared context managed by the MPS server. This is transparent to the application—it still uses standard CUDA API calls.

4. Kernel Launch

When the application calls cudaLaunchKernel(), the CUDA library serializes the kernel parameters and sends them through the pipe to the MPS server. The server queues the work and submits it to the GPU using its shared context.

5. Concurrent Execution

The GPU's hardware scheduler receives kernels from multiple clients (via the single MPS server) and distributes them across available SMs. Kernels from different clients can execute simultaneously if resources permit.
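
A rough way to observe this is to launch several clients while MPS is running and watch aggregate SM utilization. In the sketch below, ./small_infer and its flags are placeholders for any small CUDA workload:

# Three independent CUDA client processes; under MPS their kernels are
# multiplexed through the shared context and can occupy different SMs at once.
./small_infer --model a.onnx &
./small_infer --model b.onnx &
./small_infer --model c.onnx &

# Watch SM utilization while they run; repeating the experiment with MPS
# stopped shows the effect of falling back to time-sliced contexts.
nvidia-smi dmon -s u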

6. Synchronization

When a client calls cudaDeviceSynchronize(), it waits on its own submitted work. The MPS server tracks which kernels belong to which client and signals completion appropriately.

Benefits of MPS

Reduced Overhead

Single GPU context eliminates context switch costs (on the order of 10-100 microseconds per switch)

Better GPU Utilization

Multiple small kernels can fill the GPU instead of leaving it mostly idle

Transparency

Applications require no code changes—MPS operates at the CUDA driver level

Simplified Management

Control daemon provides centralized administration

Improved Isolation (Volta+)

Hardware improvements on Volta and newer architectures provide better isolation between clients with Address Space Isolation (ASI)

Memory Management

Pre-Volta (Pascal and Earlier): Limited Isolation

On GPUs before Volta (e.g., GTX 1080, Tesla P100), MPS provides minimal process isolation:

  • All clients share the same virtual address space
  • No hardware memory protection
  • A buggy client can corrupt another client's GPU memory
  • Error propagation affects all clients

Recommendation: Only use pre-Volta MPS for trusted workloads or development environments, not multi-tenant production systems.

Volta and Later: Improved Isolation

Starting with the Volta architecture (Tesla V100) and carried forward in later generations (e.g., Turing RTX 2080), NVIDIA introduced hardware-level improvements:

  • Address Space Isolation (ASI): Each client gets its own GPU virtual address space
  • Memory Protection: Hardware prevents clients from accessing each other's memory
  • Fault Isolation: GPU faults in one client don't crash others (with caveats)
  • Better QoS: Improved scheduling fairness between clients
  • Compute Preemption: Long-running kernels can be preempted to improve responsiveness
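
A practical way to tell which side of this divide a GPU falls on is its compute capability: Volta corresponds to 7.0, so anything reporting 7.0 or higher gets the improved isolation. On reasonably recent drivers this can be queried directly:

# Prints e.g. "Tesla V100-SXM2-16GB, 7.0"
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader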

Starting MPS

Basic Setup

# Ensure the NVIDIA driver is loaded
nvidia-smi

# Set environment variables (optional, has defaults)
export CUDA_VISIBLE_DEVICES=0
export CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
export CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps

# Start the MPS control daemon
nvidia-cuda-mps-control -d

# Verify MPS is running
ps aux | grep mps

# You should see nvidia-cuda-mps-control immediately;
# nvidia-cuda-mps-server appears once the first client connects.
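
NVIDIA's MPS documentation also commonly recommends putting the GPU into exclusive-process compute mode before starting the daemon, so client processes cannot bypass MPS by creating their own contexts. A minimal sketch:

# Restrict device 0 to a single compute process (the MPS server);
# run this before starting nvidia-cuda-mps-control.
nvidia-smi -i 0 -c EXCLUSIVE_PROCESS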

Configuration Options

Key Environment Variables:

  • CUDA_VISIBLE_DEVICES: Which GPUs are available to MPS
  • CUDA_MPS_PIPE_DIRECTORY: Location of communication pipes (default: /tmp/nvidia-mps)
  • CUDA_MPS_LOG_DIRECTORY: Location of log files (default: /var/log/nvidia-mps)
  • CUDA_MPS_ACTIVE_THREAD_PERCENTAGE: Max % of device threads each client may use (0-100, default: 100; see the example after this list)
  • CUDA_DEVICE_MAX_CONNECTIONS: Number of hardware work queues between the host and each device (default: 8)
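
For example, the active-thread limit can be applied per client by setting the variable in that client's environment; the service names below are placeholders:

# Cap a latency-sensitive client at roughly 30% of the device's threads
# while a batch job keeps the default 100%.
CUDA_MPS_ACTIVE_THREAD_PERCENTAGE=30 ./latency_sensitive_service &
./bulk_batch_job &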

Interactive Control

# Interactive mode
nvidia-cuda-mps-control
nvidia-mps> get_server_list
nvidia-mps> get_device_client_list 0
nvidia-mps> set_default_active_thread_percentage 50
nvidia-mps> quit

# Non-interactive mode
echo "get_server_list" | nvidia-cuda-mps-control

# Gracefully stop MPS
echo "quit" | nvidia-cuda-mps-control

MPS vs Time-Slicing vs MIG

Time-Slicing (Default)

  • Isolation: Full process isolation
  • Utilization: Poor for small workloads
  • Overhead: High context switch cost
  • Use Case: Single large process or strict isolation needs

MPS (Multi-Process Service)

  • Isolation: Limited on pre-Volta, good on Volta+
  • Utilization: Excellent for small concurrent workloads
  • Overhead: Minimal
  • Use Case: Multiple small inference services, trusted multi-tenant

MIG (Multi-Instance GPU)

  • Isolation: Hardware-enforced partitioning
  • Utilization: Good but partitioned
  • Overhead: None
  • Use Case: Strict multi-tenant isolation (Ampere/Hopper only)

When to Use MPS

Ideal Use Cases

  • Multiple small inference services sharing a GPU
  • MPI applications that place multiple ranks on each GPU
  • Microservices architectures with GPU workloads
  • Development environments with multiple users
  • Container orchestration (Kubernetes with the NVIDIA device plugin)

Poor Fits

  • A single large training job (MPS adds no benefit)
  • Untrusted multi-tenant workloads on pre-Volta GPUs
  • Applications requiring strict QoS guarantees (use MIG instead)
  • Workloads whose large memory allocations compete for device memory

Best Practices

  1. Use Volta+ GPUs for production multi-tenant scenarios
  2. Set resource limits via CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to prevent monopolization
  3. Monitor via logs in CUDA_MPS_LOG_DIRECTORY
  4. Use a systemd service for production deployments (a minimal unit sketch follows this list)
  5. Test thoroughly before production—behavior varies by workload
  6. Consider MIG for strict isolation requirements on Ampere+
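
The systemd unit mentioned in item 4 could look roughly like the sketch below; the install path, environment values, and service options are assumptions to adapt for your distribution:

# Create a unit that starts the MPS control daemon at boot and stops it cleanly.
sudo tee /etc/systemd/system/nvidia-mps.service > /dev/null <<'EOF'
[Unit]
Description=NVIDIA CUDA MPS control daemon

[Service]
Type=forking
Environment=CUDA_MPS_PIPE_DIRECTORY=/tmp/nvidia-mps
Environment=CUDA_MPS_LOG_DIRECTORY=/var/log/nvidia-mps
ExecStart=/usr/bin/nvidia-cuda-mps-control -d
ExecStop=/bin/sh -c 'echo quit | /usr/bin/nvidia-cuda-mps-control'
Restart=on-failure

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-mps.service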

Summary

CUDA Multi-Process Service (MPS) is a powerful feature for improving GPU utilization when running multiple small CUDA processes. By sharing a single GPU context through a client-server architecture, MPS eliminates context switching overhead and enables concurrent kernel execution across processes.

While MPS significantly improves utilization, understanding its isolation characteristics—especially the differences between pre-Volta and Volta+ architectures—is crucial for production deployments.
