Understanding NVIDIA Kubernetes GPU Operator
Explore how the NVIDIA GPU Operator automates GPU infrastructure management in Kubernetes, transforming manual GPU setup into a declarative, cloud-native system.
What is the NVIDIA GPU Operator?
The NVIDIA GPU Operator automates the management of all NVIDIA software components required to run GPU-accelerated workloads in Kubernetes. Instead of manually installing GPU drivers, container runtime configurations, device plugins, and monitoring tools on every node, the GPU Operator handles everything through standard Kubernetes primitives.
It treats GPUs as a fully automated, software-defined resource that can be provisioned, configured, and upgraded declaratively—transforming GPU infrastructure into a cloud-native, self-managing system.
The Manual GPU Setup Problem
Before the GPU Operator, setting up GPU support in a Kubernetes cluster was a multi-step manual process prone to errors and inconsistencies across nodes. Consider what a cluster administrator had to do for each GPU node:
Figure 1: Manual GPU Setup vs. GPU Operator Approach
Why Manual Setup Fails at Scale
- Configuration Drift: Each node might have different driver versions, toolkit versions, or configuration files, leading to unpredictable behavior.
- Update Complexity: Upgrading drivers requires SSHing to nodes, potentially rebooting, and risking downtime or misconfiguration.
- New Node Onboarding: Adding new GPU nodes to the cluster requires repeating the entire manual process.
- No Self-Healing: If a component fails (e.g., the device plugin crashes), it won't recover without manual intervention.
- Security Vulnerabilities: Outdated drivers or toolkits may have security issues, but tracking and patching them across nodes is tedious.
The GPU Operator solves all of these problems by treating GPU infrastructure as software, managed through Kubernetes native primitives.
GPU Operator Architecture
The GPU Operator is built on the Operator Pattern—a Kubernetes design pattern where custom controllers extend the Kubernetes API to manage complex applications. The GPU Operator watches for GPU nodes and automatically deploys the entire GPU software stack as DaemonSets.
Figure 2: GPU Operator Architecture and Component Stack
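The operator's desired state lives in a single ClusterPolicy custom resource, normally created by the Helm chart. The sketch below is illustrative and shows only a handful of the fields the real resource carries:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    version: "535.129.03"
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: false   # optional; see the MIG Manager section below

The operator reconciles this single resource into the per-component DaemonSets described next, so changing a field here is how you change what runs on every GPU node.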
Core Components Explained
The GPU Operator manages seven key components, most of them deployed as DaemonSets on GPU nodes (NFD runs cluster-wide, and the MIG Manager is optional):
1. NVIDIA Driver DaemonSet
Purpose: Installs and manages NVIDIA kernel modules without polluting the host OS.
How it works: Runs a privileged container that compiles and loads kernel modules into the host kernel. The driver version is specified in the ClusterPolicy CRD and can be upgraded without SSH access to nodes.
# Driver container mounts host directories and loads modules
$ kubectl get ds -n gpu-operator nvidia-driver-daemonset
NAME                      DESIRED   CURRENT   READY
nvidia-driver-daemonset   3         3         3

# Verify driver loaded on nodes
$ kubectl exec -it <driver-pod> -n gpu-operator -- nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03    Driver Version: 535.129.03    CUDA Version: 12.2   |
+-----------------------------------------------------------------------------+
Key benefit: Drivers are containerized and version-controlled. Rolling back a bad driver update is as simple as reverting a Helm value.
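For instance, rolling the driver forward or back is just a Helm values change; the version strings below are illustrative:

# Move to a new driver version
helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator --reuse-values \
    --set driver.version="535.129.03"

# Roll back by pinning the previous version again
helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator --reuse-values \
    --set driver.version="535.104.05"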
2. NVIDIA Container Toolkit
Purpose: Configures the container runtime (containerd/Docker/CRI-O) to expose GPUs to containers.
How it works: Installs nvidia-container-runtime-hook, which intercepts container creation and injects GPU devices, libraries, and binaries into the container's filesystem.
# Container toolkit configures runtime to inject GPUs
# When pod requests nvidia.com/gpu: 1, the runtime:
#   1. Mounts /dev/nvidia* devices
#   2. Injects CUDA libraries (libcuda.so, libnvidia-ml.so)
#   3. Sets LD_LIBRARY_PATH
#   4. Exposes driver version to container

# Verify runtime sees GPUs
# (note: --limits was removed from kubectl run in newer kubectl releases;
#  on recent versions, apply a pod manifest with resources.limits instead)
$ kubectl run gpu-test --rm -it --restart=Never \
    --image=nvidia/cuda:12.2.0-base-ubuntu22.04 \
    --limits=nvidia.com/gpu=1 -- nvidia-smi
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxx)
3. NVIDIA Device Plugin
Purpose: Advertises GPU resources to the Kubernetes scheduler so pods can request GPUs.
How it works: Implements the Kubernetes Device Plugin API. Discovers GPUs on the node via NVML (NVIDIA Management Library), advertises them as nvidia.com/gpu, and handles allocation when pods request GPUs.
# Device plugin registers GPUs with kubelet
$ kubectl get nodes -o json | \
    jq '.items[].status.allocatable | ."nvidia.com/gpu"'
"4"   # Node has 4 allocatable GPUs

# Pod requests GPU in resource limits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1   # Scheduler allocates 1 GPU from available pool
Advanced capabilities: Supports MIG (Multi-Instance GPU), GPU time-slicing, and topology-aware scheduling.
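As one concrete example of these capabilities, time-slicing is driven by a device-plugin ConfigMap that the operator references via devicePlugin.config.name. A minimal sketch, with the ConfigMap name and replica count as assumptions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

With replicas: 4, a node with 4 physical GPUs advertises nvidia.com/gpu: 16; pods sharing a sliced GPU get no memory or fault isolation, so this suits bursty inference workloads more than training.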
4. DCGM Exporter
Purpose: Exports GPU telemetry metrics for Prometheus/Grafana monitoring.
How it works: Uses NVIDIA Data Center GPU Manager (DCGM) API to collect metrics like GPU utilization, memory usage, temperature, power consumption, and ECC errors. Exposes them on :9400/metrics in Prometheus format.
# Sample DCGM metrics
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxx"} 85   # GPU 85% utilized
DCGM_FI_DEV_FB_USED{gpu="0"} 45056                # ~45GB VRAM used
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 72                  # 72°C temperature
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 420              # 420W power draw

# Prometheus scrapes these metrics for alerting and dashboards
5. GPU Feature Discovery (GFD)
Purpose: Labels nodes with GPU capabilities for advanced scheduling.
How it works: Queries GPU properties via NVML and adds labels like GPU model, CUDA compute capability, memory size, and driver version to the node's metadata.
# GFD adds labels to nodes
$ kubectl get nodes --show-labels | grep nvidia
node1   nvidia.com/cuda.driver.major=535
        nvidia.com/cuda.runtime.major=12
        nvidia.com/gpu.compute.major=8        # Ampere/Ada (compute capability 8.x)
        nvidia.com/gpu.count=4
        nvidia.com/gpu.memory=81920           # 80GB per GPU
        nvidia.com/gpu.product=A100-SXM4-80GB

# Use labels in pod nodeSelector
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB    # Only schedule on A100 nodes
    nvidia.com/gpu.compute.major: "8"         # Exact match on compute capability 8.x
6. Node Feature Discovery (NFD)
Purpose: General hardware feature discovery (CPU, PCI topology, kernel version).
How it works: Detects hardware features beyond GPUs, such as AVX-512 support, PCIe topology, and NUMA layout, enabling fine-grained scheduling based on hardware capabilities. The GPU Operator also relies on NFD's PCI labels to identify which nodes carry NVIDIA hardware.
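A minimal sketch of what NFD labels look like on a node; the label names follow NFD's conventions, but the exact set and values depend on the hardware:

$ kubectl get node node1 -o json | jq '.metadata.labels' | grep feature.node
"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true"
"feature.node.kubernetes.io/kernel-version.major": "5"
"feature.node.kubernetes.io/pci-10de.present": "true"   # 0x10de = NVIDIA PCI vendor ID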
7. NVIDIA MIG Manager (Optional)
Purpose: Manages Multi-Instance GPU (MIG) partitioning on A100/H100 GPUs.
How it works: MIG allows splitting a single GPU into up to 7 isolated instances. The MIG Manager reads a ConfigMap defining desired MIG profiles and dynamically reconfigures GPUs, advertising MIG instances as separate resources (nvidia.com/mig-1g.5gb: 3).
# Example MIG configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7   # Split each GPU into 7 × 1g.5gb instances
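Once MIG instances are advertised, pods request them like any other extended resource. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice instead of a full GPU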
Component Startup Dependencies
Figure 3: Component Startup Dependencies
The components have strict ordering requirements; a quick way to watch the rollout respect this order is shown after the list:
- Phase 1: Driver installation must complete first
- Phase 2: Container toolkit configures the runtime
- Phase 3: Device plugin, monitoring, and discovery run in parallel
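On a fresh node you can observe this ordering by watching the operator namespace while the stack comes up; the driver and toolkit pods reach Ready before the device plugin and exporters do:

# Watch components become Ready in dependency order
kubectl get pods -n gpu-operator -w

# Or inspect the per-component DaemonSets once the rollout settles
kubectl get ds -n gpu-operator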
Installation and Configuration
Basic Installation
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.version="535.129.03"
Advanced Configuration
# values.yaml
driver:
  version: "535.129.03"
  repository: nvcr.io/nvidia
toolkit:
  version: v1.14.3
devicePlugin:
  enabled: true
  config:
    name: time-slicing-config
dcgm:
  enabled: true
mig:
  strategy: mixed   # single, mixed
nodeStatusExporter:
  enabled: true
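These values are applied the same way as the basic install, just pointing Helm at the file:

helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    -f values.yaml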
Best Practices
1. Version Management
Always pin driver versions in production:
driver:
  version: "535.129.03"   # Don't use "latest"
2. Node Labeling
Use taints and labels to control GPU node scheduling:
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
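With that taint in place, GPU workloads need a matching toleration. A minimal sketch, mirroring the key and effect used above:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule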
3. Resource Quotas
Set resource quotas to prevent GPU resource exhaustion:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # Max 8 GPUs per namespace (extended resources are quota'd via the requests. prefix)
4. Monitoring Setup
Deploy Prometheus and Grafana to visualize DCGM metrics:
# DCGM exporter metrics endpoint
kubectl port-forward -n gpu-operator \
    $(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o name) \
    9400:9400
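With the port-forward running, you can spot-check the raw metrics before wiring up Prometheus scraping and dashboards:

curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL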
5. Upgrade Strategy
Use canary deployments for driver upgrades:
# Create a canary node pool with new driver
kubectl label nodes gpu-node-1 driver-version=new

# Test workloads on canary nodes
# If successful, roll out to all nodes
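One way to steer test workloads onto the canary pool is to reuse the label applied above as a nodeSelector; this is a sketch, with the label key and value being the ones chosen in this example:

spec:
  nodeSelector:
    driver-version: new   # only land on canary nodes
  containers:
  - name: smoke-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1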
Troubleshooting
Check Operator Status
# Verify operator is running
kubectl get pods -n gpu-operator

# Check ClusterPolicy status (cluster-scoped resource)
kubectl get clusterpolicies.nvidia.com

# View operator logs
kubectl logs -n gpu-operator -l app=gpu-operator
Driver Issues
# Check driver DaemonSet status
kubectl get ds -n gpu-operator nvidia-driver-daemonset

# View driver logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify driver loaded on host
kubectl exec -it <driver-pod> -n gpu-operator -- nvidia-smi
Device Plugin Issues
# Check if GPUs are advertised
kubectl get nodes -o json | jq '.items[].status.allocatable'

# View device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
Key Takeaways
- Automation: The GPU Operator eliminates manual GPU setup, treating GPU infrastructure as declarative Kubernetes resources.
- Consistency: All GPU nodes run identical software versions, preventing configuration drift.
- Cloud-Native: Leverages Kubernetes primitives (DaemonSets, CRDs, Operators) for GPU lifecycle management.
- Observability: Built-in metrics export via DCGM for comprehensive GPU monitoring.
- Flexibility: Supports advanced features like MIG, time-slicing, and topology-aware scheduling.
The NVIDIA GPU Operator represents a paradigm shift in GPU cluster management—from imperative, error-prone manual processes to declarative, self-healing automation that scales seamlessly.
