Understanding NVIDIA Kubernetes GPU Operator
Explore how the NVIDIA GPU Operator automates GPU infrastructure management in Kubernetes, transforming manual GPU setup into a declarative, cloud-native system.
What is the NVIDIA GPU Operator?
The NVIDIA GPU Operator automates the management of all NVIDIA software components required to run GPU-accelerated workloads in Kubernetes. Instead of manually installing GPU drivers, container runtime configurations, device plugins, and monitoring tools on every node, the GPU Operator handles everything through standard Kubernetes primitives.
It treats GPUs as a fully automated, software-defined resource that can be provisioned, configured, and upgraded declaratively—transforming GPU infrastructure into a cloud-native, self-managing system.
The Manual GPU Setup Problem
Before the GPU Operator, setting up GPU support in a Kubernetes cluster was a multi-step manual process prone to errors and inconsistencies across nodes. Consider what a cluster administrator had to do for each GPU node:
Figure 1: Manual GPU Setup vs. GPU Operator Approach
Why Manual Setup Fails at Scale
- Configuration Drift: Each node might have different driver versions, toolkit versions, or configuration files, leading to unpredictable behavior.
- Update Complexity: Upgrading drivers requires SSHing to nodes, potentially rebooting, and risking downtime or misconfiguration.
- New Node Onboarding: Adding new GPU nodes to the cluster requires repeating the entire manual process.
- No Self-Healing: If a component fails (e.g., the device plugin crashes), it won't recover without manual intervention.
- Security Vulnerabilities: Outdated drivers or toolkits may have security issues, but tracking and patching them across nodes is tedious.
The GPU Operator solves all of these problems by treating GPU infrastructure as software, managed through Kubernetes native primitives.
GPU Operator Architecture
The GPU Operator is built on the Operator Pattern—a Kubernetes design pattern where custom controllers extend the Kubernetes API to manage complex applications. The GPU Operator watches for GPU nodes and automatically deploys the entire GPU software stack as DaemonSets.
Figure 2: GPU Operator Architecture and Component Stack
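The operator's desired state lives in a single ClusterPolicy custom resource, normally created by the Helm chart. The sketch below is illustrative and shows only a handful of the fields the real resource carries:

apiVersion: nvidia.com/v1
kind: ClusterPolicy
metadata:
  name: cluster-policy
spec:
  driver:
    enabled: true
    version: "535.129.03"
  toolkit:
    enabled: true
  devicePlugin:
    enabled: true
  dcgmExporter:
    enabled: true
  gfd:
    enabled: true
  migManager:
    enabled: false   # optional; see the MIG Manager section below

The operator reconciles this single resource into the per-component DaemonSets described next, so changing a field here is how you change what runs on every GPU node.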
Core Components Explained
The GPU Operator manages seven key components, most of them deployed as DaemonSets on GPU nodes (NFD runs cluster-wide, and the MIG Manager is optional):
1. NVIDIA Driver DaemonSet
Purpose: Installs and manages NVIDIA kernel modules without polluting the host OS.
How it works: Runs a privileged container that compiles and loads kernel modules into the host kernel. The driver version is specified in the ClusterPolicy CRD and can be upgraded without SSH access to nodes.
# Driver container mounts host directories and loads modules
$ kubectl get ds -n gpu-operator nvidia-driver-daemonset
NAME                      DESIRED   CURRENT   READY
nvidia-driver-daemonset   3         3         3

# Verify driver loaded on nodes
$ kubectl exec -it <driver-pod> -n gpu-operator -- nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.129.03    Driver Version: 535.129.03    CUDA Version: 12.2   |
+-----------------------------------------------------------------------------+
Key benefit: Drivers are containerized and version-controlled. Rolling back a bad driver update is as simple as reverting a Helm value.
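For instance, rolling the driver forward or back is just a Helm values change; the version strings below are illustrative:

# Move to a new driver version
helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator --reuse-values \
    --set driver.version="535.129.03"

# Roll back by pinning the previous version again
helm upgrade gpu-operator nvidia/gpu-operator \
    -n gpu-operator --reuse-values \
    --set driver.version="535.104.05"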
2. NVIDIA Container Toolkit
Purpose: Configures the container runtime (containerd/Docker/CRI-O) to expose GPUs to containers.
How it works: Installs nvidia-container-runtime-hook, which intercepts container creation and injects GPU devices, libraries, and binaries into the container's filesystem.
# Container toolkit configures runtime to inject GPUs
# When pod requests nvidia.com/gpu: 1, the runtime:
#   1. Mounts /dev/nvidia* devices
#   2. Injects CUDA libraries (libcuda.so, libnvidia-ml.so)
#   3. Sets LD_LIBRARY_PATH
#   4. Exposes driver version to container

# Verify runtime sees GPUs
# (note: --limits was removed from kubectl run in newer kubectl releases;
#  on recent versions, apply a pod manifest with resources.limits instead)
$ kubectl run gpu-test --rm -it --restart=Never \
    --image=nvidia/cuda:12.2.0-base-ubuntu22.04 \
    --limits=nvidia.com/gpu=1 -- nvidia-smi
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-xxx)
3. NVIDIA Device Plugin
Purpose: Advertises GPU resources to the Kubernetes scheduler so pods can request GPUs.
How it works: Implements the Kubernetes Device Plugin API. Discovers GPUs on the node via NVML (NVIDIA Management Library), advertises them as nvidia.com/gpu, and handles allocation when pods request GPUs.
# Device plugin registers GPUs with kubelet
$ kubectl get nodes -o json | \
    jq '.items[].status.allocatable | ."nvidia.com/gpu"'
"4"   # Node has 4 allocatable GPUs

# Pod requests GPU in resource limits
apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/gpu: 1   # Scheduler allocates 1 GPU from available pool
Advanced capabilities: Supports MIG (Multi-Instance GPU), GPU time-slicing, and topology-aware scheduling.
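As one concrete example of these capabilities, time-slicing is driven by a device-plugin ConfigMap that the operator references via devicePlugin.config.name. A minimal sketch, with the ConfigMap name and replica count as assumptions:

apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config
  namespace: gpu-operator
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4   # each physical GPU is advertised as 4 schedulable GPUs

With replicas: 4, a node with 4 physical GPUs advertises nvidia.com/gpu: 16; pods sharing a sliced GPU get no memory or fault isolation, so this suits bursty inference workloads more than training.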
4. DCGM Exporter
Purpose: Exports GPU telemetry metrics for Prometheus/Grafana monitoring.
How it works: Uses NVIDIA Data Center GPU Manager (DCGM) API to collect metrics like GPU utilization, memory usage, temperature, power consumption, and ECC errors. Exposes them on :9400/metrics in Prometheus format.
# Sample DCGM metrics
DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-xxx"} 85   # GPU 85% utilized
DCGM_FI_DEV_FB_USED{gpu="0"} 45056                # ~45GB VRAM used
DCGM_FI_DEV_GPU_TEMP{gpu="0"} 72                  # 72°C temperature
DCGM_FI_DEV_POWER_USAGE{gpu="0"} 420              # 420W power draw

# Prometheus scrapes these metrics for alerting and dashboards
5. GPU Feature Discovery (GFD)
Purpose: Labels nodes with GPU capabilities for advanced scheduling.
How it works: Queries GPU properties via NVML and adds labels like GPU model, CUDA compute capability, memory size, and driver version to the node's metadata.
# GFD adds labels to nodes
$ kubectl get nodes --show-labels | grep nvidia
node1   nvidia.com/cuda.driver.major=535
        nvidia.com/cuda.runtime.major=12
        nvidia.com/gpu.compute.major=8        # Ampere/Ada (compute capability 8.x)
        nvidia.com/gpu.count=4
        nvidia.com/gpu.memory=81920           # 80GB per GPU
        nvidia.com/gpu.product=A100-SXM4-80GB

# Use labels in pod nodeSelector
spec:
  nodeSelector:
    nvidia.com/gpu.product: A100-SXM4-80GB    # Only schedule on A100 nodes
    nvidia.com/gpu.compute.major: "8"         # Exact match on compute capability 8.x
6. Node Feature Discovery (NFD)
Purpose: General hardware feature discovery (CPU, PCI topology, kernel version).
How it works: Detects hardware features beyond GPUs, such as AVX-512 support, PCIe topology, and NUMA layout, enabling fine-grained scheduling based on hardware capabilities. The GPU Operator also relies on NFD's PCI labels to identify which nodes carry NVIDIA hardware.
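A minimal sketch of what NFD labels look like on a node; the label names follow NFD's conventions, but the exact set and values depend on the hardware:

$ kubectl get node node1 -o json | jq '.metadata.labels' | grep feature.node
"feature.node.kubernetes.io/cpu-cpuid.AVX512F": "true"
"feature.node.kubernetes.io/kernel-version.major": "5"
"feature.node.kubernetes.io/pci-10de.present": "true"   # 0x10de = NVIDIA PCI vendor ID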
7. NVIDIA MIG Manager (Optional)
Purpose: Manages Multi-Instance GPU (MIG) partitioning on A100/H100 GPUs.
How it works: MIG allows splitting a single GPU into up to 7 isolated instances. The MIG Manager reads a ConfigMap defining desired MIG profiles and dynamically reconfigures GPUs, advertising MIG instances as separate resources (nvidia.com/mig-1g.5gb: 3).
# Example MIG configuration
apiVersion: v1
kind: ConfigMap
metadata:
  name: mig-parted-config
data:
  config.yaml: |
    version: v1
    mig-configs:
      all-1g.5gb:
        - devices: all
          mig-enabled: true
          mig-devices:
            "1g.5gb": 7   # Split each GPU into 7 × 1g.5gb instances
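Once MIG instances are advertised, pods request them like any other extended resource. A minimal sketch:

apiVersion: v1
kind: Pod
metadata:
  name: mig-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1   # one 1g.5gb MIG slice instead of a full GPU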
Component Startup Dependencies
Figure 3: Component Startup Dependencies
The components have strict ordering requirements; a quick way to watch the rollout respect this order is shown after the list:
- Phase 1: Driver installation must complete first
- Phase 2: Container toolkit configures the runtime
- Phase 3: Device plugin, monitoring, and discovery run in parallel
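On a fresh node you can observe this ordering by watching the operator namespace while the stack comes up; the driver and toolkit pods reach Ready before the device plugin and exporters do:

# Watch components become Ready in dependency order
kubectl get pods -n gpu-operator -w

# Or inspect the per-component DaemonSets once the rollout settles
kubectl get ds -n gpu-operator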
Installation and Configuration
Basic Installation
# Add NVIDIA Helm repository
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator
helm install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    --set driver.version="535.129.03"
Advanced Configuration
# values.yaml
driver:
  version: "535.129.03"
  repository: nvcr.io/nvidia
toolkit:
  version: v1.14.3
devicePlugin:
  enabled: true
  config:
    name: time-slicing-config
dcgm:
  enabled: true
mig:
  strategy: mixed   # single, mixed
nodeStatusExporter:
  enabled: true
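These values are applied the same way as the basic install, just pointing Helm at the file:

helm upgrade --install gpu-operator nvidia/gpu-operator \
    --namespace gpu-operator \
    --create-namespace \
    -f values.yaml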
Best Practices
1. Version Management
Always pin driver versions in production:
driver:
  version: "535.129.03"   # Don't use "latest"
2. Node Labeling
Use taints and labels to control GPU node scheduling:
# Taint GPU nodes to prevent non-GPU workloads
kubectl taint nodes gpu-node-1 nvidia.com/gpu=present:NoSchedule

# Label nodes by GPU type
kubectl label nodes gpu-node-1 gpu-type=a100
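With that taint in place, GPU workloads need a matching toleration. A minimal sketch, mirroring the key and effect used above:

spec:
  tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule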
3. Resource Quotas
Set resource quotas to prevent GPU resource exhaustion:
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
spec:
  hard:
    requests.nvidia.com/gpu: "8"   # Max 8 GPUs per namespace (extended resources are quota'd via the requests. prefix)
4. Monitoring Setup
Deploy Prometheus and Grafana to visualize DCGM metrics:
# DCGM exporter metrics endpoint
kubectl port-forward -n gpu-operator \
    $(kubectl get pod -n gpu-operator -l app=nvidia-dcgm-exporter -o name) \
    9400:9400
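With the port-forward running, you can spot-check the raw metrics before wiring up Prometheus scraping and dashboards:

curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL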
5. Upgrade Strategy
Use canary deployments for driver upgrades:
# Create a canary node pool with new driver
kubectl label nodes gpu-node-1 driver-version=new

# Test workloads on canary nodes
# If successful, roll out to all nodes
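One way to steer test workloads onto the canary pool is to reuse the label applied above as a nodeSelector; this is a sketch, with the label key and value being the ones chosen in this example:

spec:
  nodeSelector:
    driver-version: new   # only land on canary nodes
  containers:
  - name: smoke-test
    image: nvidia/cuda:12.2.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1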
Troubleshooting
Check Operator Status
# Verify operator is running
kubectl get pods -n gpu-operator

# Check ClusterPolicy status (cluster-scoped resource)
kubectl get clusterpolicies.nvidia.com

# View operator logs
kubectl logs -n gpu-operator -l app=gpu-operator
Driver Issues
# Check driver DaemonSet status
kubectl get ds -n gpu-operator nvidia-driver-daemonset

# View driver logs
kubectl logs -n gpu-operator -l app=nvidia-driver-daemonset

# Verify driver loaded on host
kubectl exec -it <driver-pod> -n gpu-operator -- nvidia-smi
Device Plugin Issues
# Check if GPUs are advertised
kubectl get nodes -o json | jq '.items[].status.allocatable'

# View device plugin logs
kubectl logs -n gpu-operator -l app=nvidia-device-plugin-daemonset
Key Takeaways
- Automation: The GPU Operator eliminates manual GPU setup, treating GPU infrastructure as declarative Kubernetes resources.
- Consistency: All GPU nodes run identical software versions, preventing configuration drift.
- Cloud-Native: Leverages Kubernetes primitives (DaemonSets, CRDs, Operators) for GPU lifecycle management.
- Observability: Built-in metrics export via DCGM for comprehensive GPU monitoring.
- Flexibility: Supports advanced features like MIG, time-slicing, and topology-aware scheduling.
The NVIDIA GPU Operator represents a paradigm shift in GPU cluster management—from imperative, error-prone manual processes to declarative, self-healing automation that scales seamlessly.
