
Validate setup for ComputeDomain allocation


Prerequisites

This page assumes that you have followed the installation instructions and that all relevant GPU Operator components are running and in a Ready state.
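
For example, a quick look at the GPU Operator pods (assuming the commonly used gpu-operator namespace; adjust if you installed it elsewhere) should show all of them Running or Completed:

$ kubectl get pods -n gpu-operator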

As a quick reminder (to help avoid common sources of error):

  • Use Kubernetes 1.32 or later, with DRA and CDI enabled on all nodes (docs, more docs); a quick check for the DRA API is sketched after this list.
  • If you have nvidia-imex-* packages installed (via your Linux distribution's package manager): disable the nvidia-imex.service systemd unit (on all GPU nodes), with e.g. systemctl disable --now nvidia-imex.service && systemctl mask nvidia-imex.service.
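
For the first bullet, one quick way to confirm that the DRA API is being served on your cluster (DeviceClass names vary by driver version, so treat this as a rough sanity check):

$ kubectl api-resources --api-group=resource.k8s.io
$ kubectl get deviceclasses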

Validate that the DRA driver is running

Example:

$ kubectl get pod -n nvidia-dra-driver-gpu
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-k8s-dra-driver-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-7kdg9          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bd6gn          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bzm6p          1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xjm4p          1/1     Running   0          7m27s

Confirm that all expected nodes run a *-k8s-dra-driver-kubelet-plugin-* pod and that the READY column indicates readiness for all listed pods.
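
For a programmatic variant of this check, kubectl wait can block until all driver pods report readiness (adjust the namespace if you installed the chart elsewhere):

$ kubectl wait --for=condition=Ready pod --all -n nvidia-dra-driver-gpu --timeout=120s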

Validate clique node labels

Next up, a meaningful validation step is to confirm that all GPU nodes have been decorated with the Kubernetes node label nvidia.com/gpu.clique. Example:

$ (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
    /usr/bin/jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
    column -t
NODE                 LABEL                  CLIQUE
gb-nvl-043-bianca-1  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
gb-nvl-043-bianca-2  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766

Notes for troubleshooting:

  • The label value is expected to have the shape <CLUSTER_UUID>.<CLIQUE_ID>, as in the example output above (a per-node query for this label is sketched after this list).
  • The GPU Feature Discovery component of the GPU Operator inspects the nodes and sets these labels (docs).
  • On a given node, per-GPU clique configuration can be inspected with e.g. nvidia-smi -q | grep -E "ClusterUUID|CliqueId".
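
To read the label off a single node (for example, when two nodes unexpectedly report different clique values), a jsonpath query works; <NODE> below is a placeholder for one of your node names:

$ kubectl get node <NODE> -o jsonpath='{.metadata.labels.nvidia\.com/gpu\.clique}'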

Run validation workloads

1) IMEX channel injection test

Run a simple test to validate that IMEX daemons are started and IMEX channels are injected:

cat <<EOF > imex-channel-injection.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
EOF
$ kubectl apply -f imex-channel-injection.yaml
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
imex-channel-injection   1/1     Running   0          3s
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                 READY   STATUS    RESTARTS   AGE
imex-channel-injection-6k9sx-ffgpf   1/1     Running   0          3s
$ kubectl logs imex-channel-injection
total 0
drwxr-xr-x 2 root root     60 Feb 19 10:43 .
drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
$ kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
I0731 14:57:34.920143       1 main.go:176] config: &{gb-nvl-043-compute07 e273cacb-141a-478b-9c24-263c784026b9 imex-channel-injection default 6a130f54-faaa-4b8f-847f-be44ab70f917.32766 192.168.34.137}
[...]
I0731 14:57:34.920644       1 process.go:152] Start watchdog
I0731 14:57:34.920685       1 main.go:233] wait for nodes update
[...]
I0731 14:57:34.926056       1 reflector.go:436] "Caches populated" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
[...]
I0731 14:57:35.024250       1 round_trippers.go:632] "Response" verb="PUT" url="https://10.96.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=2
I0731 14:57:35.024442       1 computedomain.go:214] IP set changed: previous: map[]; new: map[192.168.34.137:{}]
I0731 14:57:35.024601       1 main.go:331] Current /etc/nvidia-imex/nodes_config.cfg:
192.168.34.137
I0731 14:57:35.024610       1 main.go:243] Got update, (re)start IMEX daemon
I0731 14:57:35.024616       1 process.go:67] Start: /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
I0731 14:57:35.024964       1 process.go:92] Started process with pid 47
I0731 14:57:35.024970       1 main.go:233] wait for nodes update
I0731 14:57:35.029705       1 computedomain.go:218] IP set did not change
WARNING: failed to open IMEX log file  errno = No such file or directory
INFO: using stderr for IMEX logging
IMEX Log initializing at: 7/31/2025 14:57:35.030
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX version 570.133.20 is running with the following configuration options
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging level = 4
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging file name/path = 
[Jul 31 2025 14:57:35] [INFO] [tid 47] Append to log file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] Max Log file size = 1024 (MBs)
[Jul 31 2025 14:57:35] [INFO] [tid 47] Use Syslog file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX Library communication bind interface = 
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX library communication bind port = 50000
[Jul 31 2025 14:57:35] [INFO] [tid 47] Identified this node as ID 0, using bind IP of '192.168.34.137', and network interface of eth0
[Jul 31 2025 14:57:35] [INFO] [tid 47] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist.  Assuming no previous importers.
[Jul 31 2025 14:57:35] [INFO] [tid 47] NvGpu Library version matched with GPU Driver version
[Jul 31 2025 14:57:35] [INFO] [tid 71] Started processing of incoming messages.
[...]
[Jul 31 2025 14:57:35] [INFO] [tid 47] Creating gRPC channels to all peers (nPeers = 1).
[Jul 31 2025 14:57:35] [INFO] [tid 74] Started processing of incoming messages.
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Jul 31 2025 14:57:35] [INFO] [tid 75] Connection established to node 0 with ip address 192.168.34.137. Number of times connected: 1
[Jul 31 2025 14:57:35] [INFO] [tid 47] GPU event successfully subscribed
I0731 14:57:36.823729       1 computedomain.go:218] IP set did not change
[...]
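
Optionally, before cleaning up, inspect the ComputeDomain object itself; its status is maintained by the controller and should reference the node that joined the domain (exact status fields vary by driver version):

$ kubectl get computedomains.resource.nvidia.com imex-channel-injection -o yaml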

Clean up:

$ kubectl delete -f imex-channel-injection.yaml
computedomain.resource.nvidia.com "imex-channel-injection" deleted
pod "imex-channel-injection" deleted

2) Multi-node nvbandwidth test (with MPI)

This runs a two-node nvbandwidth test that consumes four GPUs on each node.

Install the MPI Operator

kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
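
Optionally confirm that the operator is up before proceeding (the default manifest installs into the mpi-operator namespace):

$ kubectl get pods -n mpi-operator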

Create the spec file

cat <<EOF > nvbandwidth-test-job.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel

---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
EOF
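
A note on sizing: spec.numNodes on the ComputeDomain matches the number of MPI workers, slotsPerWorker matches the per-worker GPU limit (nvidia.com/gpu: 4), and the mpirun -np value is slotsPerWorker × replicas (4 × 2 = 8). The podAffinity rule with topologyKey: nvidia.com/gpu.clique keeps all workers within the same clique. As a sketch, scaling the same test to four workers with four GPUs each would mean changing only these fields:

numNodes: 4        # in the ComputeDomain spec (one per worker node)
replicas: 4        # under Worker in the MPIJob
- -np
- "16"             # slotsPerWorker (4) x replicas (4)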

Apply the spec and inspect the output

$ kubectl apply -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain created
mpijob.kubeflow.org/nvbandwidth-test created
$ kubectl get pods
NAME                              READY   STATUS      RESTARTS   AGE
nvbandwidth-test-launcher-lzv84   1/1     Running     0          3s
nvbandwidth-test-worker-0         1/1     Running     0          15s
nvbandwidth-test-worker-1         1/1     Running     0          15s
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                          READY   STATUS    RESTARTS   AGE
nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s
$ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
[nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 4 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
nvbandwidth Version: v0.7
Built from Git version: v0.7

MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 12080
CUDA Driver Version: 12080
Driver Version: 570.124.06

Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
           0         1         2         3         4         5         6         7
 0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
 1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
 2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
 3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
 4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
 5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
 6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
 7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A

SUM multinode_device_to_device_memcpy_read_ce 44685.29

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
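
If the launcher log is empty or the job appears stuck, inspect the MPIJob itself; note that with launcherCreationPolicy: WaitForWorkersReady the launcher pod is only created once both workers are Ready:

$ kubectl get mpijob nvbandwidth-test
$ kubectl describe mpijob nvbandwidth-test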

Clean up

$ kubectl delete -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com "nvbandwidth-test-compute-domain" deleted
mpijob.kubeflow.org "nvbandwidth-test" deleted