# Validate setup for ComputeDomain allocation
This page assumes that you have followed the installation instructions, and that all relevant GPU Operator components are running and in a `Ready` state.
As a quick reminder (to overcome common sources of error):

- Use Kubernetes 1.32 or later, with DRA and CDI enabled on all nodes (docs, more docs). A quick probe for both prerequisites is sketched after this list.
- If you have `nvidia-imex-*` packages installed (via your Linux distribution's package manager): disable the `nvidia-imex.service` systemd unit (on all GPU nodes), with e.g. `systemctl disable --now nvidia-imex.service && systemctl mask nvidia-imex.service`.
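As a minimal sketch of how to probe those prerequisites (assuming `kubectl` already points at the cluster): a DRA-enabled 1.32+ cluster serves the `resource.k8s.io` API group, so listing its resources and `ResourceSlice` objects should succeed.

```bash
# Sketch: probe DRA prerequisites.
# Confirm the server version is 1.32 or later:
kubectl version
# A DRA-enabled cluster serves the resource.k8s.io API group:
kubectl api-resources --api-group=resource.k8s.io
# Listing ResourceSlices should work (the list may be empty on a fresh cluster):
kubectl get resourceslices
```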
First, check the driver pods. Example:
```console
$ kubectl get pod -n nvidia-dra-driver-gpu
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-k8s-dra-driver-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-7kdg9          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bd6gn          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bzm6p          1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xjm4p          1/1     Running   0          7m27s
```
Confirm that all expected nodes run a `*-k8s-dra-driver-kubelet-plugin-*` pod, and that the `READY` column indicates readiness for all listed pods.
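To script this check instead of eyeballing the output, `kubectl wait` can assert readiness of all driver pods; a minimal sketch (the timeout value is an arbitrary choice):

```bash
# Block until every pod in the driver namespace reports Ready (or time out).
kubectl wait --for=condition=Ready pod --all \
  -n nvidia-dra-driver-gpu --timeout=120s
```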
Next, confirm that all GPU nodes have been decorated with the Kubernetes node label `nvidia.com/gpu.clique`. Example:
```console
$ (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
    /usr/bin/jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
    column -t
NODE                 LABEL                  CLIQUE
gb-nvl-043-bianca-1  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
gb-nvl-043-bianca-2  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
```
Notes for troubleshooting:

- The label value is expected to have the shape `<CLIQUE_UUID>.<CLIQUE_ID>` (as in the example output above).
- The GPU Feature Discovery component of the GPU Operator inspects the nodes and sets these labels (docs).
- On a given node, per-GPU clique configuration can be inspected with e.g. `nvidia-smi -q | grep -E "ClusterUUID|CliqueId"`.
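If a node is missing from the table above, a label-absence selector narrows down which nodes lack the clique label; a small sketch:

```bash
# List nodes that do NOT carry the clique label (ideally, no GPU node shows up).
kubectl get nodes -l '!nvidia.com/gpu.clique'
```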
Run a simple test to validate that IMEX daemons are started and IMEX channels are injected:
```bash
cat <<EOF > imex-channel-injection.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
EOF
```
```console
$ kubectl apply -f imex-channel-injection.yaml
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created
```
```console
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
imex-channel-injection   1/1     Running   0          3s
```
```console
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                 READY   STATUS    RESTARTS   AGE
imex-channel-injection-6k9sx-ffgpf   1/1     Running   0          3s
```
```console
$ kubectl logs imex-channel-injection
total 0
drwxr-xr-x 2 root root     60 Feb 19 10:43 .
drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
```
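The `channel0` entry is the injected IMEX channel device. As an additional cross-check directly on the GPU node (an assumption about driver behavior, not something shown on this page): the NVIDIA driver registers a dedicated character-device class for IMEX channels, and its major number should match the one in the pod's listing (507 above).

```bash
# On the GPU node: look up the major number of the IMEX channel device class.
grep nvidia-caps-imex-channels /proc/devices
```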
```console
$ kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
I0731 14:57:34.920143       1 main.go:176] config: &{gb-nvl-043-compute07 e273cacb-141a-478b-9c24-263c784026b9 imex-channel-injection default 6a130f54-faaa-4b8f-847f-be44ab70f917.32766 192.168.34.137}
[...]
I0731 14:57:34.920644       1 process.go:152] Start watchdog
I0731 14:57:34.920685       1 main.go:233] wait for nodes update
[...]
I0731 14:57:34.926056       1 reflector.go:436] "Caches populated" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
[...]
I0731 14:57:35.024250       1 round_trippers.go:632] "Response" verb="PUT" url="https://10.96.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=2
I0731 14:57:35.024442       1 computedomain.go:214] IP set changed: previous: map[]; new: map[192.168.34.137:{}]
I0731 14:57:35.024601       1 main.go:331] Current /etc/nvidia-imex/nodes_config.cfg:
192.168.34.137
I0731 14:57:35.024610       1 main.go:243] Got update, (re)start IMEX daemon
I0731 14:57:35.024616       1 process.go:67] Start: /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
I0731 14:57:35.024964       1 process.go:92] Started process with pid 47
I0731 14:57:35.024970       1 main.go:233] wait for nodes update
I0731 14:57:35.029705       1 computedomain.go:218] IP set did not change
WARNING: failed to open IMEX log file errno = No such file or directory
INFO: using stderr for IMEX logging
IMEX Log initializing at: 7/31/2025 14:57:35.030
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX version 570.133.20 is running with the following configuration options
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging level = 4
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging file name/path =
[Jul 31 2025 14:57:35] [INFO] [tid 47] Append to log file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] Max Log file size = 1024 (MBs)
[Jul 31 2025 14:57:35] [INFO] [tid 47] Use Syslog file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX Library communication bind interface =
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX library communication bind port = 50000
[Jul 31 2025 14:57:35] [INFO] [tid 47] Identified this node as ID 0, using bind IP of '192.168.34.137', and network interface of eth0
[Jul 31 2025 14:57:35] [INFO] [tid 47] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist. Assuming no previous importers.
[Jul 31 2025 14:57:35] [INFO] [tid 47] NvGpu Library version matched with GPU Driver version
[Jul 31 2025 14:57:35] [INFO] [tid 71] Started processing of incoming messages.
[...]
[Jul 31 2025 14:57:35] [INFO] [tid 47] Creating gRPC channels to all peers (nPeers = 1).
[Jul 31 2025 14:57:35] [INFO] [tid 74] Started processing of incoming messages.
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Jul 31 2025 14:57:35] [INFO] [tid 75] Connection established to node 0 with ip address 192.168.34.137. Number of times connected: 1
[Jul 31 2025 14:57:35] [INFO] [tid 47] GPU event successfully subscribed
I0731 14:57:36.823729       1 computedomain.go:218] IP set did not change
[...]
```
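The `ComputeDomain` object itself can also be inspected; its status section (exact fields may vary by driver version, so this is only a sketch) should reflect the state of the domain:

```bash
# Show the ComputeDomain object, including its status section.
kubectl get computedomains.resource.nvidia.com imex-channel-injection -o yaml
```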
Clean up:
```console
$ kubectl delete -f imex-channel-injection.yaml
computedomain.resource.nvidia.com "imex-channel-injection" deleted
pod "imex-channel-injection" deleted
```
Next, run a two-node `nvbandwidth` test that consumes four GPUs on each node. It is orchestrated as an `MPIJob`, which requires the MPI Operator to be installed:

```console
kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
```
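Before proceeding, it is worth confirming that the operator is up (the `mpi-operator` namespace is what the upstream manifest creates; adjust if you installed it differently):

```bash
# Expect a Running mpi-operator pod.
kubectl get pods -n mpi-operator
```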
```bash
cat <<EOF > nvbandwidth-test-job.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
EOF
```
```console
$ kubectl apply -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain created
mpijob.kubeflow.org/nvbandwidth-test created
```
```console
$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
nvbandwidth-test-launcher-lzv84   1/1     Running   0          3s
nvbandwidth-test-worker-0         1/1     Running   0          15s
nvbandwidth-test-worker-1         1/1     Running   0          15s
```
```console
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                          READY   STATUS    RESTARTS   AGE
nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s
```
```console
$ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
[nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 4 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
nvbandwidth Version: v0.7
Built from Git version: v0.7
MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 12080
CUDA Driver Version: 12080
Driver Version: 570.124.06

Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A

SUM multinode_device_to_device_memcpy_read_ce 44685.29

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
```
Clean up:

```console
$ kubectl delete -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com "nvbandwidth-test-compute-domain" deleted
mpijob.kubeflow.org "nvbandwidth-test" deleted
```
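Afterwards, the ComputeDomain daemon pods should be gone as well; a quick check, reusing the label selector from above:

```bash
# Expect 'No resources found' once the ComputeDomain has been torn down.
kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
```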