# Validate setup for ComputeDomain allocation
This page assumes that you have followed the installation instructions, and that all relevant GPU Operator components are running and in a `Ready` state.
As a quick reminder (to overcome common sources of error):

- Use Kubernetes 1.32 or later, with DRA and CDI enabled on all nodes (docs, more docs). A quick probe for both prerequisites is sketched after this list.
- If you have `nvidia-imex-*` packages installed (via your Linux distribution's package manager): disable the `nvidia-imex.service` systemd unit (on all GPU nodes), with e.g. `systemctl disable --now nvidia-imex.service && systemctl mask nvidia-imex.service`.
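As a minimal sketch of how to probe those prerequisites (assuming `kubectl` already points at the cluster): a DRA-enabled 1.32+ cluster serves the `resource.k8s.io` API group, so listing its resources and `ResourceSlice` objects should succeed.

```bash
# Sketch: probe DRA prerequisites.
# Confirm the server version is 1.32 or later:
kubectl version
# A DRA-enabled cluster serves the resource.k8s.io API group:
kubectl api-resources --api-group=resource.k8s.io
# Listing ResourceSlices should work (the list may be empty on a fresh cluster):
kubectl get resourceslices
```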
First, check the driver pods. Example:
```console
$ kubectl get pod -n nvidia-dra-driver-gpu
NAME                                                           READY   STATUS    RESTARTS   AGE
nvidia-dra-driver-k8s-dra-driver-controller-67cb99d84b-5q7kj   1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-7kdg9          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bd6gn          1/1     Running   0          7m27s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-bzm6p          1/1     Running   0          7m26s
nvidia-dra-driver-k8s-dra-driver-kubelet-plugin-xjm4p          1/1     Running   0          7m27s
```
Confirm that all expected nodes run a `*-k8s-dra-driver-kubelet-plugin-*` pod, and that the `READY` column indicates readiness for all listed pods.
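To script this check instead of eyeballing the output, `kubectl wait` can assert readiness of all driver pods; a minimal sketch (the timeout value is an arbitrary choice):

```bash
# Block until every pod in the driver namespace reports Ready (or time out).
kubectl wait --for=condition=Ready pod --all \
  -n nvidia-dra-driver-gpu --timeout=120s
```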
Next, confirm that all GPU nodes have been decorated with the Kubernetes node label `nvidia.com/gpu.clique`. Example:
```console
$ (echo -e "NODE\tLABEL\tCLIQUE"; kubectl get nodes -o json | \
    /usr/bin/jq -r '.items[] | [.metadata.name, "nvidia.com/gpu.clique", .metadata.labels["nvidia.com/gpu.clique"]] | @tsv') | \
    column -t
NODE                 LABEL                  CLIQUE
gb-nvl-043-bianca-1  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
gb-nvl-043-bianca-2  nvidia.com/gpu.clique  9277d399-0674-44a9-b64e-d85bb19ce2b0.32766
```
Notes for troubleshooting:

- The label value is expected to have the shape `<CLIQUE_UUID>.<CLIQUE_ID>` (as in the example output above).
- The GPU Feature Discovery component of the GPU Operator inspects the nodes and sets these labels (docs).
- On a given node, per-GPU clique configuration can be inspected with e.g. `nvidia-smi -q | grep -E "ClusterUUID|CliqueId"`.
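If a node is missing from the table above, a label-absence selector narrows down which nodes lack the clique label; a small sketch:

```bash
# List nodes that do NOT carry the clique label (ideally, no GPU node shows up).
kubectl get nodes -l '!nvidia.com/gpu.clique'
```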
Run a simple test to validate that IMEX daemons are started and IMEX channels are injected:
```bash
cat <<EOF > imex-channel-injection.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: imex-channel-injection
spec:
  numNodes: 1
  channel:
    resourceClaimTemplate:
      name: imex-channel-0
---
apiVersion: v1
kind: Pod
metadata:
  name: imex-channel-injection
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: nvidia.com/gpu.clique
            operator: Exists
  containers:
  - name: ctr
    image: ubuntu:22.04
    command: ["bash", "-c"]
    args: ["ls -la /dev/nvidia-caps-imex-channels; trap 'exit 0' TERM; sleep 9999 & wait"]
    resources:
      claims:
      - name: imex-channel-0
  resourceClaims:
  - name: imex-channel-0
    resourceClaimTemplateName: imex-channel-0
EOF
```
```console
$ kubectl apply -f imex-channel-injection.yaml
computedomain.resource.nvidia.com/imex-channel-injection created
pod/imex-channel-injection created
```
```console
$ kubectl get pods
NAME                     READY   STATUS    RESTARTS   AGE
imex-channel-injection   1/1     Running   0          3s
```
```console
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                 READY   STATUS    RESTARTS   AGE
imex-channel-injection-6k9sx-ffgpf   1/1     Running   0          3s
```
```console
$ kubectl logs imex-channel-injection
total 0
drwxr-xr-x 2 root root     60 Feb 19 10:43 .
drwxr-xr-x 6 root root    380 Feb 19 10:43 ..
crw-rw-rw- 1 root root 507, 0 Feb 19 10:43 channel0
```
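The `channel0` entry is the injected IMEX channel device. As an additional cross-check directly on the GPU node (an assumption about driver behavior, not something shown on this page): the NVIDIA driver registers a dedicated character-device class for IMEX channels, and its major number should match the one in the pod's listing (507 above).

```bash
# On the GPU node: look up the major number of the IMEX channel device class.
grep nvidia-caps-imex-channels /proc/devices
```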
```console
$ kubectl logs -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain --tail=-1
I0731 14:57:34.920143       1 main.go:176] config: &{gb-nvl-043-compute07 e273cacb-141a-478b-9c24-263c784026b9 imex-channel-injection default 6a130f54-faaa-4b8f-847f-be44ab70f917.32766 192.168.34.137}
[...]
I0731 14:57:34.920644       1 process.go:152] Start watchdog
I0731 14:57:34.920685       1 main.go:233] wait for nodes update
[...]
I0731 14:57:34.926056       1 reflector.go:436] "Caches populated" type="*v1beta1.ComputeDomain" reflector="pkg/nvidia.com/informers/externalversions/factory.go:141"
[...]
I0731 14:57:35.024250       1 round_trippers.go:632] "Response" verb="PUT" url="https://10.96.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains/imex-channel-injection/status" status="200 OK" milliseconds=2
I0731 14:57:35.024442       1 computedomain.go:214] IP set changed: previous: map[]; new: map[192.168.34.137:{}]
I0731 14:57:35.024601       1 main.go:331] Current /etc/nvidia-imex/nodes_config.cfg:
192.168.34.137
I0731 14:57:35.024610       1 main.go:243] Got update, (re)start IMEX daemon
I0731 14:57:35.024616       1 process.go:67] Start: /usr/bin/nvidia-imex -c /etc/nvidia-imex/config.cfg
I0731 14:57:35.024964       1 process.go:92] Started process with pid 47
I0731 14:57:35.024970       1 main.go:233] wait for nodes update
I0731 14:57:35.029705       1 computedomain.go:218] IP set did not change
WARNING: failed to open IMEX log file errno = No such file or directory
INFO: using stderr for IMEX logging
IMEX Log initializing at: 7/31/2025 14:57:35.030
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX version 570.133.20 is running with the following configuration options
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging level = 4
[Jul 31 2025 14:57:35] [INFO] [tid 47] Logging file name/path =
[Jul 31 2025 14:57:35] [INFO] [tid 47] Append to log file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] Max Log file size = 1024 (MBs)
[Jul 31 2025 14:57:35] [INFO] [tid 47] Use Syslog file = 0
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX Library communication bind interface =
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX library communication bind port = 50000
[Jul 31 2025 14:57:35] [INFO] [tid 47] Identified this node as ID 0, using bind IP of '192.168.34.137', and network interface of eth0
[Jul 31 2025 14:57:35] [INFO] [tid 47] nvidia-imex persistence file /var/run/nvidia-imex/persist.dat does not exist. Assuming no previous importers.
[Jul 31 2025 14:57:35] [INFO] [tid 47] NvGpu Library version matched with GPU Driver version
[Jul 31 2025 14:57:35] [INFO] [tid 71] Started processing of incoming messages.
[...]
[Jul 31 2025 14:57:35] [INFO] [tid 47] Creating gRPC channels to all peers (nPeers = 1).
[Jul 31 2025 14:57:35] [INFO] [tid 74] Started processing of incoming messages.
[Jul 31 2025 14:57:35] [INFO] [tid 47] IMEX_WAIT_FOR_QUORUM != FULL, continuing initialization without waiting for connections to all nodes.
[Jul 31 2025 14:57:35] [INFO] [tid 75] Connection established to node 0 with ip address 192.168.34.137. Number of times connected: 1
[Jul 31 2025 14:57:35] [INFO] [tid 47] GPU event successfully subscribed
I0731 14:57:36.823729       1 computedomain.go:218] IP set did not change
[...]
```
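The `ComputeDomain` object itself can also be inspected; its status section (exact fields may vary by driver version, so this is only a sketch) should reflect the state of the domain:

```bash
# Show the ComputeDomain object, including its status section.
kubectl get computedomains.resource.nvidia.com imex-channel-injection -o yaml
```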
Clean up:
```console
$ kubectl delete -f imex-channel-injection.yaml
computedomain.resource.nvidia.com "imex-channel-injection" deleted
pod "imex-channel-injection" deleted
```
Next, run a two-node `nvbandwidth` test that consumes four GPUs on each node. It is orchestrated as an `MPIJob`, which requires the MPI Operator to be installed:

```console
kubectl create -f https://github.com/kubeflow/mpi-operator/releases/download/v0.6.0/mpi-operator.yaml
```
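Before proceeding, it is worth confirming that the operator is up (the `mpi-operator` namespace is what the upstream manifest creates; adjust if you installed it differently):

```bash
# Expect a Running mpi-operator pod.
kubectl get pods -n mpi-operator
```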
```bash
cat <<EOF > nvbandwidth-test-job.yaml
---
apiVersion: resource.nvidia.com/v1beta1
kind: ComputeDomain
metadata:
  name: nvbandwidth-test-compute-domain
spec:
  numNodes: 2
  channel:
    resourceClaimTemplate:
      name: nvbandwidth-test-compute-domain-channel
---
apiVersion: kubeflow.org/v2beta1
kind: MPIJob
metadata:
  name: nvbandwidth-test
spec:
  slotsPerWorker: 4
  launcherCreationPolicy: WaitForWorkersReady
  runPolicy:
    cleanPodPolicy: Running
  sshAuthMountPath: /home/mpiuser/.ssh
  mpiReplicaSpecs:
    Launcher:
      replicas: 1
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-launcher
        spec:
          affinity:
            nodeAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                nodeSelectorTerms:
                - matchExpressions:
                  - key: node-role.kubernetes.io/control-plane
                    operator: Exists
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-launcher
            securityContext:
              runAsUser: 1000
            command:
            - mpirun
            args:
            - --bind-to
            - core
            - --map-by
            - ppr:4:node
            - -np
            - "8"
            - --report-bindings
            - -q
            - nvbandwidth
            - -t
            - multinode_device_to_device_memcpy_read_ce
    Worker:
      replicas: 2
      template:
        metadata:
          labels:
            nvbandwidth-test-replica: mpi-worker
        spec:
          affinity:
            podAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
              - labelSelector:
                  matchExpressions:
                  - key: nvbandwidth-test-replica
                    operator: In
                    values:
                    - mpi-worker
                topologyKey: nvidia.com/gpu.clique
          containers:
          - image: ghcr.io/nvidia/k8s-samples:nvbandwidth-v0.7-8d103163
            name: mpi-worker
            securityContext:
              runAsUser: 1000
            command:
            - /usr/sbin/sshd
            args:
            - -De
            - -f
            - /home/mpiuser/.sshd_config
            resources:
              limits:
                nvidia.com/gpu: 4
              claims:
              - name: compute-domain-channel
          resourceClaims:
          - name: compute-domain-channel
            resourceClaimTemplateName: nvbandwidth-test-compute-domain-channel
EOF
```
```console
$ kubectl apply -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com/nvbandwidth-test-compute-domain created
mpijob.kubeflow.org/nvbandwidth-test created
```
```console
$ kubectl get pods
NAME                              READY   STATUS    RESTARTS   AGE
nvbandwidth-test-launcher-lzv84   1/1     Running   0          3s
nvbandwidth-test-worker-0         1/1     Running   0          15s
nvbandwidth-test-worker-1         1/1     Running   0          15s
```
```console
$ kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
NAME                                          READY   STATUS    RESTARTS   AGE
nvbandwidth-test-compute-domain-ht24d-9jhmj   1/1     Running   0          20s
nvbandwidth-test-compute-domain-ht24d-rcn2c   1/1     Running   0          20s
```
```console
$ kubectl logs --tail=-1 -l job-name=nvbandwidth-test-launcher
Warning: Permanently added '[nvbandwidth-test-worker-0.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
Warning: Permanently added '[nvbandwidth-test-worker-1.nvbandwidth-test.default.svc]:2222' (ECDSA) to the list of known hosts.
[nvbandwidth-test-worker-0:00025] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 2 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-0:00025] MCW rank 3 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 4 bound to socket 0[core 0[hwt 0]]: [B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 5 bound to socket 0[core 1[hwt 0]]: [./B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
[nvbandwidth-test-worker-1:00025] MCW rank 7 bound to socket 0[core 3[hwt 0]]: [./././B/./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.][./././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././././.]
nvbandwidth Version: v0.7
Built from Git version: v0.7
MPI version: Open MPI v4.1.4, package: Debian OpenMPI, ident: 4.1.4, repo rev: v4.1.4, May 26, 2022
CUDA Runtime Version: 12080
CUDA Driver Version: 12080
Driver Version: 570.124.06

Process 0 (nvbandwidth-test-worker-0): device 0: HGX GB200 (00000008:01:00)
Process 1 (nvbandwidth-test-worker-0): device 1: HGX GB200 (00000009:01:00)
Process 2 (nvbandwidth-test-worker-0): device 2: HGX GB200 (00000018:01:00)
Process 3 (nvbandwidth-test-worker-0): device 3: HGX GB200 (00000019:01:00)
Process 4 (nvbandwidth-test-worker-1): device 0: HGX GB200 (00000008:01:00)
Process 5 (nvbandwidth-test-worker-1): device 1: HGX GB200 (00000009:01:00)
Process 6 (nvbandwidth-test-worker-1): device 2: HGX GB200 (00000018:01:00)
Process 7 (nvbandwidth-test-worker-1): device 3: HGX GB200 (00000019:01:00)

Running multinode_device_to_device_memcpy_read_ce.
memcpy CE GPU(row) -> GPU(column) bandwidth (GB/s)
          0         1         2         3         4         5         6         7
0       N/A    798.02    798.25    798.02    798.02    797.88    797.73    797.95
1    798.10       N/A    797.80    798.02    798.02    798.25    797.88    798.02
2    797.95    797.95       N/A    797.73    797.80    797.95    797.95    797.65
3    798.10    798.02    797.95       N/A    798.02    798.10    797.88    797.73
4    797.80    798.02    798.02    798.02       N/A    797.95    797.80    798.02
5    797.80    797.95    798.10    798.10    797.95       N/A    797.95    797.88
6    797.73    797.95    798.10    798.02    797.95    797.88       N/A    797.80
7    797.88    798.02    797.95    798.02    797.88    797.95    798.02       N/A

SUM multinode_device_to_device_memcpy_read_ce 44685.29

NOTE: The reported results may not reflect the full capabilities of the platform.
Performance can vary with software drivers, hardware clocks, and system topology.
```
Clean up:

```console
$ kubectl delete -f nvbandwidth-test-job.yaml
computedomain.resource.nvidia.com "nvbandwidth-test-compute-domain" deleted
mpijob.kubeflow.org "nvbandwidth-test" deleted
```
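Afterwards, the ComputeDomain daemon pods should be gone as well; a quick check, reusing the label selector from above:

```bash
# Expect 'No resources found' once the ComputeDomain has been torn down.
kubectl get pods -n nvidia-dra-driver-gpu -l resource.nvidia.com/computeDomain
```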