
Conversation

@guptaNswati guptaNswati commented Sep 6, 2025

Addressing #360 to add preliminary health check

copy-pr-bot bot commented Sep 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@guptaNswati (Contributor Author)

Current log:

Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0905 23:51:54.358343       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
W0905 23:51:54.358366       1 device_state.go:619] Attempted to mark unknown device as unhealthy: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0905 23:51:54.358482       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

resourceclaim status update is still broken.

@Copilot Copilot AI left a comment

Pull Request Overview

Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.

  • Introduces device health status tracking with Healthy/Unhealthy states
  • Implements NVML event-based health monitoring for GPU devices
  • Updates resource claim status to reflect device health conditions
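As a rough sketch of the Healthy/Unhealthy tracking summarized above: the type and method names below follow the file summary (AllocatableDevice, IsHealthy), but the details are illustrative, not the PR's exact code.

```go
package main

import "fmt"

// DeviceHealthStatus models the two states the PR overview describes.
type DeviceHealthStatus string

const (
	DeviceHealthy   DeviceHealthStatus = "Healthy"
	DeviceUnhealthy DeviceHealthStatus = "Unhealthy"
)

// AllocatableDevice is a minimal stand-in for the driver's allocatable
// device type, carrying only the health field relevant here.
type AllocatableDevice struct {
	UUID   string
	Health DeviceHealthStatus
}

// IsHealthy reports whether the device is currently marked Healthy.
func (d *AllocatableDevice) IsHealthy() bool {
	return d.Health == DeviceHealthy
}

// MarkUnhealthy flips the device to the Unhealthy state.
func (d *AllocatableDevice) MarkUnhealthy() {
	d.Health = DeviceUnhealthy
}

func main() {
	dev := &AllocatableDevice{UUID: "GPU-example", Health: DeviceHealthy}
	fmt.Println(dev.UUID, "healthy:", dev.IsHealthy())
	dev.MarkUnhealthy()
	fmt.Println(dev.UUID, "healthy:", dev.IsHealthy())
}
```

When a device flips to Unhealthy, the health monitor can filter it out of (or taint it in) the published ResourceSlice, which is what the debug logs in this thread exercise.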

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file:

  • cmd/gpu-kubelet-plugin/nvlib.go: Initialize all devices with Healthy status
  • cmd/gpu-kubelet-plugin/driver.go: Add device health monitor initialization and health notification handling
  • cmd/gpu-kubelet-plugin/device_state.go: Add device health status updates and resource claim status reporting
  • cmd/gpu-kubelet-plugin/device_health.go: New file implementing NVML event-based health monitoring
  • cmd/gpu-kubelet-plugin/allocatable.go: Add health status field and methods to AllocatableDevice


if err != nil {
return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
}
klog.Info("[SWATI DEBUGS] Started device health monitor")

Copilot AI Sep 8, 2025


There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.

Suggested change
klog.Info("[SWATI DEBUGS] Started device health monitor")
klog.Info("[SWATI DEBUG] Started device health monitor")


var resourceSlice resourceslice.Slice
for _, dev := range d.state.allocatable {
if dev.IsHealthy() {
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)

Copilot AI Sep 8, 2025


There's a typo in the log message: 'resoureslice' should be 'resourceslice'.

Suggested change
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)


}

// Republish updated resources
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")

Copilot AI Sep 8, 2025


There's a typo in the log message: 'rebulishing' should be 'republishing'.

Suggested change
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")


Config: configapi.DefaultMigDeviceConfig(),
})

// Swati: Add resourceclaim status update

Copilot AI Sep 8, 2025


The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.

Suggested change
// Swati: Add resourceclaim status update
// Add resource claim status update to track device health.


Comment on lines 305 to 306
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")

Copilot AI Sep 8, 2025


The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.

Suggested change
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
// Add health status to device allocation result


Comment on lines 43 to 50
//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}


Copilot AI Sep 8, 2025


Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.

Suggested change
//defer nvdevlib.alwaysShutdown()
//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}


}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
klog.Info("[SWATI DEBUG] initializing NVML..")

Copilot AI Sep 8, 2025


The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).

Suggested change
klog.Info("[SWATI DEBUG] initializing NVML..")
klog.Info("[SWATI DEBUG] initializing NVML")


@guptaNswati (Contributor Author)

More logs after fixing the republishing of the resourceslice when an unhealthy GPU is found:

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-ndv47 -n nvidia-dra-driver-gpu  -c gpus | grep unhealth 
I0908 23:07:58.793308       1 device_health.go:173] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0908 23:07:58.793342       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0908 23:07:58.793371       1 device_state.go:636] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
E0908 23:07:58.793381       1 driver.go:220] device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 with uuid:&{%!s(*main.GpuInfo=&{GPU-a4f34abc-7715-3560-dcea-7238b9611a45 0 0 false 102625181696 NVIDIA GH200 96GB HBM3 Nvidia Hopper 9.0 570.86.15 12.8 0009:01:00.0 {resource.kubernetes.io/pcieRoot {<nil> <nil> 0x4000328130 <nil>}} [0x40008965a0 0x40008965d0 0x4000896600 0x4000896630 0x4000896660 0x4000896690 0x4000896840 0x40008972f0 0x4000897530 0x4000897560]}) %!s(*main.MigDeviceInfo=<nil>) Unhealthy} is unhealthy
I0908 23:07:58.793531       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

 "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "b5a8727d-b8cd-4073-8817-d3e31147a8bd",
	-  "resourceVersion": "50777207",
	-  "generation": 1,
	+  "resourceVersion": "50777758",
	+  "generation": 2,
	   "creationTimestamp": "2025-09-08T23:05:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-09-08T23:05:30Z",
	+    "time": "2025-09-08T23:07:58Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {

$ kubectl get resourceslice  sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5  -o yaml 
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-09-08T23:05:30Z"
  generateName: sc-starwars-mab9-b00-gpu.nvidia.com-
  generation: 2
  name: sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: sc-starwars-mab9-b00
    uid: 80ede971-5b44-4a12-a951-a1bebe79209d
  resourceVersion: "50777758"
  uid: b5a8727d-b8cd-4073-8817-d3e31147a8bd
spec:
  devices:
  - basic:
      attributes:
        architecture:
          string: Hopper
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 9.0.0
        cudaDriverVersion:
          version: 12.8.0
        driverVersion:
          version: 570.86.15
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0019:01:00.0"
        productName:
          string: NVIDIA GH200 96GB HBM3
        resource.kubernetes.io/pcieRoot:
          string: pci0019:00
        type:
          string: gpu
        uuid:
          string: GPU-9e6df7cb-64d4-5e53-2b1d-cee9e58aeb94
      capacity:
        memory:
          value: 97871Mi
    name: gpu-1
  driver: gpu.nvidia.com
  nodeName: sc-starwars-mab9-b00
  pool:
    generation: 1
    name: sc-starwars-mab9-b00
    resourceSliceCount: 1


guptaNswati commented Sep 8, 2025

Need to fix the resourceclaim status update: not using the right client API.

Device gpu-0 is healthy, marking as ready
E0908 23:06:44.085161       1 device_state.go:346] failed to update status for claim gpu-test1/pod1-gpu-zc6s4: not implemented in k8s.io/dynamic-resource-allocation/client

failed to update status for claim gpu-test1/pod2-gpu-q45rg: not implemented in k8s.io/dynamic-resource-allocation/client


klueska commented Sep 9, 2025

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.
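For context on the taint suggestion: a tainted device entry in a ResourceSlice might look roughly like the following (this relies on the Kubernetes DRA device-taints feature, alpha behind the DRADeviceTaints feature gate; the taint key is made up for illustration, and the exact field placement varies across resource.k8s.io API versions).

```yaml
# Sketch only: the unhealthy GPU stays listed but is tainted so the
# scheduler avoids it. Key and timestamp are illustrative.
devices:
- name: gpu-0
  basic:
    taints:
    - key: gpu.nvidia.com/unhealthy
      effect: NoSchedule
      timeAdded: "2025-09-08T23:07:58Z"
```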


klueska commented Sep 9, 2025

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

@guptaNswati (Contributor Author)

> Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

Yes, this is just to test the e2e flow (which is to report any health events; the example action is to republish the slice from the driver). This is just to see if I have set everything up correctly.

@guptaNswati (Contributor Author)

> Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

Not the resourceslice, but update the resourceclaim status, similar to this: https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266


guptaNswati commented Sep 9, 2025

Got the resourceclaim status to be updated:

 Device gpu-1 is healthy, marking as ready
I0909 21:53:04.772855       1 round_trippers.go:632] "Response" logger="dra" requestID=7 method="/k8s.io.kubelet.pkg.apis.dra.v1beta1.DRAPlugin/NodePrepareResources" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod1-gpu-rrkx5/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=4
I0909 21:53:04.772960       1 device_state.go:348] updated device status for claim gpu-test1/pod1-gpu-rrkx5

  devices:
  - conditions:
    - lastTransitionTime: "2025-09-09T21:53:04Z"
      message: Device is healthy and ready
      reason: Healthy
      status: "True"
      type: Ready
    data: null
    device: gpu-1
    driver: gpu.nvidia.com
    pool: sc-starwars-mab9-b00

Signed-off-by: Swati Gupta <[email protected]>
@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Sep 11, 2025
Signed-off-by: Swati Gupta <[email protected]>

guptaNswati commented Sep 19, 2025

Updated action on health event: update device condition to unhealthy in resourceclaim status

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-m8xsz -n nvidia-dra-driver-gpu -c gpus

1 device_health.go:167] Processing event {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
W0919 02:59:06.452857       1 device_health.go:170] Critical XID error detected on device: {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I0919 02:59:06.452874       1 device_health.go:200] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0919 02:59:06.452905       1 driver.go:212] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0919 02:59:06.452918       1 device_state.go:617] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
I0919 02:59:06.453547       1 driver.go:298] found matching device to claim: gpu-0
I0919 02:59:06.453556       1 driver.go:312] Found it! Return the result object: gpu-0 and the claim UID: 590e5164-7511-418d-8b8b-77ae0e414dc6
I0919 02:59:06.456314       1 round_trippers.go:632] "Response" verb="GET" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceclaims" status="200 OK" milliseconds=2
I0919 02:59:06.456538       1 driver.go:335] found ResourceClaim with UID 590e5164-7511-418d-8b8b-77ae0e414dc6 not found
I0919 02:59:06.456548       1 driver.go:345] Applying 'Ready=False' condition for device 'gpu-0' in ResourceClaim 'gpu-test1/pod2-gpu-l8rrx'
I0919 02:59:06.460688       1 round_trippers.go:632] "Response" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod2-gpu-l8rrx/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=3

$ kubectl get resourceclaim -n gpu-test1  -o yaml | grep -A 8 condition
    - conditions:
      - lastTransitionTime: "2025-09-09T21:53:04Z"
        message: Device is healthy and ready
        reason: Healthy
        status: "True"
        type: Ready
      data: null
      device: gpu-1
      driver: gpu.nvidia.com
--
    - conditions:
      - lastTransitionTime: "2025-09-19T02:59:06Z"
        message: Device gpu-0 has become unhealthy.
        reason: DeviceUnhealthy
        status: "False"
        type: Ready
      data: null
      device: gpu-0
      driver: gpu.nvidia.com

@guptaNswati guptaNswati changed the title Draft: Gpu health check Gpu health check Sep 19, 2025
@guptaNswati (Contributor Author)

@ArangoGutierrez @klueska can I get a preliminary review on this? There are still some tasks, but it's in a working state.

@guptaNswati guptaNswati requested a review from klueska September 19, 2025 04:36
@guptaNswati (Contributor Author)

Need to check how to enable the DeviceHealth feature gate from Helm.


dims commented Sep 22, 2025

/subscribe

Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati (Contributor Author)

The code is ready for review. I still need to test the MIG flow, but it works for full GPUs. Some minor refactoring is needed for the device status updates, for which I have added TODOs.

Comment on lines +109 to +116
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
}


Can we make it look like the other two checks above? It feels odd otherwise. Also, the continue is missing (which is present in the other two checks above).

Suggested change
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
}
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret != nvml.SUCCESS {
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
continue
}

}

func (m *deviceHealthMonitor) registerDevicesForEvents() {
eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)

Below, on line 169, we skip nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError using

  if event.EventType != nvml.EventTypeXidCriticalError {

So should we drop those two from the eventMask here?


continue
}

if event.EventType != nvml.EventTypeXidCriticalError {

we are registering for nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError also above... see comment above

case device, ok := <-d.deviceHealthMonitor.Unhealthy():
if !ok {
klog.V(6).Info("Health monitor channel closed")
return


When does this channel close? It looks like we stop processing device health notifications here as well. Do we want this as a Warning?

Destination: &flags.healthcheckPort,
EnvVars: []string{"HEALTHCHECK_PORT"},
},
&cli.StringFlag{

Comma separation will still work, I think!

@klueska klueska moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 23, 2025
@guptaNswati (Contributor Author)

Test of a skipped XID:

$ helm upgrade nvidia-dra-driver-gpu  deployments/helm/nvidia-dra-driver-gpu --set featureGates.DeviceHealthCheck=true --set kubeletPlugin.gpus.additionalXidsToIgnore="43"

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-qzplg  -n nvidia-dra-driver-gpu -c gpus | grep event
I0924 18:24:31.947121       1 device_health.go:58] creating NVML events for device health monitor
I0924 18:24:31.947143       1 device_health.go:68] registering NVML events for device health monitor
I0924 18:28:04.610817       1 device_health.go:175] Skipping event {Device:{Handle:0xe44bad2ffef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
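The skip shown in the log above (EventData 43 matching additionalXidsToIgnore=43) boils down to a set-membership check on the event's XID. A minimal sketch, assuming the flag value arrives as a comma-separated string (the function names here are illustrative, not the PR's actual helpers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIgnoredXids parses a comma-separated flag value such as "43,79"
// into a lookup set; malformed entries are silently dropped.
func parseIgnoredXids(s string) map[uint64]bool {
	ignored := make(map[uint64]bool)
	for _, f := range strings.Split(s, ",") {
		if x, err := strconv.ParseUint(strings.TrimSpace(f), 10, 64); err == nil {
			ignored[x] = true
		}
	}
	return ignored
}

// shouldSkipXid reports whether the XID carried in the NVML event's
// data field is on the ignore list.
func shouldSkipXid(eventData uint64, ignored map[uint64]bool) bool {
	return ignored[eventData]
}

func main() {
	ignored := parseIgnoredXids("43")
	fmt.Println("skip xid 43:", shouldSkipXid(43, ignored))
	fmt.Println("skip xid 79:", shouldSkipXid(79, ignored))
}
```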

Signed-off-by: Swati Gupta <[email protected]>