
Conversation

@guptaNswati guptaNswati commented Sep 6, 2025

Addressing #360 to add preliminary health check

copy-pr-bot bot commented Sep 6, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@guptaNswati (Contributor Author)

Current log:

Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0905 23:51:54.358343       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
W0905 23:51:54.358366       1 device_state.go:619] Attempted to mark unknown device as unhealthy: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0905 23:51:54.358482       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

resourceclaim status update is still broken.

@Copilot Copilot AI left a comment

Pull Request Overview

Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.

  • Introduces device health status tracking with Healthy/Unhealthy states
  • Implements NVML event-based health monitoring for GPU devices
  • Updates resource claim status to reflect device health conditions
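As a rough sketch of the Healthy/Unhealthy tracking summarized above: the type and method names below follow the file summary (AllocatableDevice, IsHealthy), but the details are illustrative, not the PR's exact code.

```go
package main

import "fmt"

// DeviceHealthStatus models the two states the PR overview describes.
type DeviceHealthStatus string

const (
	DeviceHealthy   DeviceHealthStatus = "Healthy"
	DeviceUnhealthy DeviceHealthStatus = "Unhealthy"
)

// AllocatableDevice is a minimal stand-in for the driver's allocatable
// device type, carrying only the health field relevant here.
type AllocatableDevice struct {
	UUID   string
	Health DeviceHealthStatus
}

// IsHealthy reports whether the device is currently marked Healthy.
func (d *AllocatableDevice) IsHealthy() bool {
	return d.Health == DeviceHealthy
}

// MarkUnhealthy flips the device to the Unhealthy state.
func (d *AllocatableDevice) MarkUnhealthy() {
	d.Health = DeviceUnhealthy
}

func main() {
	dev := &AllocatableDevice{UUID: "GPU-example", Health: DeviceHealthy}
	fmt.Println(dev.UUID, "healthy:", dev.IsHealthy())
	dev.MarkUnhealthy()
	fmt.Println(dev.UUID, "healthy:", dev.IsHealthy())
}
```

When a device flips to Unhealthy, the health monitor can filter it out of (or taint it in) the published ResourceSlice, which is what the debug logs in this thread exercise.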

Reviewed Changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Show a summary per file:

  • cmd/gpu-kubelet-plugin/nvlib.go: Initialize all devices with Healthy status
  • cmd/gpu-kubelet-plugin/driver.go: Add device health monitor initialization and health notification handling
  • cmd/gpu-kubelet-plugin/device_state.go: Add device health status updates and resource claim status reporting
  • cmd/gpu-kubelet-plugin/device_health.go: New file implementing NVML event-based health monitoring
  • cmd/gpu-kubelet-plugin/allocatable.go: Add health status field and methods to AllocatableDevice


if err != nil {
return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
}
klog.Info("[SWATI DEBUGS] Started device health monitor")

Copilot AI Sep 8, 2025


There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.

Suggested change
klog.Info("[SWATI DEBUGS] Started device health monitor")
klog.Info("[SWATI DEBUG] Started device health monitor")


var resourceSlice resourceslice.Slice
for _, dev := range d.state.allocatable {
if dev.IsHealthy() {
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)

Copilot AI Sep 8, 2025


There's a typo in the log message: 'resoureslice' should be 'resourceslice'.

Suggested change
klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)


}

// Republish updated resources
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")

Copilot AI Sep 8, 2025


There's a typo in the log message: 'rebulishing' should be 'republishing'.

Suggested change
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")


Config: configapi.DefaultMigDeviceConfig(),
})

// Swati: Add resourceclaim status update

Copilot AI Sep 8, 2025


The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.

Suggested change
// Swati: Add resourceclaim status update
// Add resource claim status update to track device health.


Comment on lines 305 to 306
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")

Copilot AI Sep 8, 2025


The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.

Suggested change
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
// Add health status to device allocation result


Comment on lines 43 to 50
//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}


Copilot AI Sep 8, 2025


Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.

Suggested change
//defer nvdevlib.alwaysShutdown()
//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
// return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}


}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
klog.Info("[SWATI DEBUG] initializing NVML..")

Copilot AI Sep 8, 2025


The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).

Suggested change
klog.Info("[SWATI DEBUG] initializing NVML..")
klog.Info("[SWATI DEBUG] initializing NVML")


@guptaNswati (Contributor Author)

More logs after fixing the republishing of the resourceslice when an unhealthy GPU is found:

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-ndv47 -n nvidia-dra-driver-gpu  -c gpus | grep unhealth 
I0908 23:07:58.793308       1 device_health.go:173] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0908 23:07:58.793342       1 driver.go:208] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0908 23:07:58.793371       1 device_state.go:636] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
E0908 23:07:58.793381       1 driver.go:220] device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 with uuid:&{%!s(*main.GpuInfo=&{GPU-a4f34abc-7715-3560-dcea-7238b9611a45 0 0 false 102625181696 NVIDIA GH200 96GB HBM3 Nvidia Hopper 9.0 570.86.15 12.8 0009:01:00.0 {resource.kubernetes.io/pcieRoot {<nil> <nil> 0x4000328130 <nil>}} [0x40008965a0 0x40008965d0 0x4000896600 0x4000896630 0x4000896660 0x4000896690 0x4000896840 0x40008972f0 0x4000897530 0x4000897560]}) %!s(*main.MigDeviceInfo=<nil>) Unhealthy} is unhealthy
I0908 23:07:58.793531       1 driver.go:235] Successfully republished resources after marking device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy

 "ResourceSlice update" logger="ResourceSlice controller" slice="sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5" diff=<
	@@ -3,8 +3,8 @@
	   "name": "sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5",
	   "generateName": "sc-starwars-mab9-b00-gpu.nvidia.com-",
	   "uid": "b5a8727d-b8cd-4073-8817-d3e31147a8bd",
	-  "resourceVersion": "50777207",
	-  "generation": 1,
	+  "resourceVersion": "50777758",
	+  "generation": 2,
	   "creationTimestamp": "2025-09-08T23:05:30Z",
	   "ownerReferences": [
	    {
	@@ -20,7 +20,7 @@
	     "manager": "gpu-kubelet-plugin",
	     "operation": "Update",
	     "apiVersion": "resource.k8s.io/v1beta1",
	-    "time": "2025-09-08T23:05:30Z",
	+    "time": "2025-09-08T23:07:58Z",
	     "fieldsType": "FieldsV1",
	     "fieldsV1": {
	      "f:metadata": {

$ kubectl get resourceslice  sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5  -o yaml 
apiVersion: resource.k8s.io/v1beta1
kind: ResourceSlice
metadata:
  creationTimestamp: "2025-09-08T23:05:30Z"
  generateName: sc-starwars-mab9-b00-gpu.nvidia.com-
  generation: 2
  name: sc-starwars-mab9-b00-gpu.nvidia.com-dlzq5
  ownerReferences:
  - apiVersion: v1
    controller: true
    kind: Node
    name: sc-starwars-mab9-b00
    uid: 80ede971-5b44-4a12-a951-a1bebe79209d
  resourceVersion: "50777758"
  uid: b5a8727d-b8cd-4073-8817-d3e31147a8bd
spec:
  devices:
  - basic:
      attributes:
        architecture:
          string: Hopper
        brand:
          string: Nvidia
        cudaComputeCapability:
          version: 9.0.0
        cudaDriverVersion:
          version: 12.8.0
        driverVersion:
          version: 570.86.15
        index:
          int: 1
        minor:
          int: 1
        pcieBusID:
          string: "0019:01:00.0"
        productName:
          string: NVIDIA GH200 96GB HBM3
        resource.kubernetes.io/pcieRoot:
          string: pci0019:00
        type:
          string: gpu
        uuid:
          string: GPU-9e6df7cb-64d4-5e53-2b1d-cee9e58aeb94
      capacity:
        memory:
          value: 97871Mi
    name: gpu-1
  driver: gpu.nvidia.com
  nodeName: sc-starwars-mab9-b00
  pool:
    generation: 1
    name: sc-starwars-mab9-b00
    resourceSliceCount: 1


guptaNswati commented Sep 8, 2025

Need to fix the resourceclaim status update: not using the right client API.

Device gpu-0 is healthy, marking as ready
E0908 23:06:44.085161       1 device_state.go:346] failed to update status for claim gpu-test1/pod1-gpu-zc6s4: not implemented in k8s.io/dynamic-resource-allocation/client

failed to update status for claim gpu-test1/pod2-gpu-q45rg: not implemented in k8s.io/dynamic-resource-allocation/client


klueska commented Sep 9, 2025

Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.
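For context on the taint suggestion: a tainted device entry in a ResourceSlice might look roughly like the following (this relies on the Kubernetes DRA device-taints feature, alpha behind the DRADeviceTaints feature gate; the taint key is made up for illustration, and the exact field placement varies across resource.k8s.io API versions).

```yaml
# Sketch only: the unhealthy GPU stays listed but is tainted so the
# scheduler avoids it. Key and timestamp are illustrative.
devices:
- name: gpu-0
  basic:
    taints:
    - key: gpu.nvidia.com/unhealthy
      effect: NoSchedule
      timeAdded: "2025-09-08T23:07:58Z"
```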


klueska commented Sep 9, 2025

Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

@guptaNswati (Contributor Author)

> Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed but a device taint should be added for it so the scheduler doesn't schedule it.

Yes, this is just to test the e2e flow (which is to report any health events; the example action is to republish the slice from the driver). This is just to see if I have set everything up correctly.

@guptaNswati (Contributor Author)

> Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far as we haven't had a need to add a status yet.

Not the resourceslice, but update the resourceclaim status, similar to this: https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266


guptaNswati commented Sep 9, 2025

Got the resourceclaim status to be updated:

 Device gpu-1 is healthy, marking as ready
I0909 21:53:04.772855       1 round_trippers.go:632] "Response" logger="dra" requestID=7 method="/k8s.io.kubelet.pkg.apis.dra.v1beta1.DRAPlugin/NodePrepareResources" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod1-gpu-rrkx5/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=4
I0909 21:53:04.772960       1 device_state.go:348] updated device status for claim gpu-test1/pod1-gpu-rrkx5

  devices:
  - conditions:
    - lastTransitionTime: "2025-09-09T21:53:04Z"
      message: Device is healthy and ready
      reason: Healthy
      status: "True"
      type: Ready
    data: null
    device: gpu-1
    driver: gpu.nvidia.com
    pool: sc-starwars-mab9-b00

Signed-off-by: Swati Gupta <[email protected]>
@klueska klueska added the feature issue/PR that proposes a new feature or functionality label Sep 11, 2025
Signed-off-by: Swati Gupta <[email protected]>

guptaNswati commented Sep 19, 2025

Updated action on health event: update device condition to unhealthy in resourceclaim status

$ kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-m8xsz -n nvidia-dra-driver-gpu -c gpus

1 device_health.go:167] Processing event {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
W0919 02:59:06.452857       1 device_health.go:170] Critical XID error detected on device: {Device:{Handle:0xee82742dfef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
I0919 02:59:06.452874       1 device_health.go:200] Sending unhealthy notification for device GPU-a4f34abc-7715-3560-dcea-7238b9611a45 due to event type 8
W0919 02:59:06.452905       1 driver.go:212] Received unhealthy notification for device: GPU-a4f34abc-7715-3560-dcea-7238b9611a45
I0919 02:59:06.452918       1 device_state.go:617] Marked device:GPU-a4f34abc-7715-3560-dcea-7238b9611a45 unhealthy
I0919 02:59:06.453547       1 driver.go:298] found matching device to claim: gpu-0
I0919 02:59:06.453556       1 driver.go:312] Found it! Return the result object: gpu-0 and the claim UID: 590e5164-7511-418d-8b8b-77ae0e414dc6
I0919 02:59:06.456314       1 round_trippers.go:632] "Response" verb="GET" url="https://10.96.0.1:443/apis/resource.k8s.io/v1beta1/resourceclaims" status="200 OK" milliseconds=2
I0919 02:59:06.456538       1 driver.go:335] found ResourceClaim with UID 590e5164-7511-418d-8b8b-77ae0e414dc6 not found
I0919 02:59:06.456548       1 driver.go:345] Applying 'Ready=False' condition for device 'gpu-0' in ResourceClaim 'gpu-test1/pod2-gpu-l8rrx'
I0919 02:59:06.460688       1 round_trippers.go:632] "Response" verb="PATCH" url="https://x.x.x.x:443/apis/resource.k8s.io/v1beta1/namespaces/gpu-test1/resourceclaims/pod2-gpu-l8rrx/status?fieldManager=gpu.nvidia.com&force=true" status="200 OK" milliseconds=3

$ kubectl get resourceclaim -n gpu-test1  -o yaml | grep -A 8 condition
    - conditions:
      - lastTransitionTime: "2025-09-09T21:53:04Z"
        message: Device is healthy and ready
        reason: Healthy
        status: "True"
        type: Ready
      data: null
      device: gpu-1
      driver: gpu.nvidia.com
--
    - conditions:
      - lastTransitionTime: "2025-09-19T02:59:06Z"
        message: Device gpu-0 has become unhealthy.
        reason: DeviceUnhealthy
        status: "False"
        type: Ready
      data: null
      device: gpu-0
      driver: gpu.nvidia.com

@guptaNswati guptaNswati changed the title Draft: Gpu health check Gpu health check Sep 19, 2025
@guptaNswati (Contributor Author)

@ArangoGutierrez @klueska can I get a preliminary review on this? There are still some tasks, but it's in a working state.

@guptaNswati guptaNswati requested a review from klueska September 19, 2025 04:36
@guptaNswati (Contributor Author)

Need to check how to enable the DeviceHealth feature gate from Helm.


dims commented Sep 22, 2025

/subscribe

Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati (Contributor Author)

The code is ready for review. I still need to test the MIG flow, but it works for full GPUs. Some minor refactoring is needed for the device status updates, for which I have added TODOs.

Comment on lines +109 to +116
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
}


Can we make it look like the other two checks above? It feels odd otherwise. Also, the continue is missing (which is present in the other two checks above).

Suggested change
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
}
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret != nvml.SUCCESS {
if ret == nvml.ERROR_NOT_SUPPORTED {
klog.Warningf("Device %v is too old to support healthchecking.", u)
}
klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
m.unhealthy <- dev
continue
}

}

func (m *deviceHealthMonitor) registerDevicesForEvents() {
eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)

Below, on line 169, we skip nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError using

  if event.EventType != nvml.EventTypeXidCriticalError {

So should we drop those two from the eventMask here?


continue
}

if event.EventType != nvml.EventTypeXidCriticalError {

we are registering for nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError also above... see comment above

case device, ok := <-d.deviceHealthMonitor.Unhealthy():
if !ok {
klog.V(6).Info("Health monitor channel closed")
return


When does this channel close? It looks like we stop processing device health notifications here as well. Do we want this as a Warning?

Destination: &flags.healthcheckPort,
EnvVars: []string{"HEALTHCHECK_PORT"},
},
&cli.StringFlag{

Comma separation will still work, I think!

@klueska klueska moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 23, 2025
@guptaNswati (Contributor Author)

Test of a skipped XID:

$ helm upgrade nvidia-dra-driver-gpu  deployments/helm/nvidia-dra-driver-gpu --set featureGates.DeviceHealthCheck=true --set kubeletPlugin.gpus.additionalXidsToIgnore="43"

kubectl logs nvidia-dra-driver-gpu-kubelet-plugin-qzplg  -n nvidia-dra-driver-gpu -c gpus | grep event
I0924 18:24:31.947121       1 device_health.go:58] creating NVML events for device health monitor
I0924 18:24:31.947143       1 device_health.go:68] registering NVML events for device health monitor
I0924 18:28:04.610817       1 device_health.go:175] Skipping event {Device:{Handle:0xe44bad2ffef0} EventType:8 EventData:43 GpuInstanceId:4294967295 ComputeInstanceId:4294967295}
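The skip shown in the log above (EventData 43 matching additionalXidsToIgnore=43) boils down to a set-membership check on the event's XID. A minimal sketch, assuming the flag value arrives as a comma-separated string (the function names here are illustrative, not the PR's actual helpers):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseIgnoredXids parses a comma-separated flag value such as "43,79"
// into a lookup set; malformed entries are silently dropped.
func parseIgnoredXids(s string) map[uint64]bool {
	ignored := make(map[uint64]bool)
	for _, f := range strings.Split(s, ",") {
		if x, err := strconv.ParseUint(strings.TrimSpace(f), 10, 64); err == nil {
			ignored[x] = true
		}
	}
	return ignored
}

// shouldSkipXid reports whether the XID carried in the NVML event's
// data field is on the ignore list.
func shouldSkipXid(eventData uint64, ignored map[uint64]bool) bool {
	return ignored[eventData]
}

func main() {
	ignored := parseIgnoredXids("43")
	fmt.Println("skip xid 43:", shouldSkipXid(43, ignored))
	fmt.Println("skip xid 79:", shouldSkipXid(79, ignored))
}
```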

Signed-off-by: Swati Gupta <[email protected]>