Gpu health check #545
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
current log:
The resourceclaim status update is still broken.
Pull Request Overview
Adds preliminary GPU health monitoring functionality to detect and handle unhealthy GPU devices in the NVIDIA DRA driver. The implementation listens for NVML events (XID errors, ECC errors) and removes unhealthy devices from the allocatable pool.
- Introduces device health status tracking with `Healthy`/`Unhealthy` states
- Implements NVML event-based health monitoring for GPU devices
- Updates resource claim status to reflect device health conditions
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.
File | Description |
---|---|
cmd/gpu-kubelet-plugin/nvlib.go | Initialize all devices with Healthy status |
cmd/gpu-kubelet-plugin/driver.go | Add device health monitor initialization and health notification handling |
cmd/gpu-kubelet-plugin/device_state.go | Add device health status updates and resource claim status reporting |
cmd/gpu-kubelet-plugin/device_health.go | New file implementing NVML event-based health monitoring |
cmd/gpu-kubelet-plugin/allocatable.go | Add health status field and methods to AllocatableDevice |
cmd/gpu-kubelet-plugin/driver.go
Outdated
if err != nil {
	return nil, fmt.Errorf("start deviceHealthMonitor: %w", err)
}
klog.Info("[SWATI DEBUGS] Started device health monitor")
There's a typo in the log message: 'DEBUGS' should be 'DEBUG' to match the pattern used in other debug messages.
- klog.Info("[SWATI DEBUGS] Started device health monitor")
+ klog.Info("[SWATI DEBUG] Started device health monitor")
cmd/gpu-kubelet-plugin/driver.go
Outdated
var resourceSlice resourceslice.Slice
for _, dev := range d.state.allocatable {
	if dev.IsHealthy() {
		klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
There's a typo in the log message: 'resoureslice' should be 'resourceslice'.
- klog.Infof("[SWATI DEBUG] device is healthy, added to resoureslice: %v", dev)
+ klog.Infof("[SWATI DEBUG] device is healthy, added to resourceslice: %v", dev)
cmd/gpu-kubelet-plugin/driver.go
Outdated
}

// Republish updated resources
klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
There's a typo in the log message: 'rebulishing' should be 'republishing'.
- klog.Info("[SWATI DEBUG] rebulishing resourceslice with healthy devices")
+ klog.Info("[SWATI DEBUG] republishing resourceslice with healthy devices")
	Config: configapi.DefaultMigDeviceConfig(),
})

// Swati: Add resourceclaim status update
The comment should follow proper Go comment conventions and be more descriptive. Consider: '// Add resource claim status update to track device health'.
- // Swati: Add resourceclaim status update
+ // Add resource claim status update to track device health.
// Swati add health check
klog.Info("[SWATI DEBUG] adding device status")
The comment should follow proper Go comment conventions. Consider: '// Add health status to device allocation result'.
- // Swati add health check
- klog.Info("[SWATI DEBUG] adding device status")
+ // Add health status to device allocation result
//defer nvdevlib.alwaysShutdown()

//klog.Info("[SWATI DEBUG] getting all devices..")
//allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
//if err != nil {
//	return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
//}
Commented-out code should be removed. If this code might be needed later, consider documenting why it's commented out or remove it entirely.
- //defer nvdevlib.alwaysShutdown()
- //klog.Info("[SWATI DEBUG] getting all devices..")
- //allocatable, err := nvdevlib.enumerateAllPossibleDevices(config)
- //if err != nil {
- //	return nil, fmt.Errorf("error enumerating all possible devices: %w", err)
- //}
}

func newDeviceHealthMonitor(ctx context.Context, config *Config, allocatable AllocatableDevices, nvdevlib *deviceLib) (*deviceHealthMonitor, error) {
	klog.Info("[SWATI DEBUG] initializing NVML..")
The log message has inconsistent punctuation. Either use 'NVML...' (with proper ellipsis) or 'NVML' (without trailing dots).
- klog.Info("[SWATI DEBUG] initializing NVML..")
+ klog.Info("[SWATI DEBUG] initializing NVML")
More logs after fixing the republish of the resourceslice when an unhealthy GPU is found.
Still need to fix the resourceclaim status update: it is not using the right client API.
Resource slices should not be republished when a GPU goes unhealthy. The unhealthy GPU should still be listed, but a device taint should be added to it so that the scheduler doesn't schedule onto it.
Unless something's changed recently that I'm not aware of, there is no ResourceSlice status. We only have a spec so far, as we haven't had a need to add a status yet.
Yes, this is just to test the e2e flow (report any health event, with republishing the slice from the driver as an example action). It only checks whether I have set everything up correctly.
Not the resourceslice; the plan is to update the resourceclaim status, similar to this: https://github.com/google/dranet/pull/78/files#diff-e8a7e777d80a14b455bdbf7aae3f28ad8082ffa0a06579e11cc1af741b5f98f7R266
Got the resourceclaim status to update.
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
717656d to d1852f0
Signed-off-by: Swati Gupta <[email protected]>
Updated the action on a health event: update the device condition to unhealthy in the resourceclaim status.
@ArangoGutierrez @klueska can I get a preliminary review on this? There are still some open tasks, but it is in a working state.
Need to check how to enable the DeviceHealth feature gate (FG) from Helm.
/subscribe
Signed-off-by: Swati Gupta <[email protected]>
ed314bf to f80aa9a
Signed-off-by: Swati Gupta <[email protected]>
The code is ready for review. I still need to test the MIG flow, but it works for a full GPU. The device status updates need some minor refactoring, for which I have added TODOs.
ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
if ret == nvml.ERROR_NOT_SUPPORTED {
	klog.Warningf("Device %v is too old to support healthchecking.", u)
}
if ret != nvml.SUCCESS {
	klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
	m.unhealthy <- dev
}
Can we make it look like the other two checks above? It feels odd otherwise. Also, the continue is missing (it is present in the other two checks above).
- ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
- if ret == nvml.ERROR_NOT_SUPPORTED {
- 	klog.Warningf("Device %v is too old to support healthchecking.", u)
- }
- if ret != nvml.SUCCESS {
- 	klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
- 	m.unhealthy <- dev
- }
+ ret = gpu.RegisterEvents(eventMask&supportedEvents, m.eventSet)
+ if ret != nvml.SUCCESS {
+ 	if ret == nvml.ERROR_NOT_SUPPORTED {
+ 		klog.Warningf("Device %v is too old to support healthchecking.", u)
+ 	}
+ 	klog.Infof("unable to register events for %s: %v; marking it as unhealthy", u, ret)
+ 	m.unhealthy <- dev
+ 	continue
+ }
}

func (m *deviceHealthMonitor) registerDevicesForEvents() {
	eventMask := uint64(nvml.EventTypeXidCriticalError | nvml.EventTypeDoubleBitEccError | nvml.EventTypeSingleBitEccError)
Below, in line 169, we skip nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError using
if event.EventType != nvml.EventTypeXidCriticalError {
So should we drop those two from the eventMask here?
ECC errors are usually not fatal. All of this is taken as-is from
https://github.com/NVIDIA/k8s-device-plugin/blob/af276bfa4be954f6ac7534cc01891a3a7dcb436f/internal/rm/health.go#L39
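The exchange above is about two separate steps: which events are registered, and which received events actually mark a device unhealthy. A minimal sketch of that split follows. The constants are local stand-ins rather than the go-nvml values, and the function names and the skip list are hypothetical, chosen only to illustrate the behaviour being discussed.

```go
package main

import "fmt"

// Local stand-ins for the NVML event-type bit flags. In the PR these come
// from go-nvml; the values here are illustrative only.
const (
	eventSingleBitEccError uint64 = 1 << iota
	eventDoubleBitEccError
	eventXidCriticalError
)

// registrationMask limits the requested events to those the device supports,
// mirroring the eventMask&supportedEvents expression in the diff.
func registrationMask(requested, supported uint64) uint64 {
	return requested & supported
}

// marksUnhealthy reports whether a received event should mark the device
// unhealthy. ECC events are received but ignored, and XIDs in the skip set
// are ignored too. The skip set is a hypothetical parameter.
func marksUnhealthy(eventType, xid uint64, skipped map[uint64]bool) bool {
	if eventType != eventXidCriticalError {
		return false
	}
	return !skipped[xid]
}

func main() {
	requested := eventXidCriticalError | eventDoubleBitEccError | eventSingleBitEccError
	supported := eventXidCriticalError | eventDoubleBitEccError
	fmt.Printf("registered mask: %#x\n", registrationMask(requested, supported))

	skipped := map[uint64]bool{13: true} // illustrative skip list
	fmt.Println(marksUnhealthy(eventDoubleBitEccError, 0, skipped))
	fmt.Println(marksUnhealthy(eventXidCriticalError, 13, skipped))
	fmt.Println(marksUnhealthy(eventXidCriticalError, 79, skipped))
}
```

Under this reading, dropping the ECC types from the mask would not change which devices are marked unhealthy; it would only change which events are delivered and then discarded.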
	continue
}

if event.EventType != nvml.EventTypeXidCriticalError {
We are also registering for nvml.EventTypeDoubleBitEccError and nvml.EventTypeSingleBitEccError above; see the comment above.
case device, ok := <-d.deviceHealthMonitor.Unhealthy():
	if !ok {
		klog.V(6).Info("Health monitor channel closed")
		return
When does this channel close? It looks like we also stop processing device health notifications here. Do we want this as a Warning?
	Destination: &flags.healthcheckPort,
	EnvVars:     []string{"HEALTHCHECK_PORT"},
},
&cli.StringFlag{
Is StringSliceFlag a better choice? (see how we use it in https://github.com/search?q=repo%3ANVIDIA%2Fk8s-dra-driver-gpu%20%2FadditionalNamespaces%2F&type=code)
Comma separation will still work, I think!
Test of a skipped XID:
Signed-off-by: Swati Gupta <[email protected]>
Addressing #360 to add preliminary health check