-
Notifications
You must be signed in to change notification settings - Fork 89
Draft: Add more logs to computedomain manager #285
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
Signed-off-by: Swati Gupta <[email protected]>
Signed-off-by: Swati Gupta <[email protected]>
574a42d
to
7fe7ff2
Compare
/ok to test |
// check if all resource claims for workloads are gone | ||
cd, err := m.Get(cdUID) | ||
if err != nil { | ||
return fmt.Errorf("error retrieving ComputeDomain: %w", err) | ||
} | ||
|
||
resourceClaims, err := m.config.clientsets.Core.ResourceV1beta1().ResourceClaims(cd.Namespace).List(ctx, metav1.ListOptions{ | ||
LabelSelector: metav1.FormatLabelSelector(labelSelector), | ||
}) | ||
if err != nil { | ||
return fmt.Errorf("error retrieving ResourceClaims: %w", err) | ||
} | ||
|
||
if len(resourceClaims.Items) != 0 { | ||
claimNames := []string{} | ||
for _, claim := range resourceClaims.Items { | ||
claimNames = append(claimNames, claim.Name) | ||
} | ||
klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v", | ||
len(resourceClaims.Items), cdUID, claimNames) | ||
return fmt.Errorf("ResourceClaims exist for ComputeDomain %s", cdUID) | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not related to adding logs and should be in its own PR.
That said, as currently implemented this is a NOOP -- the ResourceClaims generated from our ResourceClaimTemplate for workloads will never have the ComputeDomain
label applied to them, so the list operation will always return an empty list.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes agree. this should be in a separate PR. I just added this here after i added the iterations on node.Names. It was based on the list of nodes logic. And anyway i spent more time adding this part than the logs itself.
Oh, i checked this https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/templates/compute-domain-daemon-claim-template.tmpl.yaml#L10 and thought this label will be available in all the resourceclaim templates
Oh i see, its an empty field at the time of generation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment was actually more relevant on an older iteration of the code. The right way to guarantee that all workloads have stopped now is to wait until no nodes have the ComputeDomain label anymore. This is sufficient because the workload pods remove this label as they shutdown.
return fmt.Errorf("error retrieving ComputeDomain: %w", err) | ||
} | ||
if cd == nil { | ||
klog.Infof("ComputeDomain with UID %s not found, nothing to do", uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be (at least) verbose level 2.
factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod) | ||
informer := factory.Resource().V1beta1().ComputeDomains().Informer() | ||
|
||
klog.Infof("Creating new ComputeDomainManager with config %+v", config) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
klog.Infof("Creating new ComputeDomainManager with config %+v", config) | |
klog.Infof("Creating new ComputeDomainManager with config:\n%+v", config) | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is actually not giving much valuable info
Creating new ComputeDomainManager with config &{driverName:compute-domain.nvidia.com driverNamespace:nvidia-dra-driver-gpu clientsets: {Core:0x4000685a40 Nvidia:0x40003c8c50} workQueue:0x4000612020}
may be expand on the subfields?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request adds several log statements to the ComputeDomainManager to provide more visibility into its operations. Key changes include:
- Logging configuration details upon creation of ComputeDomainManager.
- Enhanced logging for cases when ComputeDomains are not found or when finalizers are missing.
- Detailed logs in the AssertWorkloadsCompleted function for both node labels and resource claims.
Comments suppressed due to low confidence (2)
cmd/compute-domain-controller/computedomain.go:227
- There appears to be an extra space before the UID format specifier. Removing the extra space would improve the consistency of the log formatting.
klog.Errorf("Found %d nodes with label for ComputeDomain with UID %s: %v", len(nodes.Items), cdUID, nodeNames)
cmd/compute-domain-controller/computedomain.go:249
- [nitpick] Logging the error and also returning an error for resource claims may result in duplicate error reporting. Consider choosing one approach to minimize redundant logging.
klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v", len(resourceClaims.Items), cdUID, claimNames)
50a9620
to
6af8bc2
Compare
Signed-off-by: Swati Gupta <[email protected]>
d4c402d
to
eb50da8
Compare
Signed-off-by: Swati Gupta <[email protected]>
eb50da8
to
6c543dc
Compare
/ok to test |
ds = append(ds, d) | ||
} | ||
|
||
klog.V(2).Infof("Found %d objects with ComputeDomain Label with UID %s", len(ds), cdUID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resulting log:
I0331 20:49:32.819074 1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f
I0331 20:49:32.819083 1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f
} | ||
|
||
if int(d.Status.NumberReady) != cd.Spec.NumNodes { | ||
klog.V(2).Infof("DaemonSet %s/%s has %d ready nodes, expecting %d, waiting for all nodes to be ready", d.Namespace, d.Name, d.Status.NumberReady, cd.Spec.NumNodes) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test log:
I0331 20:49:32.819099 1 daemonset.go:337] DaemonSet nvidia-dra-driver-gpu/nvbandwidth-test-compute-domain-trgmx has 0 ready nodes, expecting 4, waiting for all nodes to be ready
|
||
informer := factory.Apps().V1().DaemonSets().Informer() | ||
|
||
klog.Infof("Creating new DaemonSetManager for driver %s/%s", config.driverNamespace, config.driverName) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test log:
I0331 20:49:32.416950 1 daemonset.go:89] Creating new DaemonSetManager for driver nvidia-dra-driver-gpu/compute-domain.nvidia.com
// It initializes the work queue, starts the ComputeDomain manager, and handles | ||
// graceful shutdown when the context is cancelled. | ||
func (c *Controller) Run(ctx context.Context) error { | ||
klog.Info("Starting ComputeDomain Controller") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test:
I0331 20:49:32.416860 1 controller.go:62] Starting ComputeDomain Controller
factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod) | ||
informer := factory.Resource().V1beta1().ComputeDomains().Informer() | ||
|
||
klog.Infof("Creating new ComputeDomainManager for %s/%s", config.driverName, config.driverNamespace) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test:
I0331 20:49:32.416920 1 computedomain.go:68] Creating new ComputeDomainManager for compute-domain.nvidia.com/nvidia-dra-driver-gpu
} | ||
|
||
func NewCleanupManager[T metav1.Object](informer cache.SharedIndexInformer, getComputeDomain GetComputeDomainFunc, callback CleanupCallback[T]) *CleanupManager[T] { | ||
klog.Infof("Creating new Cleanup Manager for %T", *new(T)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test:
I0331 20:49:32.416970 1 cleanup.go:29] Creating new Cleanup Manager for *v1beta1.ResourceClaimTemplate
I0331 20:49:32.416975 1 cleanup.go:29] Creating new Cleanup Manager for *v1.DaemonSet
// It initializes the work queue, starts the ComputeDomain manager, and handles | ||
// graceful shutdown when the context is cancelled. | ||
func (c *Controller) Run(ctx context.Context) error { | ||
klog.Info("Starting ComputeDomain Controller") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test:
I0331 20:49:32.416860 1 controller.go:62] Starting ComputeDomain Controller
Signed-off-by: Swati Gupta <[email protected]>
} | ||
|
||
if computeDomain != nil { | ||
klog.V(6).Infof("ComputeDomain with UID %s still exists, skipping cleanup", uid) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Found 1 items to check for cleanup
I0401 20:39:32.718643 1 cleanup.go:94] ComputeDomain with UID 2577168b-75d1-4c51-a659-d2225e2fe24f still exists, skipping cleanup
@klueska i like the style and organization of Run.ai logging. An example snapshot of the scheduler logs
|
This is the beginning of adding more logging to the code. Some of it like the functioning of compute domain manager can be in the default mode and others like deamonset manager and resourceclaim template manager can be in the verbose mode. I will add a debug mode later.
If we feel even this is too verbose, i can move them to debug mode.