
Conversation

guptaNswati (Contributor)

This is the beginning of adding more logging to the code. Some of it, like the functioning of the ComputeDomain manager, can be at the default level, while others, like the DaemonSet manager and ResourceClaimTemplate manager, can be at the verbose level. I will add a debug mode later.

If we feel even this is too verbose, I can move them to debug mode.
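
For reference, a minimal sketch of how these tiers might map onto klog verbosity levels (the specific level numbers here are illustrative, not what this PR settles on):

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Default level: always emitted, e.g. ComputeDomain manager lifecycle events.
	klog.Infof("Creating new ComputeDomainManager")

	// Verbose level (-v=2): e.g. DaemonSet manager and ResourceClaimTemplate manager details.
	klog.V(2).Infof("DaemonSet not yet ready, waiting")

	// Debug level (-v=6): chatty per-object tracing, to be added later.
	klog.V(6).Infof("ComputeDomain still exists, skipping cleanup")
}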


copy-pr-bot bot commented Mar 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guptaNswati (Contributor Author)

/ok to test

Comment on lines 231 to 253
// check if all resource claims for workloads are gone
cd, err := m.Get(cdUID)
if err != nil {
	return fmt.Errorf("error retrieving ComputeDomain: %w", err)
}

resourceClaims, err := m.config.clientsets.Core.ResourceV1beta1().ResourceClaims(cd.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(labelSelector),
})
if err != nil {
	return fmt.Errorf("error retrieving ResourceClaims: %w", err)
}

if len(resourceClaims.Items) != 0 {
	claimNames := []string{}
	for _, claim := range resourceClaims.Items {
		claimNames = append(claimNames, claim.Name)
	}
	klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v",
		len(resourceClaims.Items), cdUID, claimNames)
	return fmt.Errorf("ResourceClaims exist for ComputeDomain %s", cdUID)
}

Collaborator:

This is not related to adding logs and should be in its own PR.

That said, as currently implemented this is a NOOP -- the ResourceClaims generated from our ResourceClaimTemplate for workloads will never have the ComputeDomain label applied to them, so the list operation will always return an empty list.

@guptaNswati (Contributor Author) Mar 12, 2025:

Yes, agree: this should be in a separate PR. I just added it here after I added the iteration over node names; it was based on the list-of-nodes logic. And anyway, I spent more time adding this part than the logs themselves.

Oh, I checked https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/templates/compute-domain-daemon-claim-template.tmpl.yaml#L10 and thought this label would be available in all the ResourceClaim templates.

Oh, I see; it's an empty field at the time of generation.

Collaborator:

This comment was actually more relevant on an older iteration of the code. The right way to guarantee that all workloads have stopped now is to wait until no nodes have the ComputeDomain label anymore. This is sufficient because the workload pods remove this label as they shut down.
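
For illustration, a rough sketch of that node-label check; the label key, function name, and clientset wiring below are assumptions, not the repo's exact code:

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// assertNoLabeledNodes waits for workload shutdown by checking that no node
// still carries the ComputeDomain label for this UID (workload pods remove
// the label as they shut down).
func assertNoLabeledNodes(ctx context.Context, core kubernetes.Interface, cdUID string) error {
	// Assumed label key, for illustration only.
	selector := fmt.Sprintf("resource.nvidia.com/computeDomain=%s", cdUID)

	nodes, err := core.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return fmt.Errorf("error listing nodes: %w", err)
	}
	if len(nodes.Items) != 0 {
		return fmt.Errorf("%d nodes still labeled for ComputeDomain %s", len(nodes.Items), cdUID)
	}
	return nil
}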

	return fmt.Errorf("error retrieving ComputeDomain: %w", err)
}
if cd == nil {
	klog.Infof("ComputeDomain with UID %s not found, nothing to do", uid)
Collaborator:

This should be (at least) verbose level 2.
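
In code terms, roughly (a sketch of the suggested change; the surrounding function and the return value are assumptions, not the actual diff):

if cd == nil {
	// Gate the "nothing to do" message behind verbosity level 2, as suggested.
	klog.V(2).Infof("ComputeDomain with UID %s not found, nothing to do", uid)
	return nil // assumed follow-up; the rest of the function is not shown here
}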

factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod)
informer := factory.Resource().V1beta1().ComputeDomains().Informer()

klog.Infof("Creating new ComputeDomainManager with config %+v", config)
@klueska (Collaborator) Mar 12, 2025:

Suggested change
klog.Infof("Creating new ComputeDomainManager with config %+v", config)
klog.Infof("Creating new ComputeDomainManager with config:\n%+v", config)

@guptaNswati (Contributor Author):

This is actually not giving much valuable info:

Creating new ComputeDomainManager with config &{driverName:compute-domain.nvidia.com driverNamespace:nvidia-dra-driver-gpu clientsets: {Core:0x4000685a40 Nvidia:0x40003c8c50} workQueue:0x4000612020}

Maybe expand on the subfields?
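
One option (a rough sketch; the field names are guessed from the output above) is to log the scalar fields explicitly instead of dumping the whole struct:

// Sketch: avoid %+v on the whole struct, which prints unhelpful pointer values
// for the clientsets and work queue; log the meaningful scalar fields instead.
klog.Infof("Creating new ComputeDomainManager (driverName=%s, driverNamespace=%s)",
	config.driverName, config.driverNamespace)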

@Copilot (Copilot AI) left a comment:

Pull Request Overview

This pull request adds several log statements to the ComputeDomainManager to provide more visibility into its operations. Key changes include:

  • Logging configuration details upon creation of ComputeDomainManager.
  • Enhanced logging for cases when ComputeDomains are not found or when finalizers are missing.
  • Detailed logs in the AssertWorkloadsCompleted function for both node labels and resource claims.

Comments suppressed due to low confidence (2)

cmd/compute-domain-controller/computedomain.go:227

  • There appears to be an extra space before the UID format specifier. Removing the extra space would improve the consistency of the log formatting.
klog.Errorf("Found %d nodes with label for ComputeDomain with UID  %s: %v", len(nodes.Items), cdUID, nodeNames)

cmd/compute-domain-controller/computedomain.go:249

  • [nitpick] Logging the error and also returning an error for resource claims may result in duplicate error reporting. Consider choosing one approach to minimize redundant logging.
klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v", len(resourceClaims.Items), cdUID, claimNames)

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati force-pushed the add-debug-mode branch 3 times, most recently from d4c402d to eb50da8 on March 24, 2025 at 22:45
Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati (Contributor Author)

/ok to test

ds = append(ds, d)
}

klog.V(2).Infof("Found %d objects with ComputeDomain Label with UID %s", len(ds), cdUID)
@guptaNswati (Contributor Author):

Resulting log:

I0331 20:49:32.819074       1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f
I0331 20:49:32.819083       1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f

}

if int(d.Status.NumberReady) != cd.Spec.NumNodes {
klog.V(2).Infof("DaemonSet %s/%s has %d ready nodes, expecting %d, waiting for all nodes to be ready", d.Namespace, d.Name, d.Status.NumberReady, cd.Spec.NumNodes)
@guptaNswati (Contributor Author):

Test log:

I0331 20:49:32.819099       1 daemonset.go:337] DaemonSet nvidia-dra-driver-gpu/nvbandwidth-test-compute-domain-trgmx has 0 ready nodes, expecting 4, waiting for all nodes to be ready


informer := factory.Apps().V1().DaemonSets().Informer()

klog.Infof("Creating new DaemonSetManager for driver %s/%s", config.driverNamespace, config.driverName)
@guptaNswati (Contributor Author):

Test log:

I0331 20:49:32.416950       1 daemonset.go:89] Creating new DaemonSetManager for driver nvidia-dra-driver-gpu/compute-domain.nvidia.com

// It initializes the work queue, starts the ComputeDomain manager, and handles
// graceful shutdown when the context is cancelled.
func (c *Controller) Run(ctx context.Context) error {
	klog.Info("Starting ComputeDomain Controller")
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416860       1 controller.go:62] Starting ComputeDomain Controller

factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod)
informer := factory.Resource().V1beta1().ComputeDomains().Informer()

klog.Infof("Creating new ComputeDomainManager for %s/%s", config.driverName, config.driverNamespace)
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416920       1 computedomain.go:68] Creating new ComputeDomainManager for compute-domain.nvidia.com/nvidia-dra-driver-gpu

}

func NewCleanupManager[T metav1.Object](informer cache.SharedIndexInformer, getComputeDomain GetComputeDomainFunc, callback CleanupCallback[T]) *CleanupManager[T] {
klog.Infof("Creating new Cleanup Manager for %T", *new(T))
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416970       1 cleanup.go:29] Creating new Cleanup Manager for *v1beta1.ResourceClaimTemplate
I0331 20:49:32.416975       1 cleanup.go:29] Creating new Cleanup Manager for *v1.DaemonSet
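
As an aside, the type names in that output come from applying %T to *new(T), which yields the zero value of the type parameter; a tiny self-contained illustration:

package main

import "fmt"

type CleanupManager[T any] struct{}

// NewCleanupManager prints the concrete type it was instantiated with:
// new(T) returns a *T, so *new(T) is T's zero value, and %T prints its type.
func NewCleanupManager[T any]() *CleanupManager[T] {
	fmt.Printf("Creating new Cleanup Manager for %T\n", *new(T))
	return &CleanupManager[T]{}
}

func main() {
	_ = NewCleanupManager[string]() // Creating new Cleanup Manager for string
	_ = NewCleanupManager[*int]()   // Creating new Cleanup Manager for *int
}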

}

if computeDomain != nil {
	klog.V(6).Infof("ComputeDomain with UID %s still exists, skipping cleanup", uid)
@guptaNswati (Contributor Author):

Found 1 items to check for cleanup
I0401 20:39:32.718643       1 cleanup.go:94] ComputeDomain with UID 2577168b-75d1-4c51-a659-d2225e2fe24f still exists, skipping cleanup

@guptaNswati (Contributor Author)

@klueska I like the style and organization of Run.ai logging. An example snapshot of the scheduler logs:

2025-04-01T21:11:16.324Z	INFO	scheduler/scheduler.go:108	[49cdc93f-8fab-4a48-aadc-0bf70b981af2] End scheduling ...
2025-04-01T21:11:17.324Z	INFO	scheduler/scheduler.go:86	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Start scheduling ...
2025-04-01T21:11:17.324Z	INFO	framework/session.go:325	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Taking cluster snapshot ...
2025-04-01T21:11:17.325Z	INFO	framework/session.go:336	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Session 1365efa8-ccb6-476c-933a-4b5355e2e48d with <0> Jobs, <0> Queues and <4> Nodes
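
Purely as an illustration of that style with klog (the helper and UUID usage below are hypothetical, not something in this repo), one could thread a per-operation ID through the log lines:

package main

import (
	"flag"

	"github.com/google/uuid"
	"k8s.io/klog/v2"
)

// opLogf is a hypothetical helper that prefixes each message with a
// per-operation ID, similar to the session IDs in the Run.ai scheduler logs.
func opLogf(opID, format string, args ...interface{}) {
	klog.Infof("[%s] "+format, append([]interface{}{opID}, args...)...)
}

func main() {
	klog.InitFlags(nil)
	flag.Parse()

	opID := uuid.NewString()
	opLogf(opID, "Start scheduling ...")
	opLogf(opID, "Taking cluster snapshot ...")
	opLogf(opID, "End scheduling ...")
}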

@klueska added this to the v25.12.0 milestone Aug 13, 2025
@klueska added the "debuggability" label (issue/pr related to the ability to debug the system) Aug 13, 2025