
Conversation

guptaNswati (Contributor)

This is the beginning of adding more logging to the code. Some of it, like the functioning of the ComputeDomain manager, can be at the default level, while others, like the DaemonSet manager and ResourceClaimTemplate manager, can be at the verbose level. I will add a debug mode later.

If we feel even this is too verbose, I can move them to debug mode.
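
For reference, a minimal sketch of how these tiers might map onto klog verbosity levels (the specific level numbers here are illustrative, not what this PR settles on):

package main

import (
	"flag"

	"k8s.io/klog/v2"
)

func main() {
	klog.InitFlags(nil)
	flag.Parse()
	defer klog.Flush()

	// Default level: always emitted, e.g. ComputeDomain manager lifecycle events.
	klog.Infof("Creating new ComputeDomainManager")

	// Verbose level (-v=2): e.g. DaemonSet manager and ResourceClaimTemplate manager details.
	klog.V(2).Infof("DaemonSet not yet ready, waiting")

	// Debug level (-v=6): chatty per-object tracing, to be added later.
	klog.V(6).Infof("ComputeDomain still exists, skipping cleanup")
}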


copy-pr-bot bot commented Mar 12, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@guptaNswati (Contributor Author)

/ok to test

Comment on lines 231 to 253
// check if all resource claims for workloads are gone
cd, err := m.Get(cdUID)
if err != nil {
	return fmt.Errorf("error retrieving ComputeDomain: %w", err)
}

resourceClaims, err := m.config.clientsets.Core.ResourceV1beta1().ResourceClaims(cd.Namespace).List(ctx, metav1.ListOptions{
	LabelSelector: metav1.FormatLabelSelector(labelSelector),
})
if err != nil {
	return fmt.Errorf("error retrieving ResourceClaims: %w", err)
}

if len(resourceClaims.Items) != 0 {
	claimNames := []string{}
	for _, claim := range resourceClaims.Items {
		claimNames = append(claimNames, claim.Name)
	}
	klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v",
		len(resourceClaims.Items), cdUID, claimNames)
	return fmt.Errorf("ResourceClaims exist for ComputeDomain %s", cdUID)
}

Collaborator:

This is not related to adding logs and should be in its own PR.

That said, as currently implemented this is a NOOP -- the ResourceClaims generated from our ResourceClaimTemplate for workloads will never have the ComputeDomain label applied to them, so the list operation will always return an empty list.

@guptaNswati (Contributor Author) Mar 12, 2025:

Yes, agree: this should be in a separate PR. I just added it here after I added the iteration over node names; it was based on the list-of-nodes logic. And anyway, I spent more time adding this part than the logs themselves.

Oh, I checked https://github.com/NVIDIA/k8s-dra-driver-gpu/blob/main/templates/compute-domain-daemon-claim-template.tmpl.yaml#L10 and thought this label would be available in all the ResourceClaim templates.

Oh, I see; it's an empty field at the time of generation.

Collaborator:

This comment was actually more relevant on an older iteration of the code. The right way to guarantee that all workloads have stopped now is to wait until no nodes have the ComputeDomain label anymore. This is sufficient because the workload pods remove this label as they shut down.
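
For illustration, a rough sketch of that node-label check; the label key, function name, and clientset wiring below are assumptions, not the repo's exact code:

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// assertNoLabeledNodes waits for workload shutdown by checking that no node
// still carries the ComputeDomain label for this UID (workload pods remove
// the label as they shut down).
func assertNoLabeledNodes(ctx context.Context, core kubernetes.Interface, cdUID string) error {
	// Assumed label key, for illustration only.
	selector := fmt.Sprintf("resource.nvidia.com/computeDomain=%s", cdUID)

	nodes, err := core.CoreV1().Nodes().List(ctx, metav1.ListOptions{LabelSelector: selector})
	if err != nil {
		return fmt.Errorf("error listing nodes: %w", err)
	}
	if len(nodes.Items) != 0 {
		return fmt.Errorf("%d nodes still labeled for ComputeDomain %s", len(nodes.Items), cdUID)
	}
	return nil
}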

	return fmt.Errorf("error retrieving ComputeDomain: %w", err)
}
if cd == nil {
	klog.Infof("ComputeDomain with UID %s not found, nothing to do", uid)
Collaborator:

This should be (at least) verbose level 2.
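
In code terms, roughly (a sketch of the suggested change; the surrounding function and the return value are assumptions, not the actual diff):

if cd == nil {
	// Gate the "nothing to do" message behind verbosity level 2, as suggested.
	klog.V(2).Infof("ComputeDomain with UID %s not found, nothing to do", uid)
	return nil // assumed follow-up; the rest of the function is not shown here
}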

factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod)
informer := factory.Resource().V1beta1().ComputeDomains().Informer()

klog.Infof("Creating new ComputeDomainManager with config %+v", config)
@klueska (Collaborator) Mar 12, 2025:

Suggested change
klog.Infof("Creating new ComputeDomainManager with config %+v", config)
klog.Infof("Creating new ComputeDomainManager with config:\n%+v", config)

@guptaNswati (Contributor Author):

This is actually not giving much valuable info:

Creating new ComputeDomainManager with config &{driverName:compute-domain.nvidia.com driverNamespace:nvidia-dra-driver-gpu clientsets: {Core:0x4000685a40 Nvidia:0x40003c8c50} workQueue:0x4000612020}

Maybe expand on the subfields?
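
One option (a rough sketch; the field names are guessed from the output above) is to log the scalar fields explicitly instead of dumping the whole struct:

// Sketch: avoid %+v on the whole struct, which prints unhelpful pointer values
// for the clientsets and work queue; log the meaningful scalar fields instead.
klog.Infof("Creating new ComputeDomainManager (driverName=%s, driverNamespace=%s)",
	config.driverName, config.driverNamespace)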

@Copilot (Copilot AI) left a comment:

Pull Request Overview

This pull request adds several log statements to the ComputeDomainManager to provide more visibility into its operations. Key changes include:

  • Logging configuration details upon creation of ComputeDomainManager.
  • Enhanced logging for cases when ComputeDomains are not found or when finalizers are missing.
  • Detailed logs in the AssertWorkloadsCompleted function for both node labels and resource claims.

Comments suppressed due to low confidence (2)

cmd/compute-domain-controller/computedomain.go:227

  • There appears to be an extra space before the UID format specifier. Removing the extra space would improve the consistency of the log formatting.
klog.Errorf("Found %d nodes with label for ComputeDomain with UID  %s: %v", len(nodes.Items), cdUID, nodeNames)

cmd/compute-domain-controller/computedomain.go:249

  • [nitpick] Logging the error and also returning an error for resource claims may result in duplicate error reporting. Consider choosing one approach to minimize redundant logging.
klog.Errorf("Found %d ResourceClaims for ComputeDomain with UID %s: %v", len(resourceClaims.Items), cdUID, claimNames)

Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati force-pushed the add-debug-mode branch 3 times, most recently from d4c402d to eb50da8 on March 24, 2025 at 22:45
Signed-off-by: Swati Gupta <[email protected]>
@guptaNswati (Contributor Author)

/ok to test

ds = append(ds, d)
}

klog.V(2).Infof("Found %d objects with ComputeDomain Label with UID %s", len(ds), cdUID)
@guptaNswati (Contributor Author):

Resulting log:

I0331 20:49:32.819074       1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f
I0331 20:49:32.819083       1 indexers.go:76] Found 1 objects with ComputeDomain Label with UID 2577168b-75d1-4c51-a659-d2225e2fe24f

}

if int(d.Status.NumberReady) != cd.Spec.NumNodes {
klog.V(2).Infof("DaemonSet %s/%s has %d ready nodes, expecting %d, waiting for all nodes to be ready", d.Namespace, d.Name, d.Status.NumberReady, cd.Spec.NumNodes)
@guptaNswati (Contributor Author):

Test log:

I0331 20:49:32.819099       1 daemonset.go:337] DaemonSet nvidia-dra-driver-gpu/nvbandwidth-test-compute-domain-trgmx has 0 ready nodes, expecting 4, waiting for all nodes to be ready


informer := factory.Apps().V1().DaemonSets().Informer()

klog.Infof("Creating new DaemonSetManager for driver %s/%s", config.driverNamespace, config.driverName)
@guptaNswati (Contributor Author):

Test log:

I0331 20:49:32.416950       1 daemonset.go:89] Creating new DaemonSetManager for driver nvidia-dra-driver-gpu/compute-domain.nvidia.com

// It initializes the work queue, starts the ComputeDomain manager, and handles
// graceful shutdown when the context is cancelled.
func (c *Controller) Run(ctx context.Context) error {
	klog.Info("Starting ComputeDomain Controller")
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416860       1 controller.go:62] Starting ComputeDomain Controller

factory := nvinformers.NewSharedInformerFactory(config.clientsets.Nvidia, informerResyncPeriod)
informer := factory.Resource().V1beta1().ComputeDomains().Informer()

klog.Infof("Creating new ComputeDomainManager for %s/%s", config.driverName, config.driverNamespace)
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416920       1 computedomain.go:68] Creating new ComputeDomainManager for compute-domain.nvidia.com/nvidia-dra-driver-gpu

}

func NewCleanupManager[T metav1.Object](informer cache.SharedIndexInformer, getComputeDomain GetComputeDomainFunc, callback CleanupCallback[T]) *CleanupManager[T] {
klog.Infof("Creating new Cleanup Manager for %T", *new(T))
@guptaNswati (Contributor Author):

Test:

I0331 20:49:32.416970       1 cleanup.go:29] Creating new Cleanup Manager for *v1beta1.ResourceClaimTemplate
I0331 20:49:32.416975       1 cleanup.go:29] Creating new Cleanup Manager for *v1.DaemonSet
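
As an aside, the type names in that output come from applying %T to *new(T), which yields the zero value of the type parameter; a tiny self-contained illustration:

package main

import "fmt"

type CleanupManager[T any] struct{}

// NewCleanupManager prints the concrete type it was instantiated with:
// new(T) returns a *T, so *new(T) is T's zero value, and %T prints its type.
func NewCleanupManager[T any]() *CleanupManager[T] {
	fmt.Printf("Creating new Cleanup Manager for %T\n", *new(T))
	return &CleanupManager[T]{}
}

func main() {
	_ = NewCleanupManager[string]() // Creating new Cleanup Manager for string
	_ = NewCleanupManager[*int]()   // Creating new Cleanup Manager for *int
}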

}

if computeDomain != nil {
	klog.V(6).Infof("ComputeDomain with UID %s still exists, skipping cleanup", uid)
@guptaNswati (Contributor Author):

Found 1 items to check for cleanup
I0401 20:39:32.718643       1 cleanup.go:94] ComputeDomain with UID 2577168b-75d1-4c51-a659-d2225e2fe24f still exists, skipping cleanup

@guptaNswati (Contributor Author)

@klueska I like the style and organization of Run.ai logging. An example snapshot of the scheduler logs:

2025-04-01T21:11:16.324Z	INFO	scheduler/scheduler.go:108	[49cdc93f-8fab-4a48-aadc-0bf70b981af2] End scheduling ...
2025-04-01T21:11:17.324Z	INFO	scheduler/scheduler.go:86	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Start scheduling ...
2025-04-01T21:11:17.324Z	INFO	framework/session.go:325	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Taking cluster snapshot ...
2025-04-01T21:11:17.325Z	INFO	framework/session.go:336	[1365efa8-ccb6-476c-933a-4b5355e2e48d] Session 1365efa8-ccb6-476c-933a-4b5355e2e48d with <0> Jobs, <0> Queues and <4> Nodes
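
Purely as an illustration of that style with klog (the helper and UUID usage below are hypothetical, not something in this repo), one could thread a per-operation ID through the log lines:

package main

import (
	"flag"

	"github.com/google/uuid"
	"k8s.io/klog/v2"
)

// opLogf is a hypothetical helper that prefixes each message with a
// per-operation ID, similar to the session IDs in the Run.ai scheduler logs.
func opLogf(opID, format string, args ...interface{}) {
	klog.Infof("[%s] "+format, append([]interface{}{opID}, args...)...)
}

func main() {
	klog.InitFlags(nil)
	flag.Parse()

	opID := uuid.NewString()
	opLogf(opID, "Start scheduling ...")
	opLogf(opID, "Taking cluster snapshot ...")
	opLogf(opID, "End scheduling ...")
}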

@klueska added this to the v25.12.0 milestone Aug 13, 2025
@klueska added the "debuggability" label (issue/pr related to the ability to debug the system) Aug 13, 2025