jgehrcke
Collaborator

@jgehrcke jgehrcke commented Sep 30, 2025

Resolves #609.

This PR has a set of related changes that we can also discuss in separate PRs if preferred:

  1. For better debuggability: log full component config (flags object) upon component startup (with go-render to quickly get to a stringified version of a more or less complex, nested struct -- open to using a different strategy).

  2. For control: introduction of a component-global logVerbosity Helm chart parameter, including documentation laying out the starting point for a verbosity system (comments very welcome)

  3. For less noise in default config:

    • Not as much chatter anymore during runtime (with level 1 being the default, and some proposals made for what level 1 should contain -- see logVerbosity level documentation).
    • Re-parametrization of the work queue rate limiter for prepare/unprepare retries to be slightly slower -- we do not need to retry ~10 times per second initially when expected time-to-completion is O(1 s) or slower anyway. I made this change with log verbosity being a motivation, but I believe architecturally this change also makes sense and might even be important (discussed elsewhere). I chose the current parameters almost flying blind. If in the future we ever find the work queue rate limiter to introduce unnecessary latency (as part of latency/performance tuning): let's of course adjust this again.
  4. For robustness, explicit log flushing as part of component shutdown (I think we missed this so far).

One interesting change that I propose here is to not have those "updated/added object callback" confirmations logged on the default log level in the CD controller -- I think we should try to not scale log volume with number of objects created (at least in default config).

copy-pr-bot bot commented Sep 30, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


// Run invokes the IMEX daemon and manages its lifecycle.
func run(ctx context.Context, cancel context.CancelFunc, flags *Flags) error {
klog.Infof("config: %v", flags)
Collaborator Author

If we are OK with the go-render route, then I want to apply that here, too.

}

// Run invokes the IMEX daemon and manages its lifecycle.
func run(ctx context.Context, cancel context.CancelFunc, flags *Flags) error {
Collaborator Author

TODO, maybe in this PR, maybe in a follow-up: make the daemon consume the logVerbosity parameter, too (and init klog with that).

It's currently as noisy as before the patch in default config, containing e.g.

I0930 12:34:30.570186       1 round_trippers.go:632] "Response" verb="GET" url="https://10.96.0.1:443/apis/resource.nvidia.com/v1beta1/namespaces/default/computedomains?allowWatchBookmarks=true&fieldSelector=metadata.name%3Dimex-channel-injection-all&resourceVersion=21893750&timeout=7m41s&timeoutSeconds=461&watch=true" status="200 OK" milliseconds=1

},
Action: func(c *cli.Context) error {
ctx := c.Context
klog.Infof("config: %v", render.Render(flags))
Collaborator Author

Here is what this may look like:

I0930 12:07:36.979644 1 main.go:166] config: (*main.Flags){kubeClientConfig:flags.KubeClientConfig{KubeConfig:"", KubeAPIQPS:5, KubeAPIBurst:10}, loggingConfig:(*flags.LoggingConfig){config:(*v1.LoggingConfiguration){Format:"text", FlushFrequency:v1.TimeOrMetaDuration{Duration:v1.Duration{Duration:time.Duration(5000000000)}, SerializeAsString:true}, Verbosity:v1.VerbosityLevel(1), VModule:v1.VModuleConfiguration(nil), Options:v1.FormatOptions{Text:v1.TextOptions{OutputRoutingOptions:v1.OutputRoutingOptions{SplitStream:false, InfoBufferSize:resource.QuantityValue{Quantity:resource.Quantity{i:resource.int64Amount{value:0, scale:resource.Scale(0)}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:resource.Format("DecimalSI")}}}}, JSON:v1.JSONOptions{OutputRoutingOptions:v1.OutputRoutingOptions{SplitStream:false, InfoBufferSize:resource.QuantityValue{Quantity:resource.Quantity{i:resource.int64Amount{value:0, scale:resource.Scale(0)}, d:resource.infDecAmount{Dec:(*inf.Dec)(nil)}, s:"0", Format:resource.Format("DecimalSI")}}}}}}}, featureGateConfig:(*flags.FeatureGateConfig){}, nodeName:"gb-nvl-043-compute06", namespace:"nvidia-dra-driver-gpu", cdiRoot:"/var/run/cdi", containerDriverRoot:"/driver-root", hostDriverRoot:"/run/nvidia/driver", nvidiaCDIHookPath:"", kubeletRegistrarDirectoryPath:"/var/lib/kubelet/plugins_registry", kubeletPluginsDirectoryPath:"/var/lib/kubelet/plugins", healthcheckPort:51515}

Looks thick, but after all it's a condensed version of details that matter, such as

Verbosity:v1.VerbosityLevel(1)
healthcheckPort:51515
featureGateConfig:(*flags.FeatureGateConfig){}

Even the KubeClientConfig may be relevant at times.

I'm not married to this though, just tried to find a pragmatic way and I can see that go-render alone is a discussion point.

Signed-off-by: Dr. Jan-Philip Gehrcke <[email protected]>
@jgehrcke jgehrcke moved this from Backlog to In Progress in Planning Board: k8s-dra-driver-gpu Sep 30, 2025
@jgehrcke jgehrcke added this to the v25.8.0 milestone Sep 30, 2025
@jgehrcke jgehrcke added usability issue/pr related to UX debuggability issue/pr related to the ability to debug the system labels Sep 30, 2025
Labels
debuggability issue/pr related to the ability to debug the system usability issue/pr related to UX
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Allow verbosity of kubelet plugins and controller to be set via helm
1 participant