
Conversation

TobyTheHutt
Contributor

@TobyTheHutt TobyTheHutt commented Sep 18, 2025

What does it do?

This PR introduces changes to cover the scheduling contract for events.

  • Add per-source tracking of when node data is required
  • Add lazy node handler registration (see the sketch after this list)
  • Introduce node-address diffing handler to prevent heartbeat-only trigger
  • Adjust service informer tests for new detection path and explicit node-event opt-in
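
Roughly, the per-source tracking and lazy registration described above work as in the sketch below (illustrative only; field and method names such as nodeEventsNeeded and markNodeEventsNeeded follow this PR's description but may differ from the actual code):

package source

import (
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// Illustrative sketch: the node event handler is registered lazily, the first
// time a node-dependent feature is detected, instead of unconditionally.
type serviceSource struct {
	nodeInformer     coreinformers.NodeInformer
	handler          func()
	nodeEventsNeeded bool // per-source flag: node data is required
	nodeHandlerAdded bool // guards against double registration
}

func (sc *serviceSource) markNodeEventsNeeded() {
	sc.nodeEventsNeeded = true
	if sc.nodeHandlerAdded {
		return
	}
	sc.nodeHandlerAdded = true
	_, _ = sc.nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) { sc.handler() },
	})
}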

Partial fix for Ticket: #5796

What is NOT implemented here is proper controller throttling so that events are scheduled at max(lastRun+interval, now+minSync). To avoid mixing concerns, this will be added in a separate PR.
As a direct consequence, TestShouldRunOnce has also not been changed to cover a recommended, stricter scheduling contract.
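
For clarity, the contract the follow-up PR is meant to implement can be expressed as a small helper (a minimal sketch; nextAllowedRun is a hypothetical name, not existing controller code):

package main

import (
	"fmt"
	"time"
)

// nextAllowedRun illustrates the intended contract: an event-triggered
// reconcile runs no earlier than the regular interval after the last run,
// and no sooner than minSync from now.
func nextAllowedRun(lastRun, now time.Time, interval, minSync time.Duration) time.Time {
	byInterval := lastRun.Add(interval)
	byMinSync := now.Add(minSync)
	if byInterval.After(byMinSync) {
		return byInterval
	}
	return byMinSync
}

func main() {
	now := time.Now()
	lastRun := now.Add(-2 * time.Minute)
	// With --interval=10m and a 30s minimum event sync, an event arriving now
	// would be scheduled about 8 minutes from now, not immediately.
	fmt.Println(nextAllowedRun(lastRun, now, 10*time.Minute, 30*time.Second))
}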

Motivation

With 0.19.0 the service source started wiring the --events handler to the shared node informer. Since the default service type filter is empty, every node is observed. This means the handler now fires on every node status heartbeat.

The motivation is to re-enforce the scheduler contract and ensure predictable behaviour for event schedule intervals.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

This PR was tested in a virtual environment. For brevity, I'll post the setup and results in a comment.

* Update controller throttle so events schedule at `max(lastRun+interval,
  now+minSync)`
* Rework `TestShouldRunOnce` flow to cover the stricter scheduling
  contract
* Add per-source tracking of when node data is required
* Add lazy node handler registration
* Introduce node-address diffing handler to prevent heartbeat-only
  trigger
* Adjust service informer tests for new detection path and explicit
  node-event opt-in

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Sep 18, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 18, 2025 08:26
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. source needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2025
@k8s-ci-robot
Contributor

Hi @TobyTheHutt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 18, 2025
@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 18, 2025

As mentioned in the PR message, here's the test setup and the tests.

The tests basically:

  • Install a basic externalDNS setup
    • Add --events
    • Add --interval=10m
  • Start a simple loop to create node churn
  • Watch the logs

Setup:

Prepare the cluster:

kubectl create ns dns
helm repo add bitnami https://charts.bitnami.com/bitnami
helm -n dns install etcd bitnami/etcd --set auth.rbac.create=false

Create file coredns-values.yaml:

isClusterService: false
serviceType: ClusterIP
servers:
  - zones:
      - zone: example.test.
    port: 53
    plugins:
      - name: errors
      - name: log
      - name: etcd
        parameters: example.test
        configBlock: |
          stubzones
          path /skydns
          endpoint http://etcd.dns.svc.cluster.local:2379
      - name: forward
        parameters: . /etc/resolv.conf
      - name: cache
        parameters: 30
      - name: health
      - name: ready

Create coredns instance:
helm -n dns install coredns-ext coredns/coredns -f coredns-values.yaml

Create & deploy extdns.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services","endpoints","pods","nodes"]
  verbs: ["get","watch","list"]
- apiGroups: ["networking.k8s.io","discovery.k8s.io","extensions"]
  resources: ["ingresses","endpointslices"]
  verbs: ["get","watch","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: extdns:dev
        imagePullPolicy: Never
        args:
          - --provider=coredns
          - --source=ingress
          - --source=service
          - --interval=10m
          - --events
          - --domain-filter=example.test
          - --log-level=debug
        env:
        - name: ETCD_URLS
          value: http://etcd.dns.svc.cluster.local:2379

Create and deploy a noise-free service:

apiVersion: v1
kind: Service
metadata:
  name: hello
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello.example.test
spec:
  type: ClusterIP
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: web
        image: nginx:alpine
        ports:
        - containerPort: 8080

Generate some churn:

while true; do
  for n in $(kubectl get nodes -o name); do
    kubectl label $n edns-churn=$(date +%s%N) --overwrite >/dev/null 2>&1 || true
  done
  sleep 2
done

Watch logs:

kubectl logs -n dns deploy/external-dns -f

Result:

Logs:
(screenshot of external-dns log output, 2025-09-18)

@mloiseleur mloiseleur changed the title fix(controller): Correct event schedule interval fix(controller): respect event schedule interval Sep 18, 2025
@mloiseleur mloiseleur changed the title fix(controller): respect event schedule interval fix(controller): follow event schedule interval Sep 18, 2025
@ivankatliarchuk
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2025
@ivankatliarchuk
Contributor

From a first look, the solution in the service.go file looks overcomplicated. Maybe we could just remove

if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) {
		_, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
	}

Basically, there is a _, _ = sc.serviceInformer.Informer().AddEventHandler(eventHandlerFunc(handler)), which will eventually capture all the services created, updated, or deleted. The line

_, _ = nodeInformer.Informer().AddEventHandler(informers.DefaultEventHandler())
should do the magic, resyncing all the nodes if/when required.

I'm not sure the changes in controller.go are directly related, so they could be added in a separate refactoring PR.

@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 18, 2025

I agree that it's a large file, though I did not take the time to evaluate what is overall "clutter" or unnecessary complexity. As for the relevant lines:

services
As far as I understand it, keeping only the referenced DefaultEventHandler() on the node informer would not trigger reconciles on node address changes. This is why I created the nodeAddressChangeHandler.
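
For illustration, such a handler could look roughly like the following (a reconstruction for discussion, not the exact code in this PR; it compares only the node address lists, so heartbeat-only status updates are ignored):

package source

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// nodeAddressChangeHandler (illustrative reconstruction): fire the reconcile
// trigger only when a node's addresses actually change.
func nodeAddressChangeHandler(trigger func()) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, okOld := oldObj.(*v1.Node)
			newNode, okNew := newObj.(*v1.Node)
			if !okOld || !okNew {
				return
			}
			// Conditions, heartbeat timestamps and other status fields are
			// deliberately not compared.
			if !reflect.DeepEqual(oldNode.Status.Addresses, newNode.Status.Addresses) {
				trigger()
			}
		},
	}
}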

What I could do, though, is simplify the code by dropping detectNodeDependentServices and just registering on the first markNodeEventsNeeded(). But this would mean that the first run is delayed by one cycle.

controller
The goal of my controller changes was to tighten the scheduling, so a run does not happen before max(lastRun+Interval, now+MinEventSyncInterval). In the end, it's supposed to reduce unnecessary event-triggered reconciles so the controller's throttle does not get overridden by noisy node updates.
Without the changes in the controller, we wouldn't filter node events as strictly, which could lead to frequent batches despite the flags supplied.

Let me know what you think, or if I misunderstood anything from your feedback.

UPDATE: I gave the feedback some more thought, especially regarding the bulk of my changes. I think I can still drop lazy detection & flags, and unconditionally register a diffing node handler for when a feature depends on node data. I'll push a new commit for that, along with new test results.

UPDATE2: I just understood that with your last comment, you likely wanted to bring up the mix of concerns I added in this one PR. I agree and will split this PR so the commit history in the repo remains clean.

Reverts controller changes from ccd3cb9 to separate scheduling fix
from service logic updates.
Introduced new service logic:
* Lazy register node informer only when needed
* Add node address change handler with minimal comparisons
* Support detection of node-dependent services before enabling node
  events
* Update service tests

Signed-off-by: Tobias Harnickell <[email protected]>
Signed-off-by: Tobias Harnickell <[email protected]>
@TobyTheHutt TobyTheHutt changed the title fix(controller): follow event schedule interval fix(service): follow event schedule interval Sep 19, 2025
@TobyTheHutt
Contributor Author

New test results show the behaviour as expected. Generally, the interval is respected, but it is overridden by:

  • Headless service events
  • NodePort service creation/deletion
  • Endpoint slice churn
  • Node address changes

Also, this does not currently measure any min-vs-interval contradiction under genuine event pressure.

Interval in silent cluster (no service or ingress changes):

(screenshot of log output)

Interval in noisy cluster (see script below):

(screenshot of log output)

Churn script:

SVC=test-extdns
kubectl create service clusterip $SVC --tcp=80:80 2>/dev/null || true
while true; do
  kubectl annotate service $SVC external-dns.alpha.kubernetes.io/ttl=$(shuf -i 30-300 -n1) --overwrite
  sleep 2
done

@ivankatliarchuk
Contributor

I know there is effort behind this solution. At the same time, I'm on the fence.

The bug/behavior change was introduced unintentionally here https://github.com/kubernetes-sigs/external-dns/pull/5613/files#diff-cf68f602fa7c20e5341f3b83054df68ade1586a144b1eae5347e0ac47096d3aa, i.e. an event handler was added without any explicit requirement. Usually we split refactoring and behaviour changes, but as it triggers reconciles more often than required, it is probably not worth the benefits.

Pros/cons if we simply remove _, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler)) from AddEventHandler:

Pros

  • Bug fixed

Cons

  • NodePort services will rely on the default interval or the service informer.
  • NodePort changes become less responsive, as in versions <= 0.18.

What is questioned in the current implementation

  • The --interval=10m logic does not work as it should, because Nodes are updated quite regularly and the --events flag is present. The very same would happen if Services or EndpointSlices were updated quite often.
  • Example: if every node in the cluster registers with the NodePort service, the problem won’t be resolved. It would be quite noisy, and we can’t assume this case is rare or impossible.
  • If the node source is enabled, the reconciler will be triggered on every node change regardless.

The other solution is to check whether the node source is enabled:

if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) && sc.isNodeSourceEnabled {
		_, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
	}

From a high-level view, there is nothing wrong, as the external-dns arguments are:

--interval=10m
--events

In theory they are mutually exclusive: you cannot have 10 minutes between reconciliations and also rely on Kubernetes events, as described here https://github.com/kubernetes-sigs/external-dns/blob/master/docs/advanced/rate-limits.md#introduction

So let's wait for other reviewers. If we decide to add this fix to the codebase, it's most likely worth moving the changes to the source/handlers folder.
