
Conversation

TobyTheHutt
Contributor

@TobyTheHutt TobyTheHutt commented Sep 18, 2025

What does it do?

This PR introduces changes to cover the scheduling contract for events.

  • Add per-source tracking of when node data is required
  • Add lazy node handler registration (see the sketch after this list)
  • Introduce node-address diffing handler to prevent heartbeat-only trigger
  • Adjust service informer tests for new detection path and explicit node-event opt-in
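
Roughly, the per-source tracking and lazy registration described above work as in the sketch below (illustrative only; field and method names such as nodeEventsNeeded and markNodeEventsNeeded follow this PR's description but may differ from the actual code):

package source

import (
	coreinformers "k8s.io/client-go/informers/core/v1"
	"k8s.io/client-go/tools/cache"
)

// Illustrative sketch: the node event handler is registered lazily, the first
// time a node-dependent feature is detected, instead of unconditionally.
type serviceSource struct {
	nodeInformer     coreinformers.NodeInformer
	handler          func()
	nodeEventsNeeded bool // per-source flag: node data is required
	nodeHandlerAdded bool // guards against double registration
}

func (sc *serviceSource) markNodeEventsNeeded() {
	sc.nodeEventsNeeded = true
	if sc.nodeHandlerAdded {
		return
	}
	sc.nodeHandlerAdded = true
	_, _ = sc.nodeInformer.Informer().AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) { sc.handler() },
	})
}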

Partial fix for Ticket: #5796

What is NOT implemented here is proper controller throttling so that events are scheduled at max(lastRun+interval, now+minSync). To avoid mixing concerns, this will be added in a separate PR.
As a direct consequence, TestShouldRunOnce has also not been changed to cover a recommended, stricter scheduling contract.
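
For clarity, the contract the follow-up PR is meant to implement can be expressed as a small helper (a minimal sketch; nextAllowedRun is a hypothetical name, not existing controller code):

package main

import (
	"fmt"
	"time"
)

// nextAllowedRun illustrates the intended contract: an event-triggered
// reconcile runs no earlier than the regular interval after the last run,
// and no sooner than minSync from now.
func nextAllowedRun(lastRun, now time.Time, interval, minSync time.Duration) time.Time {
	byInterval := lastRun.Add(interval)
	byMinSync := now.Add(minSync)
	if byInterval.After(byMinSync) {
		return byInterval
	}
	return byMinSync
}

func main() {
	now := time.Now()
	lastRun := now.Add(-2 * time.Minute)
	// With --interval=10m and a 30s minimum event sync, an event arriving now
	// would be scheduled about 8 minutes from now, not immediately.
	fmt.Println(nextAllowedRun(lastRun, now, 10*time.Minute, 30*time.Second))
}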

Motivation

With 0.19.0 the service source started wiring the --events handler to the shared node informer. Since the default service type filter is empty, every node is observed. This means the handler now fires on every node status heartbeat.

The motivation is to re-enforce the scheduler contract and ensure predictable behaviour for event schedule intervals.

More

  • Yes, this PR title follows Conventional Commits
  • Yes, I added unit tests
  • Yes, I updated end user documentation accordingly

This PR was tested in a virtual environment. For brevity, I'll post the setup and results in a comment.

* Update controller throttle so events schedule at `max(lastRun+interval,
  now+minSync)`
* Rework `TestShouldRunOnce` flow to cover the stricter scheduling
  contract
* Add per-source tracking of when node data is required
* Add lazy node handler registration
* Introduce node-address diffing handler to prevent heartbeat-only
  trigger
* Adjust service informer tests for new detection path and explicit
  node-event opt-in

Signed-off-by: Tobias Harnickell <[email protected]>
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign ivankatliarchuk for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the controller Issues or PRs related to the controller label Sep 18, 2025
@k8s-ci-robot k8s-ci-robot requested a review from szuecs September 18, 2025 08:26
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. source needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2025
@k8s-ci-robot
Contributor

Hi @TobyTheHutt. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Sep 18, 2025
@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 18, 2025

As mentioned in the PR message, here's the test setup and the tests.

The tests basically:

  • Install a basic externalDNS setup
    • Add --events
    • Add --interval=10m
  • Start a simple loop to create node churn
  • Watch the logs

Setup:

Prepare the cluster:

kubectl create ns dns
helm repo add bitnami https://charts.bitnami.com/bitnami
helm -n dns install etcd bitnami/etcd --set auth.rbac.create=false

Create file coredns-values.yaml:

isClusterService: false
serviceType: ClusterIP
servers:
  - zones:
      - zone: example.test.
    port: 53
    plugins:
      - name: errors
      - name: log
      - name: etcd
        parameters: example.test
        configBlock: |
          stubzones
          path /skydns
          endpoint http://etcd.dns.svc.cluster.local:2379
      - name: forward
        parameters: . /etc/resolv.conf
      - name: cache
        parameters: 30
      - name: health
      - name: ready

Create coredns instance:
helm -n dns install coredns-ext coredns/coredns -f coredns-values.yaml

Create & deploy extdns.yaml:

apiVersion: v1
kind: ServiceAccount
metadata:
  name: external-dns
  namespace: dns
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: external-dns
rules:
- apiGroups: [""]
  resources: ["services","endpoints","pods","nodes"]
  verbs: ["get","watch","list"]
- apiGroups: ["networking.k8s.io","discovery.k8s.io","extensions"]
  resources: ["ingresses","endpointslices"]
  verbs: ["get","watch","list"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: external-dns-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: external-dns
subjects:
- kind: ServiceAccount
  name: external-dns
  namespace: dns
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: external-dns
  namespace: dns
spec:
  strategy:
    type: Recreate
  selector:
    matchLabels:
      app: external-dns
  template:
    metadata:
      labels:
        app: external-dns
    spec:
      serviceAccountName: external-dns
      containers:
      - name: external-dns
        image: extdns:dev
        imagePullPolicy: Never
        args:
          - --provider=coredns
          - --source=ingress
          - --source=service
          - --interval=10m
          - --events
          - --domain-filter=example.test
          - --log-level=debug
        env:
        - name: ETCD_URLS
          value: http://etcd.dns.svc.cluster.local:2379

Create and deploy a noise-free service:

apiVersion: v1
kind: Service
metadata:
  name: hello
  annotations:
    external-dns.alpha.kubernetes.io/hostname: hello.example.test
spec:
  type: ClusterIP
  selector:
    app: demo
  ports:
  - port: 80
    targetPort: 8080
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo
  template:
    metadata:
      labels:
        app: demo
    spec:
      containers:
      - name: web
        image: nginx:alpine
        ports:
        - containerPort: 8080

Generate some churn:

while true; do
  for n in $(kubectl get nodes -o name); do
    kubectl label $n edns-churn=$(date +%s%N) --overwrite >/dev/null 2>&1 || true
  done
  sleep 2
done

Watch logs:

kubectl logs -n dns deploy/external-dns -f

Result:

Logs:
(screenshot of external-dns log output, 2025-09-18)

@mloiseleur mloiseleur changed the title fix(controller): Correct event schedule interval fix(controller): respect event schedule interval Sep 18, 2025
@mloiseleur mloiseleur changed the title fix(controller): respect event schedule interval fix(controller): follow event schedule interval Sep 18, 2025
@ivankatliarchuk
Contributor

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 18, 2025
@ivankatliarchuk
Contributor

From a first look, the solution in the service.go file looks overcomplicated. Maybe we could just remove

if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) {
		_, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
	}

Basically, there is a _, _ = sc.serviceInformer.Informer().AddEventHandler(eventHandlerFunc(handler)), which will eventually capture all the services created, updated, or deleted. The line

_, _ = nodeInformer.Informer().AddEventHandler(informers.DefaultEventHandler())
should do the magic, resyncing all the nodes if/when required.

I'm not sure the changes in controller.go are directly related, so they could be added in a separate refactoring PR.

@TobyTheHutt
Contributor Author

TobyTheHutt commented Sep 18, 2025

I agree that it's a large file, though I did not take the time to evaluate what is overall "clutter" or unnecessary complexity. As for the relevant lines:

services
As far as I understand it, keeping only the referenced DefaultEventHandler() on the node informer would not trigger reconciles on node address changes. This is why I created the nodeAddressChangeHandler.
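
For illustration, such a handler could look roughly like the following (a reconstruction for discussion, not the exact code in this PR; it compares only the node address lists, so heartbeat-only status updates are ignored):

package source

import (
	"reflect"

	v1 "k8s.io/api/core/v1"
	"k8s.io/client-go/tools/cache"
)

// nodeAddressChangeHandler (illustrative reconstruction): fire the reconcile
// trigger only when a node's addresses actually change.
func nodeAddressChangeHandler(trigger func()) cache.ResourceEventHandlerFuncs {
	return cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldNode, okOld := oldObj.(*v1.Node)
			newNode, okNew := newObj.(*v1.Node)
			if !okOld || !okNew {
				return
			}
			// Conditions, heartbeat timestamps and other status fields are
			// deliberately not compared.
			if !reflect.DeepEqual(oldNode.Status.Addresses, newNode.Status.Addresses) {
				trigger()
			}
		},
	}
}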

What I could do, though, is simplify the code by dropping detectNodeDependentServices and just registering on the first markNodeEventsNeeded(). But this would mean that the first run is delayed by one cycle.

controller
The goal of my controller changes was to tighten the scheduling, so a run does not happen before max(lastRun+Interval, now+MinEventSyncInterval). In the end, it's supposed to reduce unnecessary event-triggered reconciles so the controller's throttle does not get overridden by noisy node updates.
Without the changes in the controller, we wouldn't filter node events as strictly, which could lead to frequent batches despite the flags supplied.

Let me know what you think, or if I misunderstood anything from your feedback.

UPDATE: I gave the feedback some more thought, especially regarding the bulk of my changes. I think I can still drop lazy detection & flags, and unconditionally register a diffing node handler for when a feature depends on node data. I'll push a new commit for that, along with new test results.

UPDATE2: I just understood that with your last comment, you likely wanted to bring up the mix of concerns I added in this one PR. I agree and will split this PR so the commit history in the repo remains clean.

Reverts controller changes from ccd3cb9 to separate scheduling fix
from service logic updates.
Introduced new service logic:
* Lazy register node informer only when needed
* Add node address change handler with minimal comparisons
* Support detection of node-dependent services before enabling node
  events
* Update service tests

Signed-off-by: Tobias Harnickell <[email protected]>
Signed-off-by: Tobias Harnickell <[email protected]>
@TobyTheHutt TobyTheHutt changed the title fix(controller): follow event schedule interval fix(service): follow event schedule interval Sep 19, 2025
@TobyTheHutt
Contributor Author

New test results show the behaviour as expected. Generally, the interval is respected, but it is overridden by:

  • Headless service events
  • NodePort service creation/deletion
  • Endpoint slice churn
  • Node address changes

Also, this does not currently measure any min-vs-interval contradiction under genuine event pressure.

Interval in silent cluster (no service or ingress changes):

(screenshot of log output)

Interval in noisy cluster (see script below):

(screenshot of log output)

Churn script:

SVC=test-extdns
kubectl create service clusterip $SVC --tcp=80:80 2>/dev/null || true
while true; do
  kubectl annotate service $SVC external-dns.alpha.kubernetes.io/ttl=$(shuf -i 30-300 -n1) --overwrite
  sleep 2
done

@ivankatliarchuk
Contributor

I know there is effort behind this solution. At the same time, I'm on the fence.

The bug/behavior change was introduced unintentionally here https://github.com/kubernetes-sigs/external-dns/pull/5613/files#diff-cf68f602fa7c20e5341f3b83054df68ade1586a144b1eae5347e0ac47096d3aa, i.e. an event handler was added without any explicit requirement. Usually we split refactoring and behaviour changes, but as it triggers reconciles more often than required, it is probably not worth the benefits.

Pros/cons if we simply remove _, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler)) from AddEventHandler:

Pros

  • Bug fixed

Cons

  • NodePort services will rely on the default interval or the service informer.
  • NodePort changes become less responsive, as in versions <= 0.18.

What is questioned in the current implementation

  • The --interval=10m logic does not work as it should, because Nodes are updated quite regularly and the --events flag is present. The very same would happen if Services or EndpointSlices were updated quite often.
  • Example: if every node in the cluster registers with the NodePort service, the problem won’t be resolved. It would be quite noisy, and we can’t assume this case is rare or impossible.
  • If the node source is enabled, the reconciler will be triggered on every node change regardless.

The other solution is to check whether the node source is enabled:

if sc.serviceTypeFilter.isRequired(v1.ServiceTypeNodePort) && sc.isNodeSourceEnabled {
		_, _ = sc.nodeInformer.Informer().AddEventHandler(eventHandlerFunc(handler))
	}

From a high-level view, there is nothing wrong, as the external-dns arguments are:

--interval=10m
--events

In theory they are mutually exclusive: you cannot have 10 minutes between reconciliations and also rely on Kubernetes events, as described here https://github.com/kubernetes-sigs/external-dns/blob/master/docs/advanced/rate-limits.md#introduction

So let's wait for other reviewers. If we decide to add this fix to the codebase, it's most likely worth moving the changes to the source/handlers folder.
